Tell HN: GpuOwl/PRPLL, GPU software used to find the largest prime number

70 points by mpreda 2 days ago

Hi, I'm Mihai Preda, the author of GpuOwl/PRPLL [1], OpenCL software used by Luke Durant for his recent discovery of the largest known prime number, the 52nd Mersenne prime 2^136279841 - 1 [2].

Feel free to ask questions about technical aspects of the GpuOwl implementation, about optimizations, tricks, efficient FFT implementation on GPUs etc. Or anything else.

[1] GpuOwl: https://github.com/preda/gpuowl

[2] GIMPS: https://www.mersenne.org/

oxxoxoxooo a day ago

Hi! I also have a few questions:

1. I guess the most time-consuming part is the multiplication, right? What kind of FFT do you use? Schönhage-Strassen, multi-prime NTT, ..? Is it implemented via floating-point numbers or integers?

2. Not sure if you encountered this, but do you have any advice for small mulmod (multiplication reduced by a prime modulus)? By small I mean machine-word size (i.e. preferably 64 bits).

3. For a larger modulus, what do you use? Is it worth precomputing the inverse by, say, Newton iteration, or is it faster to use asymptotically slower algorithms? Do you use Montgomery representation?

4. Does the code use any kind of GCD? What algorithm did you choose?

5. Now this is a bit of a broad question, but could you perhaps compare traditional algorithms implemented sequentially (e.g. GMP) with algorithms suitable for running on GPUs? I mean, does it make sense to use, say, a quadratic algorithm amenable to parallel execution rather than an asymptotically faster (but sequential) algorithm?

  • mpreda a day ago

    3. Answered in point 1: we use IBDWT and get the modular reduction for free through the circular convolution. This works nicely for a Mersenne modulus.

    4. GCD is not used in PRP, but it is used in P-1 (Pollard's P-1 algorithm). We use GMP's GCD on the CPU (as it's a very rare operation, and GMP on the CPU is fast enough). I understand the GCD as implemented in GMP is subquadratic, which is good.
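
    For reference, a minimal sketch of such a rare CPU-side GCD via GMP's mpz_gcd (my illustration here, not GpuOwl's actual code):

       // Illustrative only: a CPU-side GCD with GMP, of the kind used rarely
       // in P-1 factoring. Build with: g++ gcd.cpp -lgmp
       #include <gmp.h>
       #include <cstdio>

       int main() {
           mpz_t a, b, g;
           mpz_init_set_str(a, "2305843009213693951", 10);          // 2^61 - 1
           mpz_init_set_str(b, "618970019642690137449562111", 10);  // 2^89 - 1
           mpz_init(g);
           mpz_gcd(g, a, b);              // subquadratic GCD inside GMP
           gmp_printf("gcd = %Zd\n", g);  // prints 1: gcd(2^a-1, 2^b-1) == 2^gcd(a,b)-1
           mpz_clears(a, b, g, NULL);
           return 0;
       }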

    5. At our sizes it does not make sense to use a quadratic algorithm instead of an N·log N one; we absolutely need the N·log N provided by convolution/FFT.
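
    (Rough arithmetic for scale, my back-of-envelope: at N ≈ 7.5M words, a quadratic multiplication costs N^2 ≈ 6e13 word-operations versus N·log2(N) ≈ 1.8e8 for the FFT approach, a factor of roughly 300'000; far more than any amount of extra parallelism could recover.)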

  • mpreda a day ago

    2. This was discussed at some length over the years on mersenneforum.org [1]. There is a lot of wisdom stored there, though it can be hard to find, and many smart & helpful people, so feel free to ask there. This is an operation of interest because it's:

       - involved in TF (Trial Factoring), including on GPUs
       - involved in NTT transforms ("integer FFTs")
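
    Not an answer from the forum, but for illustration: the standard portable trick at machine-word size is to widen to 128 bits, and for a Mersenne modulus the division disappears entirely. A hedged sketch (GCC/Clang unsigned __int128; function names are mine):

       // Hedged sketch: portable 64-bit mulmod via a 128-bit intermediate.
       // Assumes a, b already reduced, i.e. a, b < m.
       #include <cstdint>

       uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t m) {
           return (uint64_t)((unsigned __int128)a * b % m);
       }

       // For a Mersenne modulus m = 2^p - 1 (p <= 63) the division vanishes:
       // since 2^p == 1 (mod m), fold the high p bits back onto the low ones.
       uint64_t mulmodMersenne(uint64_t a, uint64_t b, unsigned p) {
           const uint64_t m = (uint64_t(1) << p) - 1;
           unsigned __int128 t = (unsigned __int128)a * b;
           uint64_t r = (uint64_t)(t & m) + (uint64_t)(t >> p);
           if (r >= m) r -= m;   // one conditional subtraction suffices
           return r;
       }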
    
    [1] http://mersenneforum.org/
  • mpreda a day ago

    1. Yes, the core of the algorithm is the modular squaring. The squaring is similar to a multiplication, of course. In general, fast multiplication is implemented via convolution, computed with FFTs, which gives the multiplication an N·log(N) time complexity.

    What we do is modular squaring iterations:

    x := x^2 mod M,

    where M == 2^p - 1, i.e. M is the Mersenne number that we test.

    Note that working modulo 2^p - 1 means that 2^p == 1, which corresponds to a circular convolution of size p bits. We use the "Irrational Base Discrete Weighted Transform" (IBDWT) [1], introduced by Crandall/Fagin, to turn this into a convolution of a convenient size of N "words", where each word holds about 18 bits, so N ≈ p/18. For example our prime of interest, M52, was tested with an FFT of size 7.5M == 1024 * 15 * 512.
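
    To make the variable-base idea concrete, here is a small sketch (my illustration, not GpuOwl source) of how the p bits are distributed over N words:

       // Hedged sketch: the Crandall/Fagin variable word sizes. Word j holds
       // ceil(p*(j+1)/N) - ceil(p*j/N) bits, so sizes are floor(p/N) or
       // ceil(p/N) bits and sum to exactly p.
       #include <cstdint>
       #include <cstdio>

       int main() {
           const uint64_t p = 136279841;  // exponent of M52
           const uint64_t N = 7864320;    // FFT size 7.5M == 1024 * 15 * 512
           uint64_t small = 0, big = 0;
           for (uint64_t j = 0; j < N; ++j) {
               uint64_t bits = (p * (j + 1) + N - 1) / N - (p * j + N - 1) / N;
               if (bits == p / N) ++small; else ++big;
           }
           printf("%llu words of %llu bits, %llu words of %llu bits\n",
                  (unsigned long long)small, (unsigned long long)(p / N),
                  (unsigned long long)big, (unsigned long long)(p / N + 1));
           return 0;
       }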

    The FFT is a double-precision (FP64) floating-point FFT. Depending on the FFT size, we can make use of about 18 bits per FFT "word", where a "word" corresponds to one FP64 value.

    Two tricks are involved up to this point: IBDWT provides one halving of the FFT size plus the modular reduction for free; and packing the real input/output values into complex numbers halves the FFT size again.

    The FFT implementation that we found appropriate for GPUs is the "matrix FFT", which splits an FFT of size N = A*B into sub-FFTs of size A, one element-wise multiplication by about A*B twiddle factors, and sub-FFTs of size B. In practice we split the FFT into three dimensions, e.g. for M52 we used: 7.5M == 1024 * 15 * 512.
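
    Here is a hedged sketch of that decomposition in plain C++ (naive O(n^2) sub-DFTs for readability; names are mine, and the real kernels are of course far more elaborate):

       // Hedged sketch (my illustration): the "matrix" / four-step FFT
       // structure, with naive sub-DFTs standing in for the fast ones.
       #include <complex>
       #include <vector>
       #include <cmath>

       using cplx = std::complex<double>;
       static const double PI = std::acos(-1.0);

       // Naive DFT over n elements of v, starting at off, with stride st.
       void dft(std::vector<cplx>& v, size_t off, size_t st, size_t n) {
           std::vector<cplx> out(n);
           for (size_t k = 0; k < n; ++k)
               for (size_t j = 0; j < n; ++j)
                   out[k] += v[off + st * j]
                             * std::polar(1.0, -2 * PI * double(j * k) / double(n));
           for (size_t k = 0; k < n; ++k) v[off + st * k] = out[k];
       }

       // FFT of size N = A*B: B strided sub-FFTs of size A, one pass of
       // twiddle factors exp(-2*pi*i*a*b/N), then A contiguous sub-FFTs of
       // size B. The result comes out in transposed order, v[k1*B + k2] ==
       // X[k1 + A*k2]; real implementations fold the reordering into kernels.
       void fftAB(std::vector<cplx>& v, size_t A, size_t B) {
           const size_t N = A * B;
           for (size_t b = 0; b < B; ++b) dft(v, b, B, A);      // "columns"
           for (size_t a = 0; a < A; ++a)                        // twiddles
               for (size_t b = 0; b < B; ++b)
                   v[a * B + b] *= std::polar(1.0, -2 * PI * double(a * b) / double(N));
           for (size_t a = 0; a < A; ++a) dft(v, a * B, 1, B);  // "rows"
       }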

    Within one workgroup we implement one FFT of size 1024 or 512. These are usually radix-4 FFTs, with transpositions using LDS (Local Data Share, the per-workgroup local memory in OpenCL).

    The convolution is formed of:

       - forward FFT
       - element-wise multiplication
       - inverse FFT
    
    
    After the inverse FFT we also need to do carry propagation, which properly turns the convolution into a multi-word multiplication.
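
    A minimal sketch of that carry step (my names and simplifications, not the actual kernel, which differs in many details):

       // Hedged sketch: carry propagation after the inverse FFT. Convolution
       // outputs are rounded to integers; the part of each word above its
       // IBDWT size is carried onward, and the carry out of the top word
       // wraps around to word 0, because 2^p == 1 (mod 2^p - 1).
       #include <cstdint>
       #include <cmath>
       #include <vector>

       void carryPropagate(const std::vector<double>& fftOut,
                           const std::vector<int>& wordBits,  // 17 or 18 here
                           std::vector<int64_t>& words) {
           const size_t N = fftOut.size();
           words.resize(N);
           int64_t carry = 0;
           for (size_t i = 0; i < N; ++i) {
               int64_t v = std::llround(fftOut[i]) + carry;  // rounding absorbs FP noise
               carry = v >> wordBits[i];                     // floor division by 2^bits
               words[i] = v - (carry << wordBits[i]);        // remainder in [0, 2^bits)
           }
           for (size_t i = 0; carry != 0; i = (i + 1) % N) { // wrap-around carry
               int64_t v = words[i] + carry;
               carry = v >> wordBits[i];
               words[i] = v - (carry << wordBits[i]);
           }
       }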

    For performance we merge a few logical kernels that are invoked in succession into a single big kernel, where possible. The main advantage of doing so is that the data no longer needs to transit through "global memory" (VRAM) but stays local to the workgroup, which is a large gain.

    So, to recap:

       - multiplication via convolution
       - convolution via FP64 FFT, achieving about 18bits per FP64 word
       - modular reduction for free through IBDWT
    
    
    [1] https://en.wikipedia.org/wiki/Irrational_base_discrete_weigh...
    • oxxoxoxooo a day ago

      Thank you very much for the answers, very informative!

      And congratulations on the discovery!

motorolnik 2 days ago

Hi, I've got a few questions:

1). What profiling tools do you use for GPU code?

2). Where would one start, in terms of learning resources, with coding in inline GPU assembly?

3). Do you check the GPU assembly generated by a compiler from C/C++ code for effectiveness? If so, which tools do you use for that?

4). Is SIMD on GPUs a thing?

5). What are the primary factors you take into account (cache sizes, micro-optimizations, etc.) when you write code for a tool like gpuowl/prpll? Which factor is the most important? Thanks!

  • mpreda 2 days ago

    1. My profiling is rudimentary but effective. I measure per-kernel execution time with OpenCL events (which register start/end times with high accuracy and practically no overhead), and I also continuously measure per-iteration time by dividing the wall-time of each block of 20'000 iterations by that number. These measurements are consistent and sensitive.
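
    For illustration, the event-based timing boils down to something like this (a hedged sketch, not PRPLL's actual profiler):

       // Hedged sketch: timing one kernel with OpenCL events. The queue must
       // be created with CL_QUEUE_PROFILING_ENABLE; timestamps come back as
       // device-clock nanoseconds.
       #include <CL/cl.h>

       double kernelMillis(cl_command_queue q, cl_kernel k, size_t globalSize) {
           cl_event ev;
           clEnqueueNDRangeKernel(q, k, 1, nullptr, &globalSize, nullptr,
                                  0, nullptr, &ev);
           clWaitForEvents(1, &ev);
           cl_ulong t0 = 0, t1 = 0;
           clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                   sizeof t0, &t0, nullptr);
           clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                   sizeof t1, &t1, nullptr);
           clReleaseEvent(ev);
           return double(t1 - t0) * 1e-6;  // ns -> ms
       }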

    2. I'm not aware of good learning resources. Explore existing such code, e.g. OpenCL miners tend to use asm. Read the amdgpu backend in LLVM. Disassemble code from OpenCL and read the ISA. Explore and experiment, but it's tedious. I would not recommend jumping into ISA initially. BTW, AMD does have good GCN ISA docs available online; those are useful!

    3. Yes I often read the compiled ISA, and over time I discover bugs and also better understand the ISA.

    4. OpenCL is SIMD, and yes it matches the GPU HW.

    5. Most important is to reduce the number of registers used (#VGPRs), as that heavily influences the occupancy of the kernel. Use fewer costly instructions such as FP64 mul/FMA. Prefer sequential memory access, and in general reduce global memory access, as it's very slow. Merge small kernels into one (keep the data in the kernel). Never spill VGPRs.

    • mpreda a day ago

      My answer above was typed on a mobile phone while travelling, so it was maybe exceedingly brief. Now, on a real keyboard, I can go into more detail on any point if there's interest.

  • motorolnik 2 days ago

    And another, more general question: (6) gcc, clang, and nvcc have some OpenMP offloading capabilities which allow compiling code into binaries that can then run on GPUs. Is the code they produce through OpenMP anywhere close to what one gets directly with e.g. OpenCL?

    • kopadudl a day ago

      Using OpenMP with the GPU may be fine depending on the problem, but you can't exploit the GPU's full potential. Parallelizing loops on the GPU may be sufficient, but when it is not, you have to dig deeper.

    • mpreda 2 days ago

      I don't know, I haven't explored OpenMP myself. Maybe some day.

luizfzs a day ago

Hello!

I came across this new paper, "Integer Partitions Detect the Primes" [1] from July 2024, but I don't have enough knowledge to even read it. I wonder if an implementation of this method would provide any speed benefits compared to PRP.

Great work!

[1] https://arxiv.org/pdf/2405.06451

dgacmu 2 days ago

First, congrats! Awesome work and appreciate you sharing more.

Second: I'm confused by something in your readme. It says:

> For Mersenne primes search, the PRP test is by far preferred over LL, such that LL is not used anymore for search.

But later notes that PRP is computationally nearly identical to LL. Was that sentence supposed to say TF and P-1 instead of PRP or am I misunderstanding something about the actual computational cost of PRP?

  • cbright 2 days ago

    The PRP test has the same computational cost as an LL test. The reason why GIMPS now prefers to do PRP tests instead of LL tests is because an efficiently verifiable proof-of-work certificate was developed for PRP tests [1].

    [1] https://doi.org/10.4230/LIPIcs.ITCS.2019.60

    • mpreda a day ago

      Yes. In fact the transition from LL to PRP took place in two steps, at different moments in time.

      We used to use the LL test because the LL result is a bit stronger than the PRP result: LL states that the number is prime, while PRP says only that it is probably prime. This is the reason LL is still run as an after-test following any successful PRP discovery, as happened for the most recent M52 as well.

      The first transition from LL to PRP happened because a very strong and cheap error-checking algorithm, which we call "the Gerbicz error check", was discovered by Robert Gerbicz. This error check, in its most efficient form, only works for PRP, not for LL. It allows verifying the correctness of the computation, as it progresses on the GPU, with high confidence and low overhead. It protects against a lot of HW errors, e.g. from the GPU VRAM overheating, the GPU having been under-volted too aggressively, or bad VRAM, but also against SW bugs and FFT precision issues.

      As the test of a single exponent takes a long time (let's say 24h on a fast GPU), having confidence that this long computation is proceeding correctly instead of wasting cycles is a great benefit of the error check.
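
      To make the invariant concrete, here is a toy sketch on a tiny modulus (my illustration, not GpuOwl code): keep a running product d of every B-th residue x; then d_new == d_old^(2^B) * x0 (mod M) must hold, and checking it catches a corrupted squaring with very high probability:

         // Toy Gerbicz check on M31 = 2^31 - 1; products of values below
         // 2^31 fit comfortably in uint64, so no big-number code is needed.
         #include <cstdint>
         #include <cstdio>

         const uint64_t M = (1ull << 31) - 1;

         uint64_t mulmod(uint64_t a, uint64_t b) { return a * b % M; }

         int main() {
             const int B = 1000;       // block size between checkpoints
             uint64_t x0 = 3, x = x0;  // PRP base 3
             uint64_t d = x0;          // running checkpoint product
             for (int block = 0; block < 5; ++block) {
                 uint64_t dOld = d;
                 for (int i = 0; i < B; ++i) x = mulmod(x, x);  // the real work
                 d = mulmod(d, x);                              // fold in checkpoint
                 uint64_t chk = dOld;                           // verify invariant:
                 for (int i = 0; i < B; ++i) chk = mulmod(chk, chk);
                 printf("block %d: %s\n", block,
                        mulmod(chk, x0) == d ? "OK" : "MISMATCH");
             }
             return 0;
         }

      In this toy form the verification costs as much as the work itself; the real protocol verifies only rarely, so the overhead stays very low.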

      The second step of the transition from LL to PRP happened when the PRP proof was introduced, following the ideas of the VDF (Verifiable Delay Function) article, which allowed verifying cheaply that a PRP test was indeed executed correctly. This eliminated the need for the Double Check (DC) that was standard procedure with the LL test, practically speeding the process up by 100%.

    • dgacmu 2 days ago

      Ah, that's interesting and makes sense. Thank you!

kristopolous a day ago

The binary representation of this number is over 16 MB of 1s. That's nuts.

  • mpreda a day ago

    What's nuts is how fast you can square such a number on a GPU!

    A number of 136M bits (136 megabits), using a 7'500'000-point FFT, can be squared and mod-reduced (modular reduction) in less than 1 ms (one millisecond) on consumer-priced (less than $500) GPUs.
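
    A back-of-envelope sanity check (rough numbers of mine, not a precise model): a complex FFT of size n costs about 5·n·log2(n) FP64 operations, so a transform of that size is very roughly 5 * 7.5e6 * 23 ≈ 0.9 GFLOP, and a squaring (forward transform, pointwise multiply, inverse transform, carries) is on the order of 2 GFLOP. At the ~3.4 TFLOPS FP64 of a Radeon VII-class card, that is about half a millisecond of pure arithmetic, consistent with the sub-millisecond figure; in practice memory bandwidth matters as much as raw FLOPS.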

    • kristopolous a day ago

      Really? What on earth... I was trying to guess the number as I was reading your post and I was thinking "a few seconds". I'll go try it later today.

amuresan 18 hours ago

Hi Mihai! This is impressive work!

Are you aware of any other computational maths problems where a sufficiently motivated amateur could make an improvement on the state of the art?

rigmonger 2 days ago

First of all, thank you for your work and congratulations on your achievements, both in the search for Mersenne primes and software development.

I am contributing to GIMPS with 2 Radeon Pro VII cards. I'm wondering what will happen when ROCm stops supporting these GPUs.

Do you have any plans to keep them working with GPUOwl/Prpll when they are no longer supported by ROCm?

  • mpreda a day ago

    If ROCm stops supporting the Radeon Pro VII, the first solution is to stay on the most recent ROCm that still supports them.

    Second, "does not support anymore" does not necessarily mean that it stops working on the old HW; it could mean that new features/extensions aren't implemented for the old HW anymore, and we may not care about those.

    Third, AMD contributes and integrates changes into upstream LLVM. This open-source work could be used by third parties (with significant effort, I assume) to continue support.

mpreda 2 days ago

Some topic ideas:

  - Why use OpenCL when implementing GPU software
  - Does it run on AMD or on Nvidia GPUs?
  - How does the primality test implemented in GpuOwl work?
  - How fast is it to test a Mersenne candidate?
  - Why use FFTs? How large are the FFTs?
  - What do you use for sin/cos?
  • latchkey a day ago

    It definitely runs on our AMD MI300x. But the documentation is pretty fragmented and requires a bunch of math knowledge that I don't have, so I'm not really sure how to run it. Just some proof of it working...

    https://x.com/HotAisle/status/1848780396609106359

    If someone can come up with a way to perf test this against an H100, hit me up! It seems like something that could make a fun competition given the use of OpenCL. =)

    • mpreda a day ago

      Point taken. I need to improve the documentation and make it easier to start with.

      There is a lot of documentation and many HowTos on the Mersenne Forums [1], where experienced users help newcomers, and that takes some of the load off me.

      [1] http://mersenneforum.org/

      • latchkey a day ago

        Thanks. Just some feedback. Ideally, I'd download something, compile it, and then be able to just run the binary and have it start up and do something. Right now, it does nothing. Better yet, the app could even connect to some central server to get work, and just start chugging along doing whatever it needs to do.

        Heading to the forums, it's a mess of what looks like decades of information. Tons of it outdated. Much of it is heavy in math terms I don't know anything about. I wish I did! I wish I was smarter! Following the "follow this first" advice leads to just a whole bunch of random information.

        If you're looking for people to throw compute at the problem, you need to cater to the lowest common denominator: people like me, who have tons of compute but don't have the time to get a math degree or plow through forums for information.

        Imagine if in order to use a web browser, you needed to understand every single underlying protocol first.

iyn 2 days ago

Wow, congrats!

Indeed, I’m curious why you’ve used OpenCL. And what was the hardware/general setup used for finding the prime?

What was your motivation behind building this software?

  • mpreda 2 days ago

    OpenCL works on both AMD and Nvidia GPUs with mostly the same source code. By supporting at-runtime compilation, it allows a lot of code particularization/instantiation before compilation, which reduces the cost of the generated code. In general, OpenCL is close enough to the HW, and the generated code keeps improving over time (LLVM).
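
    As a small illustration of the runtime-specialization point (a hedged sketch; FFT_SIZE and WORD_BITS are invented defines, not GpuOwl's actual options):

       // Hedged sketch: because OpenCL compiles kernel source at runtime,
       // per-exponent constants can be baked in as compile-time defines that
       // the optimizer can fold, instead of being passed as kernel arguments.
       #include <CL/cl.h>
       #include <string>

       cl_program buildSpecialized(cl_context ctx, cl_device_id dev,
                                   const char* src,
                                   unsigned fftSize, unsigned wordBits) {
           cl_int err = CL_SUCCESS;
           cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
           std::string opts = "-DFFT_SIZE=" + std::to_string(fftSize) +
                              " -DWORD_BITS=" + std::to_string(wordBits);
           clBuildProgram(prog, 1, &dev, opts.c_str(), nullptr, nullptr);
           return prog;
       }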

    Motivation: a long time ago I had an AMD GPU and no way to run an LL test on it, so I decided to write my own. And I was hooked by the power of the GPU and the quest for an ever more efficient, faster implementation.

  • mpreda a day ago

    The HW setup for finding the prime was Nvidia and AMD GPUs with good FP64, in the cloud, using "spot" instances for a better price. This allowed scaling up quickly to many GPUs, and it did have a significant cost.

    My personal setup is 8x Radeon Pro VII, which also provide heating during the cold season. During summer the effort goes into removing the excess heat, and the GPUs run in a reduced-power mode (slower & more efficient).

DeathArrow a day ago

Why do you use OpenCL instead of CUDA?

  • mpreda a day ago

    Indeed, CUDA is nice due to the way it uses C++, integrates host and GPU code in a single file, and makes compilation convenient. Basically I think CUDA is a bit easier to start with than OpenCL.

    OTOH CUDA only works on Nvidia, and that's a major limitation.

    GpuOwl uses FP64 ("double" floating point) heavily, and FP64 is more readily available at consumer prices on AMD GPUs. We (the GIMPS project) use a lot of Radeon VII and Radeon Pro VII GPUs, which have great FP64 at a cheap price (I am personally running 8x Radeon Pro VII that I bought new for about $300 a piece).

    So you see, for us AMD GPUs are first-class citizens. Of course I want to support Nvidia GPUs as well, and OpenCL allows that. Luke Durant ran GpuOwl on a lot of Nvidia GPUs in the cloud, and I'm happy GpuOwl worked well for him on Nvidia.

    • Resolver a day ago

      Are there any potential benefits of using CUDA instead of OpenCL on Nvidia GPUs? Like, better driver support, ability to utilize Nvidia-specific features?

      The Nvidia A100 GPU that was used to find the new Mersenne prime has specialized dedicated hardware like tensor cores, which on the A100 work not only with FP16 and FP32 but also with FP64. Are there any benefits to utilizing these capabilities?

    • DeathArrow a day ago

      Thank you. It makes sense to use OpenCL if you have AMD GPUs in mind.

      I thought though that prospective HPC users have more Nvidia A100 and H100 in mind when buying hardware.

      • evanb a day ago

        GIMPS is not typically targeting HPC; it is typically targeting hobbyists who have spare cycles to burn.

primecurious 2 days ago

I'd also like to draw attention to the fact that a lot of this work was sponsored by IMC the market maker, Mihai's employer.

  • mpreda a day ago

    What?! This is absolutely not true. My open-source work was not sponsored by anyone. And IMC is not my employer. Really, how did you get this idea?

    • mpreda a day ago

      "primecurious", who are you, and what is the purpose of such statements? How would you know who is or isn't sponsoring my work?

      But just to set it straight: GpuOwl received exactly $0 in contributions or sponsorship from anybody. It's pleasure work on my side, and it's open-sourced to give curious minds easy access to the algorithms and techniques implemented. I did receive great help, in the form of source-code contributions, most importantly from George Woltman.