Great post! I've been looking to get into the guts of large-scale model training (I'm halfway between the design and application layers of LLMs, mostly in Python, sometimes a bit of C++) and this will be a great reference to have.
PS: I'd appreciate it if anyone can recommend more material like this.
Hi, I'm the author. Thanks for sharing, was great to wake up to my blog post on the front page! Would love to hear any feedback or if I missed anything.
You can probably get this to run slightly faster using CUDA graphs, which should reduce the amount of communication between the GPU and CPU. I notice that llama.cpp uses them, but glancing at your blog post, it appears that you do not.
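In case it's useful, here is a minimal sketch of what capturing a decode step into a CUDA graph can look like (decode_step, stream, and num_tokens are placeholders for whatever your forward pass already does, not names from your code):

    /* Record every kernel launch of one decode step into a graph once... */
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    decode_step(stream);                  /* kernels are recorded, not run */
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);

    /* ...then replay it with a single launch per token, so the CPU talks to
       the GPU once per decode step instead of once per kernel. */
    for (int t = 0; t < num_tokens; ++t) {
        cudaGraphLaunch(graph_exec, stream);
        cudaStreamSynchronize(stream);
    }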
You might want to consider switching to bfloat16. The model weights are published as bfloat16, and LLMs are more sensitive to the exponent bits than the mantissa bits, so using fp16 over bfloat16 should slightly hurt the model’s intelligence. Using bfloat16 would require you to use the tensor cores instead of the CUDA cores. Interestingly, llama.cpp performs very poorly when asked to use bfloat16, so performance comparisons of a bfloat16 build would be best done against llama.cpp running fp16, since there is no reason for fp16 and bfloat16 performance to differ other than a lack of optimization.
Using the tensor cores should not make your forward pass any faster despite the additional compute, since the GEMV operations your forward pass uses should be memory-bandwidth bound. However, if you ever decide to run multiple queries in parallel, you will see a performance boost from being able to use that additional compute via GEMM operations.
If you write a prompt processing version of the forward pass, you can use GEMM operations to process all prompt tokens simultaneously even without doing multiple queries in parallel. This would give you fast prompt processing speeds and open another front on which you can try to outdo llama.cpp.
If you ever want to do this in pure C, I show how here:
https://news.ycombinator.com/item?id=42432802
Technically, the device side code is in CUDA C, which is really C++, but no one has written a C compiler that can compile a lightly extended version of ANSI C to PTX, so the device code there must be C-style CUDA C, CUDA Fortran, or hand-written PTX assembly. I went with what was easiest, which was C-style CUDA C.
I tried this (and backprop) in GPGPU compute shaders once. One issue that really tripped me up is matrix multiplication along dimensions that are not powers of 2. It will take me a while to learn to read CUDA code -- is this just an easier thing to do in CUDA, or did you figure out a good trick for it?
I know a trick, although it relies on the closed source cuBLAS library. The matrix multiplication in a non-batched forward pass is really matrix-vector multiplication, but you can shoehorn it into matrix-matrix multiplication by setting the right dimension to 1. cuBLAS uses column-major ordering to match Fortran instead of the row-major ordering of C/C++, but you can compensate by swapping transa and transb, m and n, a and b, lda and ldb (and, if you are using cublasGemmEx() to use the tensor cores, atype and btype). This relies on the matrix identity (A * B)’ = B’ * A’. You can experiment with this on a CPU using a BLAS library like OpenBLAS or the Intel MKL before trying to do it on a GPU.
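For example, here is a sketch of the idea (W, x, y, M, and K are illustrative names for a row-major M x K weight matrix, its input vector, and its output vector, all already in device memory):

    #include <cublas_v2.h>

    /* y = W * x via GEMM: the vector becomes a matrix with one dimension set
       to 1, and swapping the operand order makes cuBLAS's column-major view
       compute y' = x' * W', which lands in the same memory as y = W * x. */
    void gemv_as_gemm(cublasHandle_t handle, const float *W, const float *x,
                      float *y, int M, int K) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    1, M, K,   /* m, n, k after the swap */
                    &alpha,
                    x, 1,      /* x viewed as a 1 x K column-major matrix */
                    W, K,      /* row-major W viewed as K x M column-major */
                    &beta,
                    y, 1);     /* result: M contiguous floats */
    }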
That said, to do matrix-vector multiplication (GEMV), you just need to compute the dot product of each row of the matrix with the vector to get your output vector. You could handle each dot product in power-of-2 chunks on the GPU to compute partial sums; when you reach the last chunk that is not a power of 2, just have the threads that go past the end of the row contribute 0 to the partial sums. Then have the threads do a horizontal sum of the partial sums to finish computing an entry of the output vector.
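As a sketch of that (illustrative names, one block per output row, 256 threads per block, not tuned for performance):

    /* y = W * x for a row-major M x K matrix where K need not be a power of 2.
       Each block handles one row; threads stride over the row in chunks, and
       the bounds check makes threads past the end contribute 0. */
    __global__ void gemv_any_k(const float *W, const float *x, float *y,
                               int M, int K) {
        int row = blockIdx.x;
        if (row >= M) return;

        float partial = 0.0f;
        for (int col = threadIdx.x; col < K; col += blockDim.x)
            partial += W[row * K + col] * x[col];

        /* Horizontal sum of the per-thread partial sums. */
        __shared__ float scratch[256];
        scratch[threadIdx.x] = partial;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                scratch[threadIdx.x] += scratch[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) y[row] = scratch[0];
    }
    /* launch as e.g. gemv_any_k<<<M, 256>>>(W, x, y, M, K); */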
As for matrix-matrix multiplication (GEMM), that is much harder to do in a performant manner. If you treat the multiplication as a series of matrix-vector multiplications, you will be bandwidth limited rather than compute limited. You would need to implement tiling to get good performance, but there are many tricks needed and it is very hard to outperform cuBLAS on Nvidia hardware. Here is one person’s attempt to do it:
https://siboehm.com/articles/22/CUDA-MMM
Here is a fairly cryptic description of how to get close to cuBLAS that mentions a few key techniques:
https://accu.org/journals/overload/32/181/schuetze/
Here is another person’s attempt, which actually managed to outperform cuBLAS on the H100:
https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...
Off topic, but would you consider adding an RSS feed to your blog?
Thank you for writing this!
Excellent, amazing article.
To the author, if you're lurking here, I have a tangential question: how long did it take you to write this article, from the first line of code to the last line of this post?
As someone who works in the GPGPU space, I can imagine myself writing an article of this sort, but the huge uncertainty around the time needed has deterred me so far.
Why not give it a shot and see how it goes? You should be able to have some idea if you want to proceed within an hour or two.
I'd say this was a bit over 3 weeks of full time coding, learning, and writing this blog post spread out over 1.5+ months. Someone with GPGPU/CUDA experience would've been able to do it faster. But I'd echo the other commenters here - just try writing some content and see if it works for you!
I don't think this code can make use of the tensor cores, or the wgmma instructions that you typically need to get peak performance out of them.
Programming these is a nightmare as you need to have several in flight concurrently for peak performance.
Perhaps you don't need the extra flops as you end up bandwidth bound?
Regardless, the good thing about the code in the blog is that it'll probably work pretty well for other accelerators if you port it to HIP or similar. If you use wgmma, I'm not sure it'll even be portable across Nvidia generations.
For latency-bound inference (i.e. one request) you don't need tensor cores, since all your operations are just matrix-vector multiplications.
Good point yes. That explains why he's getting performance similar to the leading frameworks. Those tensor operations are helpful for training or for throughput-optimised batched inference but not really for a batch size of one.
I actually didn't know that. I'm in the space as a hobbyist and I had a vague understanding that tensor cores are essential for reaching peak performance, but can only work for certain operations like dense matrix-matrix multiplication. It was on my list to investigate whether they could be used to further improve single-batch decoding - makes sense that they don't help when it's all matrix-vector.
I wonder how the perf in tokens/second compares to my version of Mistral: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...
BTW, see that section of the readme about quantization: https://github.com/Const-me/Cgml/tree/master?tab=readme-ov-f...
This is great, thank you!
Does anyone know of something similar in Python? I want to share with my team something like this that goes into (almost) everything (at least conceptually) needed to efficiently serve an LLM.
It doesn't actually need to be performant, mind you (it's in Python); I just need something "conceptually complete" while being more "tutorial style" and concise than the vLLM codebase.
Using JAX, you should be able to get good performance out of the box.
I notice that the CUDA example uses C++. I had not planned to publish my own work in this area yet (as I am still working on it), but if anyone wants fast llama inference using CUDA in C, here are some files:
It is a fork of:
https://github.com/jameswdelancey/llama3.c
I am compiling and running it this way:
nvcc -ptx -arch=sm_86 rung.cu -o rung.ptx
gcc -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/lib64 -O3 -g -lcuda -lcublas -lcudart -lm -lmvec -o rung rung.c
./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"
I am getting around 48 to 49 tokens per second with bf16 on my 3090 Ti from this. Oddly, llama.cpp only gets around 1 token per second with bf16, but it gets around 51 to 52 tokens per second with fp16. I suspect the performance gap would close if I use CUDA graphs.
The code is ugly in comparison to the original and I am not finished optimizing it, which is why I have not yet pushed it to a GitHub branch. I still have a few ways to reduce the amount of communication between the GPU and CPU per token before using CUDA graphs: using the batched API for computing K and V, prefilling an array of pointers for the batched API across all layer iterations (so that all CPU-to-GPU pointer copies for the batched API are aggregated), and a few opportunities for kernel fusion, especially in rmsnorm once I figure out how to calculate the sum of squares quickly without cuBLAS. I also have yet to try using Nvidia’s profiler.
Technically, CUDA C is really extended C++, but I am using a C-like subset of C++ for the device code. For the host code, I have written it in ANSI C. There are a few artifacts (the __host__ and __device__ keywords) from an earlier version where I had tried using CUDA C before I realized it was really extended C++ and not extended C, but those are trivial to remove (and will be removed before I push it to a branch). It should be possible to compile the device code with Clang instead of Nvidia’s compiler to use a fully open source toolchain, although I have yet to try it.
I have another fork here that uses the MKL to make the prompt processing portion run faster:
https://github.com/ryao/llama3.c
When I am happy with the CUDA version, I will either push it to a branch or to its own set of files in master.
Great article. I think what you should cover next is collective matmuls and sharding.
Isn’t __shfl_down not recommended these days because of warp synchronization issues?
Oops, you're right and it's a difference between my blog post and source code. It should be __shfl_down_sync as seen [here](https://github.com/andrewkchan/yalm/blob/8c908f23f5d8cc3f14c...)
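For anyone reading along, the _sync variants just take an explicit mask of the participating lanes; a typical warp sum looks something like this (illustrative sketch, assuming a full warp is active):

    __device__ float warp_sum(float val) {
        /* 0xffffffff = all 32 lanes participate in the shuffle */
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffffu, val, offset);
        return val;  /* lane 0 ends up with the sum of the whole warp */
    }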
What are the prerequisites for this kind of thing? I've written ANNs back in college and understood backpropagation and gradient descent at some point, but I don't know most of the terms mentioned in the architectural overview. How big of an investment is this?
super cool post!!