Karpathy's MicroGPT running at 50,000 tps on an FPGA
Posted by jawondo@reddit | LocalLLaMA | View on Reddit | 34 comments
Sure, it's only 4,192 parameters, but it's a start. Project write-up here: https://v2.talos.wtf/ and github repository here: https://github.com/Luthiraa/TALOS-V2
Some of the speed comes from having the weights onboard rather than in external memory. Onboard ROM means that with 16-bit weights current FPGAs max out at 20-30 million parameters, but maybe this and Taalas (https://taalas.com/ - the similar names are unlikely to be a coincidence) will lead to more onboard ROM appearing in FPGAs, or to FPGAs dedicated to SLMs.
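For a rough sense of that ceiling, here's a back-of-the-envelope sketch; the on-chip capacities below are illustrative assumptions, not specs for any particular part:

```python
# Back-of-the-envelope: how many parameters fit entirely in on-chip FPGA memory?
# The capacities below are illustrative assumptions, not vendor figures.

def max_params(onchip_bytes: int, bits_per_weight: int) -> int:
    """Parameters that fit if every weight lives in on-chip ROM/BRAM."""
    return onchip_bytes * 8 // bits_per_weight

MiB = 1 << 20
for onchip_mib in (4, 40, 64):      # assumed on-chip capacities
    for bits in (16, 8, 4):
        n = max_params(onchip_mib * MiB, bits)
        print(f"{onchip_mib:3d} MiB on-chip, {bits:2d}-bit weights -> ~{n / 1e6:.0f}M params")
```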
Song-Historical@reddit
There's so much potential with FPGA acceleration for local models it's nuts.
I've been trying to get people to pay attention to the HILOS and Hillinfer projects, which take SmartSSDs (basically an FPGA attached to flash storage) and offload all the memory-bound parts of LLM inference onto them, especially for long-context workflows. In theory there's no reason you couldn't make one in a form factor that fits in an AI accelerator, mini PC, or desktop/laptop you already have, then use it as a dedicated hardware solution for your KV cache while still allowing normal everyday use.
You don't necessarily need the FPGA to do all of the inference for tasks you want some degree of oversight over. This is very cool.
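To put numbers on why the KV cache is the part worth offloading, here's a rough sizing sketch; the model dimensions are assumptions for illustration, not any specific model:

```python
# Rough KV-cache sizing: the memory-bound part that grows with context length.
# Model dimensions below are assumptions for illustration, not a specific model.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # x2 for keys and values; fp16/bf16 elements by default
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GiB = 1 << 30
for seq_len in (8_192, 131_072, 1_000_000):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:>9,} tokens -> {size / GiB:.1f} GiB of KV cache")
```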
Bohdanowicz@reddit
This is the future: 50k tk/s local. Apps on demand. This will drive the AI OS.
Song-Historical@reddit
Why would you need a 50k tk/s transformer for apps?
is-this-a-nick@reddit
The potential is shit unless you would enjoy spending 10 times the money for 10% of the speed.
Like, seriously. FPGAs are cool, but just because you have a nice hammer doesn't mean it makes for a good paintbrush.
BringTea_666@reddit
>There's so much potential with FPGA acceleration for local models it's nuts.
The issue, as always, is that you build an FPGA design for something specific and half a year later that stuff isn't used anymore. You've wasted that time and no one will buy/use it.
Like right now, half a year for LLMs is an eon. In that half a year I switched my primary go-to model something like 5 times.
markole@reddit
For now. But for some use cases a model can stay useful for a long time with access to tools and MCP. If I had Gemma 4 31B at full precision running at thousands of tokens per second, I would find a way to steer it into usefulness even a year after its release.
Fit-Produce420@reddit
Maybe you're thinking of ASICs, where the design is hardcoded and the architecture is inflexible.
dqUu3QlS@reddit
An FPGA can be reconfigured to become any neural network architecture, or any other logic circuit, that will fit. The real issue is FPGAs' tiny memory capacity.
Queasy-Contract9753@reddit
Makes sense. Sounds kind of like DeepSpeed.
CircularSeasoning@reddit
I want my doorway sentinel robot to dance welcomingly for friends and kill bad guys who try to enter. Am I in the wrong sub?
Song-Historical@reddit
Probably? Circle back in a few months.
CircularSeasoning@reddit
> maybe check your water for heavy metals.
Why, are these valuable? I could do with some extra cash.
sandropuppo@reddit
Very cool project
YearnMar10@reddit
Maybe LLMs will become good at designing FPGAs, so that they can implement themselves on the silicon
JustFinishedBSG@reddit
That’s actually slow for 5k params you know
dqUu3QlS@reddit
I've experimented with FPGAs before, though not for running neural networks. Although FPGA block RAM is very fast, it's very small. Typical FPGAs have less than a megabyte of block RAM, so if you want a model with more than a few million parameters on an FPGA your options are:
dqUu3QlS@reddit
I underestimated the amount of memory higher end FPGAs have. The Alveo V80 has a total of ~84 megabytes of fast RAM. Still not enough for billion-parameter models but it's something.
Rasekov@reddit
That's not bad at all. If int4 could be made to work, that's a 120-140M parameter model with some small room for context, even with a ton of batched requests.
At 50-200K t/s that's a very nice embedding model, classifier, PII detector, ... a lot of the bulk work that doesn't always need models in the billions of parameters and can be useful at those speeds.
Not sure if it would be enough to recover the investment in HW, much less in developing the model, but such models could be useful.
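A quick sanity check on that figure; the split between weight budget and context budget here is an assumption for illustration:

```python
# Sanity check: how many int4 parameters fit in ~84 MB of on-chip RAM?
# The weight/context split below is an assumed budget, not a measured one.

total_mb = 84
weight_budget_mb = 70            # leave ~14 MB for KV cache / activations (assumed)
bits_per_weight = 4

params = weight_budget_mb * 1e6 * 8 / bits_per_weight
print(f"~{params / 1e6:.0f}M int4 parameters in {weight_budget_mb} of {total_mb} MB")
```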
Current_Ferret_4981@reddit
You forgot to include the guy who ran the comparison on a Mac (Studio?) and got something like 3M tps, because it isn't the hardware/logic that was actually giving you the speed here.
thomasthai@reddit
saw that guy on twitter, can confirm
Acrobatic-Desk3266@reddit
Could you link that please? Can't find it!
Current_Ferret_4981@reddit
https://x.com/i/status/2050706793899135240
stopnet54@reddit
Cool project. Does the software stack work for Xilinx FPGAs? It would be interesting to see whether renting AWS F1 instances with more hardware resources would scale to slightly bigger models.
I always thought the limitation was the amount of SRAM and DSP units, making it necessary to stream model weights from RAM in stages.
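When weights do have to be streamed, tokens/s is roughly bounded by memory bandwidth; here's a sketch with assumed round-number bandwidths showing why keeping weights on-chip changes the picture:

```python
# If every weight is read from external RAM once per token, throughput is
# bounded by memory bandwidth. The bandwidth figures are assumed round numbers.

def max_tok_per_s(params, bytes_per_param, bandwidth_gb_s):
    bytes_per_token = params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

for params, label in [(25e6, "25M params (on-chip class)"), (7e9, "7B params")]:
    for bw in (20, 460):                 # e.g. DDR4-ish vs HBM-ish, assumed
        tps = max_tok_per_s(params, 2, bw)
        print(f"{label:>26} @ {bw:3d} GB/s -> ~{tps:,.0f} tok/s ceiling")
```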
Yes_but_I_think@reddit
Please wake me up the day you have a hardware L3 cache the size of 32GB so that we can do inference at 5 million tokens/s. Till then these are PoCs which cannot scale. AT ALL. End of point.
OrphanedGland@reddit
I (well, Claude tbh) also ported microgpt to an FPGA to evaluate Claude's capabilities. It can also turn it into a full custom chip design.
Sufficient_Sir_5414@reddit
Really interesting direction: putting weights in onboard ROM is a big shift. It cuts memory latency and energy, not just raw speed.
If FPGA designs start optimizing for SLMs (as TALOS + Taalas hint), we could see a new class of ultra-low-latency, local-first AI.
Would love to see latency and energy-per-token benchmarks vs GPUs.
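For a feel of what those benchmarks would look like, here's the arithmetic with assumed placeholder power and throughput numbers (not measurements from TALOS or any GPU):

```python
# Energy per token = power / throughput. All figures below are assumed
# placeholders to show the arithmetic, not measurements.
scenarios = {
    "FPGA, assumed 25 W @ 50k tok/s": (25, 50_000),
    "GPU, assumed 300 W @ 100k tok/s (batched)": (300, 100_000),
}
for name, (watts, tok_per_s) in scenarios.items():
    print(f"{name}: {watts / tok_per_s * 1000:.2f} mJ/token")
```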
cvek101@reddit
Makes me wonder at what point we hit an FPGA size that becomes useful for speculative decoding of a larger model…
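For context, the draft-and-verify loop a fast FPGA model would slot into looks roughly like this; the toy "models" below are deterministic stand-ins, and in a real system the target verifies all draft tokens in one batched forward pass:

```python
# Toy sketch of speculative decoding: a cheap draft model (the FPGA's role)
# proposes k tokens, the big target model verifies them. The "models" here
# are deterministic stand-ins, not real networks.

def draft_next(ctx):                     # fast, approximate draft model
    return (sum(ctx) * 31 + len(ctx)) % 100

def target_next(ctx):                    # slow, accurate target model
    return (sum(ctx) * 31 + len(ctx) + (len(ctx) % 7 == 0)) % 100

def speculative_step(ctx, k=4):
    # 1) draft k tokens autoregressively (cheap)
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        draft.append(t)
        tmp.append(t)
    # 2) verify with the target: keep the longest agreeing prefix, then take
    #    the target's own token at the first mismatch. In practice this
    #    verification is one batched forward pass over all k positions.
    accepted, tmp = [], list(ctx)
    for t in draft:
        want = target_next(tmp)
        accepted.append(want)
        tmp.append(want)
        if want != t:
            break
    return accepted

print(speculative_step([1, 2, 3]))       # several tokens accepted per target pass
```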
CircularSeasoning@reddit
Gremlins.
Client_Hello@reddit
People downvoting don't know .... about ....
last_llm_standing@reddit
It looks super interesting, but it's hard to get into. For someone who is already familiar with the transformer architecture, where should they start on the prerequisites for this material?
CircularSeasoning@reddit
The fun is in the fundamentals: Ignore the math, say what's on your mind, last_llm_standing.
No_Hunter_7786@reddit
50k tps even on 4k params shows the raw potential of dedicated hardware. The SmartSSD angle is interesting too; offloading the KV cache to a flash-attached FPGA could be a real solution for long context without needing massive VRAM.