Karpathy's MicroGPT running at 50,000 tps on an FPGA
Posted by jawondo@reddit | LocalLLaMA | View on Reddit | 34 comments
Sure, it's only 4,192 parameters, but it's a start. Project write-up here: https://v2.talos.wtf/ and github repository here: https://github.com/Luthiraa/TALOS-V2
Some of the speed comes from having the weights onboard rather than in external memory. Onboard ROM means that with 16-bit weights current FPGAs max out at 20-30 million parameters, but maybe this and Taalas (https://taalas.com/ - the similar names are unlikely to be a coincidence) will lead to more onboard ROM appearing in FPGAs, or to FPGAs dedicated to SLMs.
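For a rough sense of that ceiling, here's a back-of-the-envelope sketch; the on-chip capacities below are illustrative assumptions, not specs for any particular part:

```python
# Back-of-the-envelope: how many parameters fit entirely in on-chip FPGA memory?
# The capacities below are illustrative assumptions, not vendor figures.

def max_params(onchip_bytes: int, bits_per_weight: int) -> int:
    """Parameters that fit if every weight lives in on-chip ROM/BRAM."""
    return onchip_bytes * 8 // bits_per_weight

MiB = 1 << 20
for onchip_mib in (4, 40, 64):      # assumed on-chip capacities
    for bits in (16, 8, 4):
        n = max_params(onchip_mib * MiB, bits)
        print(f"{onchip_mib:3d} MiB on-chip, {bits:2d}-bit weights -> ~{n / 1e6:.0f}M params")
```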
Song-Historical@reddit
There's so much potential with FPGA acceleration for local models it's nuts.
I've been trying to get people to pay attention to the HILOS and Hillinfer projects, which take SmartSSDs (basically an FPGA attached to flash storage) and offload all the memory-bound parts of LLM inference onto them, especially for long-context workflows. In theory there's no reason you couldn't make one in a form factor that fits in an AI accelerator, mini PC, or desktop/laptop you already have, then use it as a dedicated hardware solution for your KV cache while still allowing normal everyday use.
You don't necessarily need the FPGA to do all of the inference for tasks you want some degree of oversight over. This is very cool.
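To put numbers on why the KV cache is the part worth offloading, here's a rough sizing sketch; the model dimensions are assumptions for illustration, not any specific model:

```python
# Rough KV-cache sizing: the memory-bound part that grows with context length.
# Model dimensions below are assumptions for illustration, not a specific model.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # x2 for keys and values; fp16/bf16 elements by default
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GiB = 1 << 30
for seq_len in (8_192, 131_072, 1_000_000):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:>9,} tokens -> {size / GiB:.1f} GiB of KV cache")
```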
Bohdanowicz@reddit
This is the future: 50k tk/s local. Apps on demand. This will drive the AI OS.
Song-Historical@reddit
Why would you need a 50k tk/s transformer for apps?
is-this-a-nick@reddit
The potential is shit unless you would enjoy spending 10 times the money for 10% of the speed.
Like, seriously. FPGAs are cool, but just because you have a nice hammer doesn't mean it makes for a good paintbrush.
BringTea_666@reddit
>There's so much potential with FPGA acceleration for local models it's nuts.
The issue, as always, is that you build an FPGA design for something specific and half a year later that stuff isn't used anymore. You've wasted that time and no one will buy/use it.
Like right now, half a year for LLMs is an eon. In that half a year I switched my primary go-to model something like 5 times.
markole@reddit
For now. But for some use cases a model can stay useful for a long time with access to tools and MCP. If I had Gemma 4 31B at full precision running at thousands of tokens per second, I would find a way to steer it into usefulness even a year after its release.
Fit-Produce420@reddit
Maybe you're thinking of ASICs, where the design is hardcoded and the architecture is inflexible.
dqUu3QlS@reddit
An FPGA can be reconfigured to become any neural network architecture, or any other logic circuit, that will fit. The real issue is FPGAs' tiny memory capacity.
Queasy-Contract9753@reddit
Makes sense. Sounds kind of like DeepSpeed.
CircularSeasoning@reddit
I want my doorway sentinel robot to dance welcomingly for friends and kill bad guys who try to enter. Am I in the wrong sub?
Song-Historical@reddit
Probably? Circle back in a few months.
CircularSeasoning@reddit
> maybe check your water for heavy metals.
Why, are these valuable? I could do with some extra cash.
sandropuppo@reddit
Very cool project
YearnMar10@reddit
Maybe LLMs will become good at designing FPGAs, so that they can implement themselves on the silicon
JustFinishedBSG@reddit
That’s actually slow for 5k params you know
dqUu3QlS@reddit
I've experimented with FPGAs before, though not for running neural networks. Although FPGA block RAM is very fast, it's very small. Typical FPGAs have less than a megabyte of block RAM, so if you want a model with more than a few million parameters on an FPGA your options are:
dqUu3QlS@reddit
I underestimated the amount of memory higher end FPGAs have. The Alveo V80 has a total of ~84 megabytes of fast RAM. Still not enough for billion-parameter models but it's something.
Rasekov@reddit
That's not bad at all. If int4 could be made to work, that's a 120-140M parameter model with some small room for context, even with a ton of batched requests.
At 50-200K t/s that's a very nice embedding model, classifier, PII detector, ... a lot of the bulk work that doesn't always need models in the billions of parameters and can be useful at those speeds.
Not sure if it would be enough to recover the investment in HW, much less in developing the model, but such models could be useful.
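A quick sanity check on that figure; the split between weight budget and context budget here is an assumption for illustration:

```python
# Sanity check: how many int4 parameters fit in ~84 MB of on-chip RAM?
# The weight/context split below is an assumed budget, not a measured one.

total_mb = 84
weight_budget_mb = 70            # leave ~14 MB for KV cache / activations (assumed)
bits_per_weight = 4

params = weight_budget_mb * 1e6 * 8 / bits_per_weight
print(f"~{params / 1e6:.0f}M int4 parameters in {weight_budget_mb} of {total_mb} MB")
```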
Current_Ferret_4981@reddit
You forgot to include the guy who ran the comparison on a Mac (Studio?) and got something like 3M tps, because it isn't the hardware/logic that was actually giving you the speed here.
thomasthai@reddit
saw that guy on twitter, can confirm
Acrobatic-Desk3266@reddit
Could you link that please? Can't find it!
Current_Ferret_4981@reddit
https://x.com/i/status/2050706793899135240
stopnet54@reddit
Cool project. Does the software stack work for Xilinx FPGAs? It would be interesting to see whether renting AWS F1 instances with more hardware resources would scale to slightly bigger models.
I always thought the limitation was the amount of SRAM and DSP units, making it necessary to stream model weights from RAM in stages.
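When weights do have to be streamed, tokens/s is roughly bounded by memory bandwidth; here's a sketch with assumed round-number bandwidths showing why keeping weights on-chip changes the picture:

```python
# If every weight is read from external RAM once per token, throughput is
# bounded by memory bandwidth. The bandwidth figures are assumed round numbers.

def max_tok_per_s(params, bytes_per_param, bandwidth_gb_s):
    bytes_per_token = params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

for params, label in [(25e6, "25M params (on-chip class)"), (7e9, "7B params")]:
    for bw in (20, 460):                 # e.g. DDR4-ish vs HBM-ish, assumed
        tps = max_tok_per_s(params, 2, bw)
        print(f"{label:>26} @ {bw:3d} GB/s -> ~{tps:,.0f} tok/s ceiling")
```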
Yes_but_I_think@reddit
Please wake me up the day you have a hardware L3 cache the size of 32GB so that we can do inference at 5 million tokens/s. Till then these are PoCs which cannot scale. AT ALL. End of point.
OrphanedGland@reddit
I (well, Claude tbh) also ported microgpt to an FPGA to evaluate Claude's capabilities. It can also turn it into a full custom chip design.
Sufficient_Sir_5414@reddit
Really interesting direction: putting weights in onboard ROM is a big shift. It cuts memory latency and energy, not just raw speed.
If FPGA designs start optimizing for SLMs (as TALOS + Taalas hint), we could see a new class of ultra-low-latency, local-first AI.
Would love to see latency and energy-per-token benchmarks vs GPUs.
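For a feel of what those benchmarks would look like, here's the arithmetic with assumed placeholder power and throughput numbers (not measurements from TALOS or any GPU):

```python
# Energy per token = power / throughput. All figures below are assumed
# placeholders to show the arithmetic, not measurements.
scenarios = {
    "FPGA, assumed 25 W @ 50k tok/s": (25, 50_000),
    "GPU, assumed 300 W @ 100k tok/s (batched)": (300, 100_000),
}
for name, (watts, tok_per_s) in scenarios.items():
    print(f"{name}: {watts / tok_per_s * 1000:.2f} mJ/token")
```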
cvek101@reddit
Makes me wonder at what point we hit an FPGA size that becomes useful for speculative decoding of a larger model…
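For context, the draft-and-verify loop a fast FPGA model would slot into looks roughly like this; the toy "models" below are deterministic stand-ins, and in a real system the target verifies all draft tokens in one batched forward pass:

```python
# Toy sketch of speculative decoding: a cheap draft model (the FPGA's role)
# proposes k tokens, the big target model verifies them. The "models" here
# are deterministic stand-ins, not real networks.

def draft_next(ctx):                     # fast, approximate draft model
    return (sum(ctx) * 31 + len(ctx)) % 100

def target_next(ctx):                    # slow, accurate target model
    return (sum(ctx) * 31 + len(ctx) + (len(ctx) % 7 == 0)) % 100

def speculative_step(ctx, k=4):
    # 1) draft k tokens autoregressively (cheap)
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        draft.append(t)
        tmp.append(t)
    # 2) verify with the target: keep the longest agreeing prefix, then take
    #    the target's own token at the first mismatch. In practice this
    #    verification is one batched forward pass over all k positions.
    accepted, tmp = [], list(ctx)
    for t in draft:
        want = target_next(tmp)
        accepted.append(want)
        tmp.append(want)
        if want != t:
            break
    return accepted

print(speculative_step([1, 2, 3]))       # several tokens accepted per target pass
```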
CircularSeasoning@reddit
Gremlins.
Client_Hello@reddit
People downvoting don't know .... about ....
last_llm_standing@reddit
It looks super interesting, but it's hard to get into. For someone who is already familiar with the transformer architecture, where should they start on the prerequisites for this material?
CircularSeasoning@reddit
The fun is in the fundamentals: Ignore the math, say what's on your mind, last_llm_standing.
No_Hunter_7786@reddit
50k tps even on 4k params shows the raw potential of dedicated hardware. The SmartSSD angle is interesting too; offloading the KV cache to a flash-attached FPGA could be a real solution for long context without needing massive VRAM.