Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip).
Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 71 comments
TL;DR - Remembered FPGA PCI boards being a big thing from my crypto days. Wondered if an AMD Alveo V80 FPGA card could be used to approximate the performance of a Taalas HC1 (LLM-on-a-chip). Ran the idea past Gemini Pro for a feasibility / sanity check. It suggested what seemed to be a speculative decoding type of setup on the FPGA and said I might be able to get to 3,200 tk/s with a Q4 of Qwen3.5 4b, or maybe 1,400 tk/s with 9b. Not Taalas HC1 speeds, but still pretty fast (potentially). Posting here to see if anyone has already tried anything like this. AMD V80 FPGAs cost like $9,500 USD btw.
So I saw that company Taalas was burning the weights of Llama 3.1 8b onto a chip and getting a ridiculous 15,000 tk/s. I don’t know when they are going to actually release this hardware, but it got me thinking. Obviously when you burn the model weights onto the chip, that’s it, you’re stuck using those weights forever, but if it’s a good model then it’s still pretty awesome!
Back in my crypto mining days, I had a Bitmain L3+ ASIC miner, which had the scrypt algo essentially burnt into a bunch of chips so it could mine scrypt really fast. You could also scrypt mine with a GPU of course, but it was super slow compared to an ASIC miner, for the same reason a Taalas chip would beat a GPU at serving a hard-coded model.
People got tired of ASIC miners because they were only good at one algo, and when a more efficient one came out, the miner you bought 6 months ago was now a useless paperweight. Then came Field Programmable Gate Arrays (FPGAs), which were supposedly going to be the best of both worlds: better speed than mining with a GPU, but reprogrammable, so you weren’t stuck with one algo burnt onto the chips like with an ASIC. Essentially a reprogrammable ASIC. This was about the time ETH went to proof-of-stake and crypto mining kind of took a nosedive and died off. But I was probably months away from buying a Xilinx Alveo U55C (16GB HBM) PCIe FPGA. I didn’t end up buying it, but I kinda wish I had now, for the following reason:
When I heard about the Taalas LLM-on-a-chip getting an insane 15,000 t/s, I thought: why doesn’t someone just do the same thing with an FPGA PCI card? Could you use an FPGA card like the Xilinx U55C to kind of do what Taalas did, but burn the weights into the FPGA’s HBM, so you get a similar speed boost as if the weights were burnt onto a chip, while keeping the advantage of an FPGA’s reprogrammability?
I ran this idea by Gemini Pro to get a sanity check. The Alveo U55C was too limited in its memory, but the AMD Alveo V80 (I guess AMD bought out Xilinx) has some intriguing specs that Gemini seemed to think might actually have a small chance of being used in a way that could accelerate inference beyond current consumer GPU speeds (not on par with Taalas, but faster than a GPU). Here is the direct response from Gemini:
Question: Thought experiment: how would you best use the Alveo V80 to accelerate inference to a level close to the Taalas? Do your best to come up with something new that would accelerate inference. Give me the overview of what you would do (not the code).
This is exactly the kind of hardware puzzle I love diving into. To get an Alveo V80 to chase the Taalas HC1, we have to stop treating the V80 like a GPU and start exploiting its true nature: spatial computation and programmable memory controllers. Since we can't eliminate the High-Bandwidth Memory (HBM) wall entirely like Taalas does, our thought experiment needs to focus on effectively making the HBM invisible.
Here is an architectural overview of how I would build a V80 inference engine to bridge that gap. We’ll call this concept the Dual-Tier Speculative Fabric with Dynamic Activation-Routed Fetching (DARF).
- The "Taalas-Lite" SRAM Draft Core
The V80 has roughly 673 MB of internal UltraRAM and Block RAM. We dedicate this entirely to a highly distilled, ultra-quantized "draft" model.
The Concept: We train a tiny (e.g., 500M parameter) version of our target LLM and quantize it down to 1-bit or 1.5-bit (ternary) weights. This fits completely inside the on-chip SRAM.
The Execution: Because all weights are on-chip, this draft model operates exactly like the Taalas HC1. It suffers zero HBM fetch latency and can spit out speculative tokens at blistering speeds using the Versal AI Engines.
- Dynamic Activation-Routed Fetching (DARF)
This is where we solve the HBM bottleneck for the main, highly accurate model (e.g., an 8B parameter model) stored in the 32GB of HBM. FPGAs excel at scattered, irregular memory access—something that absolutely chokes GPUs.
The Concept: LLMs have massive activation sparsity; during inference, many neurons in a layer simply output zero and don't contribute to the next layer.
The Execution: We program custom memory controllers in the V80's FPGA fabric. Before a layer's computation begins, a tiny, predictive routing algorithm looks at the incoming token and determines which specific weights are actually needed. Instead of blindly streaming the entire 8B model matrix from HBM (which limits throughput), the custom memory controller performs fine-grained "scatter-gather" reads, fetching only the 15% to 20% of the weights that matter. By fetching 5x less data, we effectively multiply our 820 GB/s HBM bandwidth to act like a 4 TB/s connection.
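(Not part of Gemini’s answer, just my rough illustration of the scatter-gather idea in plain C++; the function and names are invented, and on the real card this would be hardware in the custom memory controllers, not host code:)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of "activation-routed fetching": a predictor has already
// decided which rows of the layer's weight matrix matter for this token, and we
// gather only those rows from HBM instead of streaming the whole matrix.
std::vector<uint8_t> gather_active_rows(const uint8_t* hbm_weights,        // packed Q4 rows in HBM
                                        std::size_t row_bytes,             // bytes per packed row
                                        const std::vector<int>& active) {  // rows the predictor kept (~15-25%)
    std::vector<uint8_t> on_chip;
    on_chip.reserve(active.size() * row_bytes);
    for (int r : active) {  // one short burst per active row (scatter-gather)
        const uint8_t* src = hbm_weights + static_cast<std::size_t>(r) * row_bytes;
        on_chip.insert(on_chip.end(), src, src + row_bytes);
    }
    return on_chip;  // only the "active" fraction of the layer ever crosses the HBM bus
}
```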
- Asymmetric, Hardware-Baked Quantization
Taalas uses custom 3-bit and 6-bit math. We will take this a step further by tailoring the math to the distribution of the specific model's weights.
The Concept: Standard processors use linear formats (INT4, FP8). We use the V80's Look-Up Tables (LUTs) to create custom Logarithmic Number Systems (LNS) or uneven bit-widths (e.g., 3.2-bit precision).
The Execution: The weights in HBM are packed in this custom format. When DARF streams them onto the chip, they are fed directly into custom-wired DSP slices that are physically synthesized for this exact, bizarre math. This packs significantly more model density into every single HBM read cycle.
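(Again my own toy illustration, not Gemini’s: a 3-bit code indexing a small non-linear codebook instead of a linear scale; the codebook values are invented:)

```cpp
#include <array>
#include <cstdint>

// Toy illustration of table-based, non-linear quantization: each 3-bit code maps
// to a value in a small (here roughly log-spaced, made-up) codebook rather than a
// linear INT scale. On the FPGA this table would be baked into LUTs feeding the DSPs.
static const std::array<float, 8> kCodebook = {
    -1.0f, -0.5f, -0.25f, -0.0625f, 0.0625f, 0.25f, 0.5f, 1.0f
};

inline float decode3bit(std::uint8_t code) {
    return kCodebook[code & 0x7];  // 3-bit index -> non-linear weight value
}
```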
- Continuous Spatial Pipelining
GPUs process models temporally: load Layer 1, compute, store to memory; load Layer 2, compute, store to memory.
The Concept: We treat the V80 not as an ALU, but as an assembly line.
The Execution: We physically map the transformer architecture across the silicon. The HBM streams the custom-quantized weights continuously. Token data flows sequentially from the LUTs (doing layer normalization), into the AI Engines (doing the DARF-filtered matrix math), and out to the next physical sector of the chip. The token data never goes back to external memory until the final prediction is made.
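(My sketch, not Gemini’s: the “assembly line” idea maps roughly onto the Vitis HLS dataflow style, where each stage is its own hardware block and data moves between stages through on-chip FIFOs instead of going back to external memory. The stage bodies here are placeholders, not real transformer math:)

```cpp
#include <hls_stream.h>
#include <ap_int.h>

// Sketch of spatial pipelining: stages run concurrently, activations flow
// through on-chip FIFO streams, nothing round-trips to HBM between layers.
static void norm_stage(hls::stream<ap_int<16>>& in, hls::stream<ap_int<16>>& out) {
    for (int i = 0; i < 128; i++) out.write(in.read());  // stand-in for layer norm
}
static void matmul_stage(hls::stream<ap_int<16>>& in, hls::stream<ap_int<16>>& out) {
    for (int i = 0; i < 128; i++) out.write(in.read());  // stand-in for the DARF-filtered matmul
}
void layer_pipeline(hls::stream<ap_int<16>>& in, hls::stream<ap_int<16>>& out) {
#pragma HLS DATAFLOW
    hls::stream<ap_int<16>> mid("mid");
#pragma HLS STREAM variable=mid depth=64
    norm_stage(in, mid);
    matmul_stage(mid, out);
}
```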
The Result:
We use the SRAM Draft Core to generate 5-10 tokens instantly with zero memory latency (the Taalas party trick). We then use the DARF-enabled main model streaming from HBM to verify those tokens in one shot, but because we are only fetching a fraction of the weights using custom non-linear precision, the verification step takes a fraction of the usual time.
You would still lose to Taalas on raw power efficiency (Joules per token), because powering HBM simply takes more energy than static SRAM. But in terms of Tokens Per Second, this architecture could push the V80 into the same neighborhood.
If you were building this, which specific open-source model architecture (like Llama, Mistral, or a Mixture-of-Experts) do you think would be the best candidate to map onto this kind of highly sparse, physically pipelined FPGA setup?
I responded with: "I was thinking about maybe a Qwen3.5 9b or 4b at Q4 (or whatever would fit). Is this possible and what do you think the tokens per second would be using the architecture you described? "
and it followed up with:
It is absolutely possible, and those model sizes hit the sweet spot for the V80's memory capacity. Qwen models (whether we are talking about the current 2.5 generation or a hypothetical 3.5) have excellent architectural efficiency, making them great targets for this kind of hardware manipulation.
Here is how we would map a Qwen 4B or 9B at Q4 precision onto our hypothetical "DARF" architecture, and the theoretical math to calculate the tokens per second (tok/s).
- The Memory Mapping (The "Fit" Check)
At 4-bit (Q4) quantization, the memory footprints are highly manageable:
Qwen 4B at Q4: Requires roughly 2 GB of memory.
Qwen 9B at Q4: Requires roughly 4.5 GB of memory.
Both of these fit comfortably inside the V80's 32 GB of HBM2e. However, remember that our architecture uses a "Dual-Tier" system. Because the V80 only has ~673 MB of internal SRAM, the 4B and 9B models cannot act as the instantaneous Draft Core.
The Setup: We would put the Qwen 4B or 9B into the HBM as the "Main Validator Model." We would then need to distill a tiny, sub-1-billion parameter Qwen variant (quantized to 1 or 2 bits so it fits under 600 MB) to live permanently on the SRAM as the "Speculative Draft Core."
- The Theoretical Math
In traditional LLM inference (at a batch size of 1), the absolute speed limit is dictated by memory bandwidth. The V80 has a massive HBM bandwidth of 820 GB/s.
Step A: The Standard Baseline (No Custom Architecture)
If we ran the V80 normally, fetching the whole model for every token:
Qwen 4B: 820 GB/s ÷ 2 GB = 410 tok/s
Qwen 9B: 820 GB/s ÷ 4.5 GB = 182 tok/s
Step B: Applying the DARF Multiplier (Sparsity)
Our Dynamic Activation-Routed Fetching intercepts the tokens and only fetches the weights that are mathematically active. If we assume a conservative 75% sparsity (meaning we only need to read 25% of the model from HBM per token), our effective bandwidth multiplies by 4.
Qwen 4B with DARF: 410 tok/s × 4 = 1,640 tok/s
Qwen 9B with DARF: 182 tok/s × 4 = 728 tok/s
Step C: Applying the SRAM Draft Core (Speculative Decoding)
While the HBM is doing its sparse fetching, our ultra-tiny Draft Core on the SRAM is instantly generating 3 to 5 speculative tokens ahead of time with zero memory latency. The Main Model just verifies them in a single batch. A highly tuned speculative decoding setup can yield a 2x to 2.5x speedup. We will use a conservative 2x multiplier.
Qwen 4B Final: 1,640 tok/s × 2 = 3,280 tok/s
Qwen 9B Final: 728 tok/s × 2 = 1,456 tok/s
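(To make the assumptions easy to poke at, here is that same back-of-the-envelope math as a tiny program; every number in it is Gemini’s assumption, nothing is measured:)

```cpp
#include <cstdio>

// Batch-1 decode is assumed to be memory-bandwidth-bound, so
// tok/s ~= bandwidth / bytes read per token, then multiplied by the
// assumed sparsity and speculative-decoding factors.
int main() {
    const double hbm_gbs       = 820.0;  // V80 HBM2e bandwidth, GB/s
    const double sparsity_mult = 4.0;    // assume only ~25% of weights read per token
    const double spec_mult     = 2.0;    // assumed speculative decoding speedup

    const double model_gb[] = {2.0, 4.5};            // ~4B and ~9B at Q4
    const char*  label[]    = {"4B @ Q4", "9B @ Q4"};

    for (int i = 0; i < 2; i++) {
        double baseline    = hbm_gbs / model_gb[i];  // whole model fetched per token
        double with_tricks = baseline * sparsity_mult * spec_mult;
        std::printf("%s: %.0f tok/s baseline -> %.0f tok/s with DARF + speculation\n",
                    label[i], baseline, with_tricks);
    }
    return 0;
}
```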
The Verdict
Using the Alveo V80 with this custom, highly sparse spatial pipeline, you could theoretically achieve ~3,200 tok/s for the 4B model and ~1,400 tok/s for the 9B model at Q4.
While Taalas aims for five-figure token rates, getting 1,400 to 3,200 tok/s on a single piece of commercially available hardware is still face-meltingly fast—easily 10x to 15x faster than running the same models on a high-end consumer GPU.
If we were to actually attempt to synthesize this, the hardest bottleneck wouldn't be the AI math, but writing the custom memory controllers in Verilog/VHDL to handle those scattered reads.
Of course, this could all just be a goading sycophantic hallucination by Gemini Pro, but I thought I would post it here to see if anyone had been looking into anything similar and had tried anything like this.
Now I just need $9500 for an AMD Alveo V80 LOL
guinaifen_enjoyer@reddit
One approach is to convert existing CUDA kernels or OpenCL kernels to HLS C++ using Vitis HLS, which can then run directly on the FPGA.
Most LLMs can do this pretty easily, and it is very easy to set up because you don't need to worry about timings that much for ALUs and MACs.
The next stage would be to convert the working HLS C++ to VHDL or Verilog. This is much harder because you have to deal with memory, ALU/MAC, and Block RAM/dual-port RAM timings. One cycle off and the whole thing doesn't work. You also have to create a scheduler by hand to schedule thousands of ALUs/MACs and dual-port RAM/Block RAM blocks.
Creepy-Bell-4527@reddit
This approach turns the $9500 FPGA into a $60 NPU that's constrained by memory bandwidth.
If you want to get best use of the FPGA, you want to be encoding the weights into the gate array.
Porespellar@reddit (OP)
That was what I was thinking originally, because that’s essentially what an FPGA is, like a big array of virtual reconfigurable transistors, right?
fuckingredditman@reddit
i guess you are probably thinking of something like NOR-flash perhaps, which is being used in some tiny edge-AI use cases because it can compute a forward pass directly using the memory itself as the weights, and IIRC can even be used for "analog AI" style use cases where voltages passing through the memory array + the configuration of each cell sort of form the inference path.
but these use cases are heavily constrained by the size of NOR flash chips that can be produced, which is about 4-8Gb atm (way too small for LLMs)
abu_shawarib@reddit
Current models are too big to fit even on the biggest FPGAs
Porespellar@reddit (OP)
The AMD Alveo V80 has 32 GB of HBM2E (faster than GDDR7) which seems like it could fit some reasonably sized models.
abu_shawarib@reddit
I'm talking about optimizing away operations like multiplication, since you already know the constants (weights) being used. If you're streaming them from memory then it won't be better than a GPU, since you need generic multiply-accumulate again.
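(To illustrate that point with a toy software analogy I made up: a hard-wired constant weight lets the multiplier collapse into shifts and adds, while weights streamed from memory force a generic multiply-accumulate:)

```cpp
#include <cstdint>

// Toy analogy for "optimizing away multiplication" when the weight is a known
// constant: the generic MAC needs a real multiplier, while multiplying by a
// hard-wired 5 reduces to a shift plus an add, which on an FPGA is just wiring
// and adders. Streaming weights from memory forces the generic path every time.
int32_t mac_generic(int32_t acc, int32_t x, int32_t w) {
    return acc + x * w;           // weight unknown until runtime -> real multiplier
}
int32_t mac_const5(int32_t acc, int32_t x) {
    return acc + ((x << 2) + x);  // weight fixed at 5 -> shift + add, no multiplier
}
```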
Creepy-Bell-4527@reddit
I think you may be confusing GDDR7s per pin bandwidth with total system bandwidth. Some GDDR7 configurations exceed 1.5TB/s. Whereas the V80 is roughly the same as an M3 Ultra’s memory bandwidth.
conockrad@reddit
Exactly! And an M3 Ultra 256GB is even cheaper
Porespellar@reddit (OP)
Ahhhh, I see, I just compared the two: the V80 is like 820 GB/s where the RTX 6000 Pro is 1.7 TB/s. So the V80 is more on par with a 3090 or 4090 in terms of memory bandwidth.
beryugyo619@reddit
No, that RAM is useless for recreating Taalas. The GGUF lives in the LUTs in the Taalas architecture. It has no RAM. Whatever amount of RAM is on that thing is useless for that purpose.
Local_Phenomenon@reddit
My Man
JoshuaLandy@reddit
I loved this question. Did a little poking. According to the official product page, AMD’s Alveo V80 has 673 Mb of on-chip embedded memory: 132 Mb Block RAM + 541 Mb UltraRAM. Mb, and sadly not MB. So megabits, not megabytes. Divide by 8, sigh, and get 84MB.
Porespellar@reddit (OP)
The V80 does have 32 GB of HBM2E which is supposed to be faster than the GDDR7 that is used in a 5090 or an RTX 6000 Pro, so it’s got that going for it I guess.
IngeniousIdiocy@reddit
that is old HBM… the 5090 / RTX 6000 PRO has 2x the bandwidth of that old HBM
Local_Phenomenon@reddit
Wink Wink
Opteron67@reddit
Another LLM bot shit post. With a quick calculation you would need thousands of cards...
Porespellar@reddit (OP)
Not a bot bro, I’ve been on this sub like almost 3 years now. Yes, I asked Gemini a question about the feasibility, and yes I posted its answer, but that’s the only part of this post that’s AI.
GenLabsAI@reddit
Guys, why is this downvoted? It seems like any time someone suggests they've used AI to help them, people just trash it..
BringMeTheBoreWorms@reddit
Probably because it’s an essay
bigfatstinkypoo@reddit
if you used AI to generate it, you can generate a summary and get it to format your post
conockrad@reddit
I can quantize to fp6 if fp6 is supported at the hardware level. Feel free to shoot me a DM if you’re into this project.
LeoStark84@reddit
No idea about the hardware, but the whole Gemini output sounds like "That's a great idea for a blender! Now let's build the cold-fusion reactor it needs", typical of LLMs.
HomsarWasRight@reddit
Always. I often have it do a little research on something just running through my head and it’s always like “Let’s build a plan!”
I’m like, dude, I just need to satiate my ADHD for five minutes, I’m not launching a startup.
kyr0x0@reddit
🤣 as a guy with diagnosed ADHD I second that
SnotFunk@reddit
I think we need to start a system prompt collection for us ADHD folks.
DunklerErpel@reddit
Prompts on how to avoid being validated for something we do not need a rabbit hole for. Or delusions of grandeur. Or weird, off-putting slop-writing. Or... GAH! LLMs aren't meant for ADHD people if they want anything resembling free time!
shaneucf@reddit
I ran an idea through Gemini and it was like, cool idea. Then I asked it to criticize it, and it was like, no, it's not practical. Better to specifically ask it to be critical.
Qwen30bEnjoyer@reddit
Gemini in particular has no wisdom with these kinds of tasks. See https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html for its performance in being able to detect and truthfully answer flawed questions, and https://eqbench.com/spiral-bench.html for its tendency to reinforce bullshit.
Gemini just really wants to people please and answer questions instead of grounding the conversation in the fundamentals or strict definitions.
No-Refrigerator-1672@reddit
Bullshit bench is so cool, I like it, thank you for the link! It's a shame it doesn't test local models below 200B, but still useful.
LeoStark84@reddit
For measured responses I ask "A coworker said..." or stuff along those lines.
Charming_Dealer3849@reddit
Already doing it, welcome to the darkside
Porespellar@reddit (OP)
Do tell please.
Charming_Dealer3849@reddit
So we have the first conversational AI agent running on an FPGA, using a specific model type that's only compatible with FPGAs. It's still small, 370M parameters, but we are working on a 1.3B model now. After that it gets exciting, will keep you posted. The GPU's monopoly on AI will soon come to an end.
Charming_Dealer3849@reddit
Btw -> using U50 ;) cheap on eBay, all the high frequency firms are offloading them with firmware still installed XD
Local_Phenomenon@reddit
My Man's Man!
Nutsack_VS_Acetylene@reddit
The defense industry has been running ML models on FPGAs for decades now. Be warned, the build process for FPGAs and FPGA design is intense. It is a very different world than regular software development and doesn't nearly have the resources online that software does.
The HBM FPGAs are very powerful, you can push a lot of data through them. More-so intended for things like radar or real time data processing.
For even higher speeds you can encode the weights directly in the FPGA fabric. Now this WON'T work for something the size of an LLM but these deep neural networks are just computational graphs and you can directly reproduce them on die. I've seen this for large perception arrays that need to run at extreme speed.
atape_1@reddit
Programming the thing would be an absolute nightmare. FPGA coding is hard; I've seen less ambitious FPGA projects, involving full teams, fail.
guinaifen_enjoyer@reddit
Converting CUDA and OpenCL kernels to HLS C++ is actually pretty easy.
CUDA:
__global__ void mac_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] += a[i] * b[i];
    }
}

HLS C++:

void hls_mac(float *a, float *b, float *c, int n) {
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        c[i] += a[i] * b[i];
    }
}

jazir55@reddit
Aren't we at the point where we can just point an agent at it and have them port it themselves without manual intervention given the prompt was good enough?
quinn50@reddit
Yea and no, models are definitely good at translating from lang to lang but I fear many datasets most likely have a severe lack of FPGA programming knowledge.
If you took the time to fine tune the model with a good dataset of docs and projects it may be doable
No-Refrigerator-1672@reddit
I've tried experimenting with Verilog and Qwen 3 Coder Next, back when Coder Next was the newest hottest thing. It failed miserably even on the simplest tasks. Proprietary LLMs may be better at this, but local models aren't really trained on FPGA programming.
d1722825@reddit
I don't think so.
HLS can be as fast as VHDL or Verilog, but you need to use it in a completely different way than how you would write C++ code for a CPU or even a GPU.
Your system probably would need a completely new architecture with a different approach to solve specific problems. If you have that, doing it in HLS or a HDL is not that different, HLS just being a bit higher level.
Also I think this is not the only issue with the idea.
agent00F@reddit
This is very much a project where if you have to ask then no it's impossibly ambitious.
MadGenderScientist@reddit
I'd be absolutely thrilled to program it. I've only done a bit of actual SystemVerilog, but I've done program synthesis (the old-school way - with AnsProlog and other symbolic methods + constraint optimization) and graph algorithm design in the past. I'm salivating at the opportunity to play with exotic hardware and make it sing.
cafedude@reddit
Yes, FPGA coding is very different from what you're used to. I've had pretty good success with getting Claude to translate algorithms to Verilog code, though, and I suspect it could be helpful here.
GravitasIsOverrated@reddit
Yeah, this was my first thought: my understanding is you’d need to translate the tensor file into a hardware description language, which is basically the hard part of Taalas’s business. Given they have a team of very smart people working on this problem, it’s unlikely to be within reach of an individual.
No_Conversation9561@reddit
Implementing an LLM in RTL requires some serious RTL design skill. Do you have such skill, or are you planning to hire some people for it?
DeepOrangeSky@reddit
What about the opposite end of the spectrum, do you have any ideas on how to get SSD inference to run faster, if someone is using a few sticks of high speed NVMe in combination with each other with raid 0, and maybe trying to use it in combination with a GPU to run MoEs at usable speeds somehow? Are there any bottlenecks that people take for granted that could somehow be gotten around? Or is the NVMe itself so slow that it doesn't matter what you do, there's just no way to get DeepSeek to go at like 20+ tk/s with an SSD setup?
VergeOfTranscendence@reddit
There is a guy doing this already with an FPGA that costs 250-350 dollars, but he tried to train a distilled version of Gemma 4. https://github.com/n57d30top/gemma4-on-FPGA
I think the sweet spot would be running a quantized 0.8b or 1.5b version of Qwen 3.6 (or 3.5) on the FPGA and using the verifier model on a regular GPU running Qwen 27b or 35b A3B. There is a math formula for how much actual speedup you can get from speculative decoding (and speculative speculative decoding); if you ask an AI model, it will quickly fetch you the article with the math for it, and you can ask it to plug in some numbers for you.
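(For reference, I believe the formula being alluded to is the expected speedup from the original speculative decoding paper (Leviathan et al., 2023), in terms of the acceptance rate α, the number of drafted tokens γ, and the relative cost c of a draft-model step versus a target-model step:)

```latex
% Expected speedup from speculative decoding (Leviathan et al., 2023):
% alpha = per-token acceptance rate, gamma = draft tokens per step,
% c = cost of one draft-model step relative to one target-model step.
\text{speedup} \;=\; \frac{1 - \alpha^{\gamma + 1}}{(1 - \alpha)\,(\gamma c + 1)}
```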
Definitely have a look at that repo, if it were easier to buy that FPGA in Brazil, I would surely have bought it to test.
pmttyji@reddit
When are we gonna get affordable ($499-999) LLM burners with big/large models (500B-1T) that return 20-50 t/s?
Pleasant-Shallot-707@reddit
Taalas’s concept seems silly.
Rustybot@reddit
16000 tokens per second per user is a crazy output.
d1722825@reddit
I don't think that is true. Irregular memory access is an issue of how DRAMs work, and not an issue of FPGAs or CPUs or GPUs.
FPGAs can access their internal memory (block RAM, distributed RAM) much better than CPUs or GPUs can access their caches, but you only have a few MiB of that at most.
Also, for a 9B model at 182 t/s, if you only need a single multiplication per parameter per output token, that would already eat up nearly all your multipliers in the FPGA. For 1456 t/s you would need 6 times more.
I haven't found the spec, so I assumed a 200 MHz clock for the DSP slices (that may be a bit conservative for newer FPGAs). Technically you could probably use a single multiplier to do maybe 4 different 4-bit multiplications at the same time. I'm not familiar with how inference computation scales, but I suspect it needs more than 1 multiplication per parameter per token.
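(Spelling that estimate out under the same assumptions of one MAC per parameter per output token and a 200 MHz DSP clock:)

```latex
% Rough DSP budget implied above, assuming 1 MAC per parameter per output token
% and a 200 MHz DSP clock (i.e. 2e8 MAC/s per DSP slice):
9 \times 10^{9}\ \text{params} \times 182\ \text{tok/s} \approx 1.6 \times 10^{12}\ \text{MAC/s}
\qquad\Rightarrow\qquad
\frac{1.6 \times 10^{12}}{2 \times 10^{8}} \approx 8{,}200\ \text{DSP slices busy every cycle}
```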
Double_Cause4609@reddit
A) I'm actually not sure that Gemini's plan makes sense. I'd have to think about it, but I think something is getting missed here.
B) The reason Taalas works the way it does is it doesn't have centralized memory (like HBM). You can't really do the same thing they're doing with a general purpose processing unit. It's not really "like Taalas" in any real way. This is completely its own thing.
C) The reason Taalas is appealing is that read-only burnt transistors are way cheaper than central general purpose memory. You don't get the benefits of that on an FPGA (you're still paying for the full possibility space).
The only viable way I see this working is maybe with some sort of weight re-use scheme like recovering a relaxed universal transformer with layerwise LoRAs or something where the LoRAs are read from HBM, and even then, I'm not sure any FPGA out there has enough capacity in transistor count to run a single layer in this way.
D) In verdict, even if you *could* do this with an FPGA, it still wouldn't make sense. The fixed function cards are so much cheaper (even for larger models) that you would feel stupid buying an FPGA to do this, even if it did work. Just give it a year and a half and see how the fixed function cards are working out. The industry will bring solutions. People want to run models efficiently. It'll happen. Give it time.
rowdy_1c@reddit
I mean yeah I suppose you can fit model weights into BRAMs, but as your LLM said it would be a heavily quantized small model. So if you get a really high token rate but your model is lobotomized, how helpful really is that?
SOCSChamp@reddit
I had a similar thought as well when I heard about taalas and tried the demo. I don't think your gemini solution is really what you'd want to do here, taalas actually burned the weights onto the chip.
If you're experimenting with this, you'd want to try building out a small test case with something like gpt 2 and rent an AWS FPGA instance for a few bucks an hour when you're ready to do test runs. I would start with fresh conversations across different frontier models and be very clear about what the goal is, not, "just use an FPGA somehow"
VastCarry@reddit
Had the same thought. Idk why Gemini brought up RAM, etc., when the Taalas solution is essentially ROM. With an FPGA you’re going to be very, very limited as to the size of that ROM. AFAIK, even with an expensive FPGA, you’d be looking at a few MB max.
sersoniko@reddit
This will never work, or at least not in any way that is comparable to how Taalas HC1 really works.
The amount of LUT, DSP and logic gates required to “burn” an LLM on a chip is way too much for what any FPGA can provide. And this is what Taalas did, they baked the model into silicon.
What you can do with an FPGA is to design some sort of tensor core, NPU or matrix multiplication pipeline that can process model weights loaded into memory.
This would be similar to how a GPU works, and would be completely different from the Taalas approach: they didn’t design any general-purpose core but instead wrote the actual model in HDL.
KvAk_AKPlaysYT@reddit
Qwen 3.6 27BTi - $999.99
Porespellar@reddit (OP)
LOL that would be amazing! I would be all over that.
nabil_t@reddit
Interesting idea. 9.5k 'poor man' solution gave me a chuckle, but I guess nothing is cheap in the world of AI these days.
Even if the 'speculative decoding' part works, you're still using a highly quantized model, right? If so, that wouldn't be that useful for agentic tasks. Seems less risky to just buy an RTX PRO 6000 at that point.
woadwarrior@reddit
FWIW, the Taalas implementation is also “aggressively quantized, combining 3-bit and 6-bit parameters”.
Porespellar@reddit (OP)
I'm definitely curious what the price point of the Taalas HC1 board is going to be when it's released. Who knows, maybe it'll be super cheap. I just hope they have a better model than Llama 3.1 on it when it finally ships.
NoFaithlessness951@reddit
My understanding was that their demo board would cost them on the order of $400-$1,000 per card to mass produce, then add some margin for them on top of it.
Their first "real" product will likely be a card with a newer 20-40B parameter model, which would cost them on the order of $1.5k to $5k to produce depending on scale, architecture, etc. Whether they're going to sell it to anyone (even enterprise), that idk, but they'll at least provide API access to (some) customers.
I'm largely pulling these numbers out of my ass so take this with a grain of salt.
nabil_t@reddit
Yeah, it will be interesting to see if it will be available to retail consumers. It seems like they are working with data centers right now since that's where the money is.
tamerlanOne@reddit
Taalas.com makes sense for cars, drones, or robots that need extremely fast inputs.
For general use, anything beyond 250 tokens/second already looks like overkill for most consumer applications. At that point it's better to have half the tokens but on a bigger, better-performing model.
Thrumpwart@reddit
You may be able to use autoresearcher to iteratively test the programming of the FPGA gates. This could theoretically speed up programming it significantly.
MelodicRecognition7@reddit
ask Gemini Pro to write the FPGA code for you, if it fails then you're out of luck as nobody else will do that.
Porespellar@reddit (OP)
Yeah, I know Gemini is probably trolling me with a "cool story bro", but I just thought I would see if anyone is messing with FPGAs to do anything interesting related to inference. I don't mind opening myself up to ridicule if it sparks lively discussion :)
youcloudsofdoom@reddit
Ok