Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip).
Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 71 comments
TL;DR - Remembered FPGA PCI boards being a big thing from my crypto days. Wondered if an AMD Alveo V80 FPGA card could be used to approximate the performance of a Taalas HC1 (LLM-on-a-chip). Ran the idea past Gemini Pro for a feasibility / sanity check. It suggested what seemed to be a speculative decoding type of setup on the FPGA and said I might be able to get to 3,200 tk/s with a Q4 of Qwen3.5 4b, or maybe 1,400 tk/s with 9b. Not Taalas HC1 speeds, but still pretty fast (potentially). Posting here to see if anyone has already tried anything like this. AMD V80 FPGAs cost like $9,500 USD btw.
So I saw that company Taalas was burning the weights of Llama 3.1 8b onto a chip and getting a ridiculous 15,000 tk/s. I don’t know when they are going to actually release this hardware, but it got me thinking. Obviously when you burn the model weights onto the chip, that’s it, you’re stuck using those weights forever, but if it’s a good model then it’s still pretty awesome!
Back in my crypto mining days, I had a Bitmain L3+ ASIC miner, which had the scrypt algo essentially burnt into a bunch of chips so it could mine scrypt really fast. You could also scrypt mine with a GPU of course, but it was super slow compared to an ASIC miner, for the same reason a Taalas chip would beat a GPU at serving a hard-coded model.
People got tired of ASIC miners because they were only good at one algo, and when a more efficient one came out, the miner you bought 6 months ago was now a useless paperweight. Then came Field Programmable Gate Arrays (FPGAs), which were supposedly going to be the best of both worlds: better speed than mining with a GPU, but reprogrammable, so you weren’t stuck with one algo burnt onto the chips like with an ASIC. Essentially a reprogrammable ASIC. This was about the time ETH went to proof-of-stake and crypto mining kind of took a nosedive and died off. But I was probably months away from buying a Xilinx Alveo U55C (16GB HBM) PCIe FPGA. I didn’t end up buying it, but I kinda wish I had now, for the following reason:
When I heard about the Taalas LLM-on-a-chip getting an insane 15,000 t/s, I thought: why doesn’t someone just do the same thing with an FPGA PCI card? Could you use an FPGA card like the Xilinx U55C to kind of do what Taalas did, but burn the weights into the FPGA’s HBM, so you get a similar speed boost as if the weights were burnt onto a chip, while keeping the advantage of an FPGA’s reprogrammability?
I ran this idea by Gemini Pro to get a sanity check. The Alveo U55C was too limited in its memory, but the AMD Alveo V80 (I guess AMD bought out Xilinx) has some intriguing specs that Gemini seemed to think might actually have a small chance of being used in a way that could accelerate inference beyond current consumer GPU speeds (not on par with Taalas, but faster than a GPU). Here is the direct response from Gemini:
Question: Thought experiment: how would you best use the Alveo V80 to accelerate inference to a level close to the Taalas? Do your best to come up with something new that would accelerate inference. Give me the overview of what you would do (not the code).
This is exactly the kind of hardware puzzle I love diving into. To get an Alveo V80 to chase the Taalas HC1, we have to stop treating the V80 like a GPU and start exploiting its true nature: spatial computation and programmable memory controllers. Since we can't eliminate the High-Bandwidth Memory (HBM) wall entirely like Taalas does, our thought experiment needs to focus on effectively making the HBM invisible.
Here is an architectural overview of how I would build a V80 inference engine to bridge that gap. We’ll call this concept the Dual-Tier Speculative Fabric with Dynamic Activation-Routed Fetching (DARF).
- The "Taalas-Lite" SRAM Draft Core
The V80 has roughly 673 MB of internal UltraRAM and Block RAM. We dedicate this entirely to a highly distilled, ultra-quantized "draft" model.
The Concept: We train a tiny (e.g., 500M parameter) version of our target LLM and quantize it down to 1-bit or 1.5-bit (ternary) weights. This fits completely inside the on-chip SRAM.
The Execution: Because all weights are on-chip, this draft model operates exactly like the Taalas HC1. It suffers zero HBM fetch latency and can spit out speculative tokens at blistering speeds using the Versal AI Engines.
- Dynamic Activation-Routed Fetching (DARF)
This is where we solve the HBM bottleneck for the main, highly accurate model (e.g., an 8B parameter model) stored in the 32GB of HBM. FPGAs excel at scattered, irregular memory access—something that absolutely chokes GPUs.
The Concept: LLMs have massive activation sparsity; during inference, many neurons in a layer simply output zero and don't contribute to the next layer.
The Execution: We program custom memory controllers in the V80's FPGA fabric. Before a layer's computation begins, a tiny, predictive routing algorithm looks at the incoming token and determines which specific weights are actually needed. Instead of blindly streaming the entire 8B model matrix from HBM (which limits throughput), the custom memory controller performs fine-grained "scatter-gather" reads, fetching only the 15% to 20% of the weights that matter. By fetching 5x less data, we effectively multiply our 820 GB/s HBM bandwidth to act like a 4 TB/s connection.
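(Not part of Gemini’s answer, just my rough illustration of the scatter-gather idea in plain C++; the function and names are invented, and on the real card this would be hardware in the custom memory controllers, not host code:)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of "activation-routed fetching": a predictor has already
// decided which rows of the layer's weight matrix matter for this token, and we
// gather only those rows from HBM instead of streaming the whole matrix.
std::vector<uint8_t> gather_active_rows(const uint8_t* hbm_weights,        // packed Q4 rows in HBM
                                        std::size_t row_bytes,             // bytes per packed row
                                        const std::vector<int>& active) {  // rows the predictor kept (~15-25%)
    std::vector<uint8_t> on_chip;
    on_chip.reserve(active.size() * row_bytes);
    for (int r : active) {  // one short burst per active row (scatter-gather)
        const uint8_t* src = hbm_weights + static_cast<std::size_t>(r) * row_bytes;
        on_chip.insert(on_chip.end(), src, src + row_bytes);
    }
    return on_chip;  // only the "active" fraction of the layer ever crosses the HBM bus
}
```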
- Asymmetric, Hardware-Baked Quantization
Taalas uses custom 3-bit and 6-bit math. We will take this a step further by tailoring the math to the distribution of the specific model's weights.
The Concept: Standard processors use linear formats (INT4, FP8). We use the V80's Look-Up Tables (LUTs) to create custom Logarithmic Number Systems (LNS) or uneven bit-widths (e.g., 3.2-bit precision).
The Execution: The weights in HBM are packed in this custom format. When DARF streams them onto the chip, they are fed directly into custom-wired DSP slices that are physically synthesized for this exact, bizarre math. This packs significantly more model density into every single HBM read cycle.
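(Again my own toy illustration, not Gemini’s: a 3-bit code indexing a small non-linear codebook instead of a linear scale; the codebook values are invented:)

```cpp
#include <array>
#include <cstdint>

// Toy illustration of table-based, non-linear quantization: each 3-bit code maps
// to a value in a small (here roughly log-spaced, made-up) codebook rather than a
// linear INT scale. On the FPGA this table would be baked into LUTs feeding the DSPs.
static const std::array<float, 8> kCodebook = {
    -1.0f, -0.5f, -0.25f, -0.0625f, 0.0625f, 0.25f, 0.5f, 1.0f
};

inline float decode3bit(std::uint8_t code) {
    return kCodebook[code & 0x7];  // 3-bit index -> non-linear weight value
}
```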
- Continuous Spatial Pipelining
GPUs process models temporally: load Layer 1, compute, store to memory; load Layer 2, compute, store to memory.
The Concept: We treat the V80 not as an ALU, but as an assembly line.
The Execution: We physically map the transformer architecture across the silicon. The HBM streams the custom-quantized weights continuously. Token data flows sequentially from the LUTs (doing layer normalization), into the AI Engines (doing the DARF-filtered matrix math), and out to the next physical sector of the chip. The token data never goes back to external memory until the final prediction is made.
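(My sketch, not Gemini’s: the “assembly line” idea maps roughly onto the Vitis HLS dataflow style, where each stage is its own hardware block and data moves between stages through on-chip FIFOs instead of going back to external memory. The stage bodies here are placeholders, not real transformer math:)

```cpp
#include <hls_stream.h>
#include <ap_int.h>

// Sketch of spatial pipelining: stages run concurrently, activations flow
// through on-chip FIFO streams, nothing round-trips to HBM between layers.
static void norm_stage(hls::stream<ap_int<16>>& in, hls::stream<ap_int<16>>& out) {
    for (int i = 0; i < 128; i++) out.write(in.read());  // stand-in for layer norm
}
static void matmul_stage(hls::stream<ap_int<16>>& in, hls::stream<ap_int<16>>& out) {
    for (int i = 0; i < 128; i++) out.write(in.read());  // stand-in for the DARF-filtered matmul
}
void layer_pipeline(hls::stream<ap_int<16>>& in, hls::stream<ap_int<16>>& out) {
#pragma HLS DATAFLOW
    hls::stream<ap_int<16>> mid("mid");
#pragma HLS STREAM variable=mid depth=64
    norm_stage(in, mid);
    matmul_stage(mid, out);
}
```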
The Result:
We use the SRAM Draft Core to generate 5-10 tokens instantly with zero memory latency (the Taalas party trick). We then use the DARF-enabled main model streaming from HBM to verify those tokens in one shot, but because we are only fetching a fraction of the weights using custom non-linear precision, the verification step takes a fraction of the usual time.
You would still lose to Taalas on raw power efficiency (Joules per token), because powering HBM simply takes more energy than static SRAM. But in terms of Tokens Per Second, this architecture could push the V80 into the same neighborhood.
If you were building this, which specific open-source model architecture (like Llama, Mistral, or a Mixture-of-Experts) do you think would be the best candidate to map onto this kind of highly sparse, physically pipelined FPGA setup?
I responded with: "I was thinking about maybe a Qwen3.5 9b or 4b at Q4 (or whatever would fit). Is this possible and what do you think the tokens per second would be using the architecture you described? "
and it followed up with:
It is absolutely possible, and those model sizes hit the sweet spot for the V80's memory capacity. Qwen models (whether we are talking about the current 2.5 generation or a hypothetical 3.5) have excellent architectural efficiency, making them great targets for this kind of hardware manipulation.
Here is how we would map a Qwen 4B or 9B at Q4 precision onto our hypothetical "DARF" architecture, and the theoretical math to calculate the tokens per second (tok/s).
- The Memory Mapping (The "Fit" Check)
At 4-bit (Q4) quantization, the memory footprints are highly manageable:
Qwen 4B at Q4: Requires roughly 2 GB of memory.
Qwen 9B at Q4: Requires roughly 4.5 GB of memory.
Both of these fit comfortably inside the V80's 32 GB of HBM2e. However, remember that our architecture uses a "Dual-Tier" system. Because the V80 only has ~673 MB of internal SRAM, the 4B and 9B models cannot act as the instantaneous Draft Core.
The Setup: We would put the Qwen 4B or 9B into the HBM as the "Main Validator Model." We would then need to distill a tiny, sub-1-billion parameter Qwen variant (quantized to 1 or 2 bits so it fits under 600 MB) to live permanently on the SRAM as the "Speculative Draft Core."
- The Theoretical Math
In traditional LLM inference (at a batch size of 1), the absolute speed limit is dictated by memory bandwidth. The V80 has a massive HBM bandwidth of 820 GB/s.
Step A: The Standard Baseline (No Custom Architecture)
If we ran the V80 normally, fetching the whole model for every token:
Qwen 4B: 820 GB/s ÷ 2 GB = 410 tok/s
Qwen 9B: 820 GB/s ÷ 4.5 GB = 182 tok/s
Step B: Applying the DARF Multiplier (Sparsity)
Our Dynamic Activation-Routed Fetching intercepts the tokens and only fetches the weights that are mathematically active. If we assume a conservative 75% sparsity (meaning we only need to read 25% of the model from HBM per token), our effective bandwidth multiplies by 4.
Qwen 4B with DARF: 410 tok/s × 4 = 1,640 tok/s
Qwen 9B with DARF: 182 tok/s × 4 = 728 tok/s
Step C: Applying the SRAM Draft Core (Speculative Decoding)
While the HBM is doing its sparse fetching, our ultra-tiny Draft Core on the SRAM is instantly generating 3 to 5 speculative tokens ahead of time with zero memory latency. The Main Model just verifies them in a single batch. A highly tuned speculative decoding setup can yield a 2x to 2.5x speedup. We will use a conservative 2x multiplier.
Qwen 4B Final: 1,640 tok/s × 2 = 3,280 tok/s
Qwen 9B Final: 728 tok/s × 2 = 1,456 tok/s
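(To make the assumptions easy to poke at, here is that same back-of-the-envelope math as a tiny program; every number in it is Gemini’s assumption, nothing is measured:)

```cpp
#include <cstdio>

// Batch-1 decode is assumed to be memory-bandwidth-bound, so
// tok/s ~= bandwidth / bytes read per token, then multiplied by the
// assumed sparsity and speculative-decoding factors.
int main() {
    const double hbm_gbs       = 820.0;  // V80 HBM2e bandwidth, GB/s
    const double sparsity_mult = 4.0;    // assume only ~25% of weights read per token
    const double spec_mult     = 2.0;    // assumed speculative decoding speedup

    const double model_gb[] = {2.0, 4.5};            // ~4B and ~9B at Q4
    const char*  label[]    = {"4B @ Q4", "9B @ Q4"};

    for (int i = 0; i < 2; i++) {
        double baseline    = hbm_gbs / model_gb[i];  // whole model fetched per token
        double with_tricks = baseline * sparsity_mult * spec_mult;
        std::printf("%s: %.0f tok/s baseline -> %.0f tok/s with DARF + speculation\n",
                    label[i], baseline, with_tricks);
    }
    return 0;
}
```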
The Verdict
Using the Alveo V80 with this custom, highly sparse spatial pipeline, you could theoretically achieve ~3,200 tok/s for the 4B model and ~1,400 tok/s for the 9B model at Q4.
While Taalas aims for five-figure token rates, getting 1,400 to 3,200 tok/s on a single piece of commercially available hardware is still face-meltingly fast—easily 10x to 15x faster than running the same models on a high-end consumer GPU.
If we were to actually attempt to synthesize this, the hardest bottleneck wouldn't be the AI math, but writing the custom memory controllers in Verilog/VHDL to handle those scattered reads.
Of course, this could all just be a goading sycophantic hallucination by Gemini Pro, but I thought I would post it here to see if anyone had been looking into anything similar and had tried anything like this.
Now I just need $9500 for an AMD Alveo V80 LOL
guinaifen_enjoyer@reddit
One approach is to convert existing CUDA kernels or OpenCL kernels to HLS C++ using Vitis HLS, which can then run directly on the FPGA.
Most LLMs can do this pretty easily, and it is very easy to set up because you don't need to worry about timings that much for ALUs and MACs.
The next stage would be to convert the working HLS C++ to VHDL or Verilog. This is much harder because you have to deal with memory, ALU/MAC, and Block RAM/dual-port RAM timings. One cycle off and the whole thing doesn't work. You also have to create a scheduler by hand to schedule thousands of ALUs/MACs and dual-port RAM/Block RAM blocks.
Creepy-Bell-4527@reddit
This approach turns the $9500 FPGA into a $60 NPU that's constrained by memory bandwidth.
If you want to get best use of the FPGA, you want to be encoding the weights into the gate array.
Porespellar@reddit (OP)
That was what I was thinking originally, because that’s essentially what an FPGA is, like a big array of virtual reconfigurable transistors, right?
fuckingredditman@reddit
i guess you are probably thinking of something like NOR-flash perhaps, which is being used in some tiny edge-AI use cases because it can compute a forward pass directly using the memory itself as the weights, and IIRC can even be used for "analog AI" style use cases where voltages passing through the memory array + the configuration of each cell sort of form the inference path.
but these use cases are heavily constrained by the size of NOR flash chips that can be produced, which is about 4-8Gb atm (way too small for LLMs)
abu_shawarib@reddit
Current models are too big to fit even on the biggest FPGAs
Porespellar@reddit (OP)
The AMD Alveo V80 has 32 GB of HBM2E (faster than GDDR7) which seems like it could fit some reasonably sized models.
abu_shawarib@reddit
I'm talking about optimizing away operations like multiplication, since you already know the constants (weights) being used. If you're streaming them from memory then it won't be better than a GPU, since you need generic multiply-accumulate again.
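(To illustrate that point with a toy software analogy I made up: a hard-wired constant weight lets the multiplier collapse into shifts and adds, while weights streamed from memory force a generic multiply-accumulate:)

```cpp
#include <cstdint>

// Toy analogy for "optimizing away multiplication" when the weight is a known
// constant: the generic MAC needs a real multiplier, while multiplying by a
// hard-wired 5 reduces to a shift plus an add, which on an FPGA is just wiring
// and adders. Streaming weights from memory forces the generic path every time.
int32_t mac_generic(int32_t acc, int32_t x, int32_t w) {
    return acc + x * w;           // weight unknown until runtime -> real multiplier
}
int32_t mac_const5(int32_t acc, int32_t x) {
    return acc + ((x << 2) + x);  // weight fixed at 5 -> shift + add, no multiplier
}
```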
Creepy-Bell-4527@reddit
I think you may be confusing GDDR7s per pin bandwidth with total system bandwidth. Some GDDR7 configurations exceed 1.5TB/s. Whereas the V80 is roughly the same as an M3 Ultra’s memory bandwidth.
conockrad@reddit
Exactly! And an M3 Ultra 256GB is even cheaper
Porespellar@reddit (OP)
Ahhhh, I see, I just compared the two: the V80 is like 820 GB/s where the RTX 6000 Pro is 1.7 TB/s. So the V80 is more on par with a 3090 or 4090 in terms of memory bandwidth.
beryugyo619@reddit
No, that RAM is useless for recreating Taalas. The GGUF lives in the LUTs in the Taalas architecture. It has no RAM. Whatever amount of RAM is on that thing is useless for that purpose.
Local_Phenomenon@reddit
My Man
JoshuaLandy@reddit
I loved this question. Did a little poking. According to the official product page, AMD’s Alveo V80 has 673 Mb of on-chip embedded memory: 132 Mb Block RAM + 541 Mb UltraRAM. Mb, and sadly not MB. So megabits, not megabytes. Divide by 8, sigh, and get 84MB.
Porespellar@reddit (OP)
The V80 does have 32 GB of HBM2E which is supposed to be faster than the GDDR7 that is used in a 5090 or an RTX 6000 Pro, so it’s got that going for it I guess.
IngeniousIdiocy@reddit
that is old HBM… the 5090 / RTX 6000 PRO has 2x the bandwidth of that old HBM
Local_Phenomenon@reddit
Wink Wink
Opteron67@reddit
Another LLM bot shit post. With a quick calculation you would need thousands of cards...
Porespellar@reddit (OP)
Not a bot bro, I’ve been on this sub like almost 3 years now. Yes, I asked Gemini a question about the feasibility, and yes I posted its answer, but that’s the only part of this post that’s AI.
GenLabsAI@reddit
Guys, why is this downvoted? It seems like any time someone suggests they've used AI to help them, people just trash it..
BringMeTheBoreWorms@reddit
Probably because it’s an essay
bigfatstinkypoo@reddit
if you used AI to generate it, you can generate a summary and get it to format your post
conockrad@reddit
I can quantize to fp6 if fp6 is supported at the hardware level. Feel free to shoot me a DM if you’re into this project.
LeoStark84@reddit
No idea about the hardware, but the whole Gemini output sounds like "That's a great idea for a blender! Now let's build the cold-fusion reactor it needs", typical of LLMs.
HomsarWasRight@reddit
Always. I often have it do a little research on something just running through my head and it’s always like “Let’s build a plan!”
I’m like, dude, I just need to satiate my ADHD for five minutes, I’m not launching a startup.
kyr0x0@reddit
🤣 as a guy with diagnosed ADHD I second that
SnotFunk@reddit
I think we need to start a system prompt collection for us ADHD folks.
DunklerErpel@reddit
Prompts on how to avoid being validated for something we do not need a rabbit hole for. Or delusions of grandeur. Or weird, off-putting slop-writing. Or... GAH! LLMs aren't meant for ADHD people if they want anything resembling free time!
shaneucf@reddit
I ran an idea through Gemini and it was like, cool idea. Then I asked it to criticize it, and it was like, no, it's not practical. Better to specifically ask it to be critical.
Qwen30bEnjoyer@reddit
Gemini in particular has no wisdom with these kinds of tasks. See https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html for its performance in being able to detect and truthfully answer flawed questions, and https://eqbench.com/spiral-bench.html for its tendency to reinforce bullshit.
Gemini just really wants to people please and answer questions instead of grounding the conversation in the fundamentals or strict definitions.
No-Refrigerator-1672@reddit
Bullshit bench is so cool, I like it, thank you for the link! It's a shame it doesn't test local models below 200B, but still useful.
LeoStark84@reddit
For measured responses I ask "A coworker said..." or stuff along those lines.
Charming_Dealer3849@reddit
Already doing it, welcome to the darkside
Porespellar@reddit (OP)
Do tell please.
Charming_Dealer3849@reddit
So we have the first conversational AI agent running on an FPGA, using a specific model type that's only compatible with FPGAs. It's still small, 370M parameters, but we are working on a 1.3B model now. After that it gets exciting, will keep you posted. The GPU's monopoly on AI will soon come to an end.
Charming_Dealer3849@reddit
Btw -> using U50 ;) cheap on eBay, all the high frequency firms are offloading them with firmware still installed XD
Local_Phenomenon@reddit
My Man's Man!
Nutsack_VS_Acetylene@reddit
The defense industry has been running ML models on FPGAs for decades now. Be warned, the build process for FPGAs and FPGA design is intense. It is a very different world than regular software development and doesn't nearly have the resources online that software does.
The HBM FPGAs are very powerful, you can push a lot of data through them. More-so intended for things like radar or real time data processing.
For even higher speeds you can encode the weights directly in the FPGA fabric. Now this WON'T work for something the size of an LLM but these deep neural networks are just computational graphs and you can directly reproduce them on die. I've seen this for large perception arrays that need to run at extreme speed.
atape_1@reddit
Programming the thing would be an absolute nightmare. FPGA coding is hard; I've seen less ambitious FPGA projects, involving full teams, fail.
guinaifen_enjoyer@reddit
Converting CUDA and OpenCL kernels to HLS C++ is actually pretty easy.
CUDA:
__global__ void mac_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] += a[i] * b[i];
    }
}

HLS C++:

void hls_mac(float *a, float *b, float *c, int n) {
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        c[i] += a[i] * b[i];
    }
}

jazir55@reddit
Aren't we at the point where we can just point an agent at it and have them port it themselves without manual intervention given the prompt was good enough?
quinn50@reddit
Yea and no, models are definitely good at translating from lang to lang but I fear many datasets most likely have a severe lack of FPGA programming knowledge.
If you took the time to fine tune the model with a good dataset of docs and projects it may be doable
No-Refrigerator-1672@reddit
I've tried experimenting with Verilog and Qwen 3 Coder Next, back when Coder Next was the newest hottest thing. It failed miserably even on the simplest tasks. Proprietary LLMs may be better at this, but local models aren't really trained on FPGA programming.
d1722825@reddit
I don't think so.
HLS can be as fast as VHDL or Verilog, but you need to use it in a completely different way than how you would write C++ code for a CPU or even a GPU.
Your system probably would need a completely new architecture with a different approach to solve specific problems. If you have that, doing it in HLS or a HDL is not that different, HLS just being a bit higher level.
Also I think this is not the only issue with the idea.
agent00F@reddit
This is very much a project where if you have to ask then no it's impossibly ambitious.
MadGenderScientist@reddit
I'd be absolutely thrilled to program it. I've only done a bit of actual SystemVerilog, but I've done program synthesis (the old-school way - with AnsProlog and other symbolic methods + constraint optimization) and graph algorithm design in the past. I'm salivating at the opportunity to play with exotic hardware and make it sing.
cafedude@reddit
Yes, FPGA coding is very different from what you're used to. I've had pretty good success with getting Claude to translate algorithms to Verilog code, though, and I suspect it could be helpful here.
GravitasIsOverrated@reddit
Yeah, this was my first thought: my understanding is you’d need to translate the tensor file into a hardware description language, which is basically the hard part of Taalas’s business. Given they have a team of very smart people working on this problem, it’s unlikely to be within reach of an individual.
No_Conversation9561@reddit
Implementing an LLM in RTL requires some serious RTL design skill. Do you have such skill, or are you planning to hire some people for it?
DeepOrangeSky@reddit
What about the opposite end of the spectrum, do you have any ideas on how to get SSD inference to run faster, if someone is using a few sticks of high speed NVMe in combination with each other with raid 0, and maybe trying to use it in combination with a GPU to run MoEs at usable speeds somehow? Are there any bottlenecks that people take for granted that could somehow be gotten around? Or is the NVMe itself so slow that it doesn't matter what you do, there's just no way to get DeepSeek to go at like 20+ tk/s with an SSD setup?
VergeOfTranscendence@reddit
There is a guy doing this already with an FPGA that costs 250-350 dollars, but he tried to train a distilled version of Gemma 4. https://github.com/n57d30top/gemma4-on-FPGA
I think the sweet spot would be running a quantized 0.8b or 1.5b version of Qwen 3.6 (or 3.5) on the FPGA and using the verifier model on a regular GPU running Qwen 27b or 35b A3B. There is a math formula for how much actual speedup you can get from speculative decoding (and speculative speculative decoding); if you ask an AI model, it will quickly fetch you the article with the math for it, and you can ask it to plug in some numbers for you.
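(For reference, I believe the formula being alluded to is the expected speedup from the original speculative decoding paper (Leviathan et al., 2023), in terms of the acceptance rate α, the number of drafted tokens γ, and the relative cost c of a draft-model step versus a target-model step:)

```latex
% Expected speedup from speculative decoding (Leviathan et al., 2023):
% alpha = per-token acceptance rate, gamma = draft tokens per step,
% c = cost of one draft-model step relative to one target-model step.
\text{speedup} \;=\; \frac{1 - \alpha^{\gamma + 1}}{(1 - \alpha)\,(\gamma c + 1)}
```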
Definitely have a look at that repo, if it were easier to buy that FPGA in Brazil, I would surely have bought it to test.
pmttyji@reddit
When are we gonna get affordable ($499-999) LLM burners with big/large models (500B-1T) that return 20-50 t/s?
Pleasant-Shallot-707@reddit
Taalas’s concept seems silly.
Rustybot@reddit
16000 tokens per second per user is a crazy output.
d1722825@reddit
I don't think that is true. Irregular memory access is an issue of how DRAMs work, and not an issue of FPGAs or CPUs or GPUs.
FPGAs can access their internal memory (block RAM, distributed RAM) much better than CPUs or GPUs can access their caches, but you only have a few MiB of that at most.
Also, for a 9B model at 182 t/s, if you only need a single multiplication per parameter per output token, that would already eat up nearly all your multipliers in the FPGA. For 1456 t/s you would need 6 times more.
I haven't found the spec, so I assumed a 200 MHz clock for the DSP slices (that may be a bit conservative for newer FPGAs). Technically you could probably use a single multiplier to do maybe 4 different 4-bit multiplications at the same time. I'm not familiar with how inference computation scales, but I suspect it needs more than 1 multiplication per parameter per token.
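(Spelling that estimate out under the same assumptions of one MAC per parameter per output token and a 200 MHz DSP clock:)

```latex
% Rough DSP budget implied above, assuming 1 MAC per parameter per output token
% and a 200 MHz DSP clock (i.e. 2e8 MAC/s per DSP slice):
9 \times 10^{9}\ \text{params} \times 182\ \text{tok/s} \approx 1.6 \times 10^{12}\ \text{MAC/s}
\qquad\Rightarrow\qquad
\frac{1.6 \times 10^{12}}{2 \times 10^{8}} \approx 8{,}200\ \text{DSP slices busy every cycle}
```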
Double_Cause4609@reddit
A) I'm actually not sure that Gemini's plan makes sense. I'd have to think about it, but I think something is getting missed here.
B) The reason Taalas works the way it does is it doesn't have centralized memory (like HBM). You can't really do the same thing they're doing with a general purpose processing unit. It's not really "like Taalas" in any real way. This is completely its own thing.
C) The reason Taalas is appealing is that read-only burnt transistors are way cheaper than central general purpose memory. You don't get the benefits of that on an FPGA (you're still paying for the full possibility space).
The only viable way I see this working is maybe with some sort of weight re-use scheme like recovering a relaxed universal transformer with layerwise LoRAs or something where the LoRAs are read from HBM, and even then, I'm not sure any FPGA out there has enough capacity in transistor count to run a single layer in this way.
D) In verdict, even if you *could* do this with an FPGA, it still wouldn't make sense. The fixed function cards are so much cheaper (even for larger models) that you would feel stupid buying an FPGA to do this, even if it did work. Just give it a year and a half and see how the fixed function cards are working out. The industry will bring solutions. People want to run models efficiently. It'll happen. Give it time.
rowdy_1c@reddit
I mean yeah I suppose you can fit model weights into BRAMs, but as your LLM said it would be a heavily quantized small model. So if you get a really high token rate but your model is lobotomized, how helpful really is that?
SOCSChamp@reddit
I had a similar thought as well when I heard about taalas and tried the demo. I don't think your gemini solution is really what you'd want to do here, taalas actually burned the weights onto the chip.
If you're experimenting with this, you'd want to try building out a small test case with something like gpt 2 and rent an AWS FPGA instance for a few bucks an hour when you're ready to do test runs. I would start with fresh conversations across different frontier models and be very clear about what the goal is, not, "just use an FPGA somehow"
VastCarry@reddit
Had the same thought. Idk why Gemini brought up RAM, etc., when the Taalas solution is essentially ROM. With an FPGA you’re going to be very, very limited as to the size of that ROM. AFAIK, even with an expensive FPGA, you’d be looking at a few MB max.
sersoniko@reddit
This will never work, or at least not in any way that is comparable to how Taalas HC1 really works.
The amount of LUT, DSP and logic gates required to “burn” an LLM on a chip is way too much for what any FPGA can provide. And this is what Taalas did, they baked the model into silicon.
What you can do with an FPGA is to design some sort of tensor core, NPU or matrix multiplication pipeline that can process model weights loaded into memory.
This would be similar to how a GPU works, and would be completely different from the Taalas approach: they didn’t design any general-purpose core but instead wrote the actual model in HDL.
KvAk_AKPlaysYT@reddit
Qwen 3.6 27BTi - $999.99
Porespellar@reddit (OP)
LOL that would be amazing! I would be all over that.
nabil_t@reddit
Interesting idea. 9.5k 'poor man' solution gave me a chuckle, but I guess nothing is cheap in the world of AI these days.
Even if the 'speculative decoding' part works, you're still using a highly quantized model, right? If so, that wouldn't be that useful for agentic tasks. Seems less risky to just buy an RTX PRO 6000 at that point.
woadwarrior@reddit
FWIW, the Taalas implementation is also “aggressively quantized, combining 3-bit and 6-bit parameters”.
Porespellar@reddit (OP)
I'm definitely curious what the price point of the Taalas HC1 board is going to be when it's released. Who knows, maybe it'll be super cheap. I just hope they have a better model than Llama 3.1 on it when it finally ships.
NoFaithlessness951@reddit
My understanding was that their demo board would cost them on the order of $400-$1,000 per card to mass produce, then add some margin for them on top of it.
Their first "real" product will likely be a card with a newer 20-40B parameter model, which would cost them on the order of $1.5k to $5k to produce depending on scale, architecture, etc. Whether they're going to sell it to anyone (even enterprise), that idk, but they'll at least provide API access to (some) customers.
I'm largely pulling these numbers out of my ass so take this with a grain of salt.
nabil_t@reddit
Yeah, it will be interesting to see if it will be available to retail consumers. It seems like they are working with data centers right now since that's where the money is.
tamerlanOne@reddit
Taalas.com makes sense for cars, drones, or robots that need extremely fast inputs.
For general use, anything beyond 250 tokens/second already looks like overkill for most consumer applications. At that point it's better to have half the tokens but on a bigger, better-performing model.
Thrumpwart@reddit
You may be able to use autoresearcher to iteratively test the programming of the FPGA gates. This could theoretically speed up programming it significantly.
MelodicRecognition7@reddit
ask Gemini Pro to write the FPGA code for you, if it fails then you're out of luck as nobody else will do that.
Porespellar@reddit (OP)
Yeah, I know Gemini is probably trolling me with a "cool story bro", but I just thought I would see if anyone is messing with FPGAs to do anything interesting related to inference. I don't mind opening myself up to ridicule if it sparks lively discussion :)
youcloudsofdoom@reddit
Ok