FPGAs for speculative decoding
Posted by dp3471@reddit | LocalLLaMA | View on Reddit | 5 comments
Anyone who knows stuff about FPGAs:
- What's the max model size one can be designed for? I've read 20-30M parameters max - is it possible to go bigger if quantized, at a reasonable price?
- Taalas - is what they're doing with ASICs more viable? (Rumored Qwen 27B at 10k tok/sec on apparently <$800 of hardware.)
Would speculative decoding work here? Are there other strategies that would be better, if the smaller model generates tokens at 100x the speed?
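For context, the core of speculative decoding is a propose-then-verify loop: a cheap draft model guesses k tokens, and the big model checks them (in a real system, all k checks happen in a single batched forward pass, which is where the speedup comes from). Below is a minimal greedy sketch; `draft_model` and `target_model` are hypothetical toy stand-ins, not real LLM APIs.

```python
# Toy stand-ins for real models: each maps a context (list of token ids)
# to the next token id. Hypothetical placeholders for illustration only.
def draft_model(context):
    # Cheap model: fast, usually agrees with the target.
    return (sum(context) * 3 + 1) % 10

def target_model(context):
    # Expensive model, treated as ground truth; deterministically
    # disagrees with the draft at some positions.
    if len(context) % 3:
        return (sum(context) * 3 + 1) % 10
    return (sum(context) * 3 + 2) % 10

def speculative_step(context, k=4):
    """One round of greedy speculative decoding.

    The draft proposes k tokens autoregressively; the target verifies
    them. Returns the tokens actually committed this round (between 1
    and k+1 of them).
    """
    # 1. Draft proposes k tokens (cheap).
    proposal = []
    ctx = list(context)
    for _ in range(k):
        t = draft_model(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. Target greedily verifies each position.
    committed = []
    ctx = list(context)
    for t in proposal:
        expected = target_model(ctx)
        if expected == t:
            committed.append(t)
            ctx.append(t)
        else:
            # Mismatch: commit the target's own token and stop the round.
            committed.append(expected)
            break
    else:
        # All k accepted; the target's verification pass yields one bonus token.
        committed.append(target_model(ctx))
    return committed
```

The key point for the FPGA question: even with a 100x-faster draft, the end-to-end speedup is capped by the acceptance rate and by the cost of the target's verification passes, so a fast draft alone doesn't remove the need for big-model compute.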
Thanks!
No-Refrigerator-1672@reddit
I'd say you should forget about speculative decoding with dedicated draft models and not pour your time into outdated technology. Qwen 3.5 models of all sizes include a dedicated output layer for MTP (multi-token prediction) that was trained in conjunction with the model itself - and it achieves better acceptance rates with less overhead. That's the path forward; a separate small LLM as a draft model is a relic of the past.
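To illustrate the MTP idea this comment describes: instead of running a second model, K lightweight heads sit on top of the main model's hidden state and predict offsets t+1..t+K from a single trunk forward pass. A conceptual sketch with random stand-in weights (toy sizes, not Qwen's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, VOCAB, K = 8, 16, 3  # toy sizes; real models are far larger

# K lightweight projection heads sharing one trunk hidden state; each
# head predicts a different future offset. Weights are random stand-ins.
heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(K)]

def mtp_predict(hidden_state):
    """Greedily predict K draft tokens from a single trunk pass."""
    return [int(np.argmax(hidden_state @ W)) for W in heads]

h = rng.standard_normal(HIDDEN)   # stand-in for the trunk's last hidden state
draft = mtp_predict(h)            # K candidate tokens, nearly free vs. a draft LLM
```

The draft tokens then go through the same verify-and-commit step as classic speculative decoding; the difference is that the proposals cost a few matrix multiplies instead of K forward passes of a separate model.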
ortegaalfredo@reddit
The biggest ones have about 400Mb, but they're about USD 10k each, so you're better off buying an RTX 6000.
shing3232@reddit
I would imagine that you can run dflash on NPUs
Queasy-Contract9753@reddit
I'm far from an expert and could be wrong, but my understanding is you'd be hard pressed to fit many GBs in a current FPGA. Taalas was a custom-designed chip, and even the company doesn't seem to have made newer versions since Llama 3 8B; on their demo site the context is only 6k, implying it doesn't fit much on the chip.
Song-Historical@reddit
HILOS and Hillinfer with SmartSSDs are slept on for KV cache offloading and acceleration.