Which LLM (or SLM?) can I use as a benchmark to target resource-constrained edge devices? (INT8 quantised, 100M-200M parameters)
Posted by neuroticnetworks1250@reddit | LocalLLaMA | 7 comments
I am currently building on an open-source repo that has a RISC-V controller, a vector unit, and now a tightly coupled matrix unit as well. I might also try to add a dedicated Softmax unit if the RVV instructions for Softmax become a bottleneck. Is there a list of models, perhaps on Hugging Face, that we can use as benchmarking options? Associated papers would be good.
OkAssistance7886@reddit
For that size range, probably look at TinyLlama-style benchmarks, SmolLM, MobileLLM, small Qwen variants, and older distilled models, then compare tokens/sec, memory use, and accuracy after INT8 quantisation. Since your target is edge hardware, the raw benchmark score might matter less than how cleanly the model maps to your vector and matrix units.
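For comparing memory use and tokens/sec across candidates, a rough back-of-the-envelope sketch may help. It assumes decoding is memory-bandwidth bound, so every weight is read once per generated token; that is a common approximation, not a measurement. The function names and the 1 GB/s bandwidth figure are hypothetical placeholders, not numbers from any real device.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8


def decode_tokens_per_sec_bound(n_params: int,
                                bits_per_weight: float,
                                mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token requires streaming all weights once.

    Assumption: memory bandwidth, not compute, is the limiting factor.
    """
    return mem_bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)


if __name__ == "__main__":
    # Hypothetical example: a 135M-parameter model at 8 bits per weight
    # on a device with 1 GB/s of usable memory bandwidth (made-up figure).
    size = weight_bytes(135_000_000, 8)
    bound = decode_tokens_per_sec_bound(135_000_000, 8, 1e9)
    print(f"weights: {size / 1e6:.0f} MB, decode bound: {bound:.1f} tokens/s")
```

Measured throughput on real hardware will usually land below this bound, so it is mainly useful for ranking candidate models before running them.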
ffgnetto@reddit
Gemma 3 270M IT
Chromix_@reddit
Falcon-H1-Tiny-90M, which is also available as a reasoning model. Bring that down to Q8 (and maybe, maybe Q4) and you have something nice and small that gives you tokens per second instead of seconds per token. There's also a variant optimized for tool calling, which might be preferable for some scenarios on these tiny devices.
neuroticnetworks1250@reddit (OP)
Since I’m a beginner, I just want to clarify. Q8 means an 8-bit fixed-point quantized integer, right? Not some FP8 or something? I intend to run the matmul operations on my matrix unit and then have the vector unit do the requantisation from the 32-bit outputs back to INT8.
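For reference, the flow described here (INT8 matmul, 32-bit accumulation, requantisation back to INT8) can be sketched as below. As far as llama.cpp's GGUF formats go, Q8_0 stores signed 8-bit integers in blocks of 32 with one FP16 scale per block, so it is integer data with a per-block scale rather than FP8. The sketch assumes simple symmetric per-tensor scales instead, and all function names, scales, and the multiplier/shift values are hypothetical choices for illustration.

```python
def clamp_int8(x: int) -> int:
    """Saturate to the signed 8-bit range."""
    return max(-128, min(127, x))


def quantize(values, scale):
    """Symmetric quantisation: q = clamp(round(x / scale))."""
    return [clamp_int8(round(v / scale)) for v in values]


def int8_dot(a_q, w_q):
    """Integer dot product; in hardware this would be a 32-bit accumulator."""
    return sum(a * w for a, w in zip(a_q, w_q))


def requantize(acc, in_scale, w_scale, out_scale):
    """Floating-point reference: rescale the accumulator to the output scale."""
    return clamp_int8(round(acc * in_scale * w_scale / out_scale))


def requantize_fixed(acc, multiplier, shift):
    """Integer-only version: multiply, add half an LSB for rounding, shift right.

    multiplier is assumed to be round(in_scale * w_scale / out_scale * 2**shift).
    """
    return clamp_int8((acc * multiplier + (1 << (shift - 1))) >> shift)


if __name__ == "__main__":
    scale = 0.01  # hypothetical scale shared by inputs, weights, and outputs
    a_q = quantize([0.5, -1.0, 0.3], scale)
    w_q = quantize([1.0, 0.5, -0.5], scale)
    acc = int8_dot(a_q, w_q)
    mult = round(scale * scale / scale * 2**24)
    print(acc, requantize(acc, scale, scale, scale), requantize_fixed(acc, mult, 24))
```

The integer-only variant is the one that maps naturally onto a vector unit, since it needs only a multiply, an add, a shift, and a saturating clamp per element.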
GrokiniGPT@reddit
I hope you don't have him generating more than the letter "a", because you can't do anything with a 0.2B parameter model.
neuroticnetworks1250@reddit (OP)
Lol. Token generation has more to do with the hardware than the model itself, right? I actually tried out smollm-2-135M-Instruct-q8 on llama.cpp and it was kind of decent for coding despite being absolute bull for anything else.
GrokiniGPT@reddit
No... imagine a toddler's knowledge vs. an adult's knowledge... that's the difference.