Jetson Orin Nano 8GB -- model speed benchmarks
Posted by Forward_Fox1466@reddit | LocalLLaMA | 5 comments
I’ve been building a fully local voice assistant on the Orin Nano 8GB.
These benchmarks may be of interest to others working with small language models on constrained hardware:
| Engine | Mean TTFT | p95 TTFT | tok/s |
|---|---|---|---|
| llamacpp:Granite 3.3-2B | 0.09s | 0.20s | 25.4 |
| llamacpp:Granite 4.0 Micro IQ4 | 0.10s | 0.22s | 24.3 |
| llamacpp:Granite 4.0 Micro | 0.11s | 0.23s | 18.9 |
| llamacpp:Granite 4.0 H-Micro | 0.13s | 0.32s | 17.6 |
| llamacpp:Qwen3-4B | 0.17s | 0.30s | 15.1 |
| ollama:Granite 3.3-2B | 0.23s | 0.33s | 25.8 |
| llamacpp:Qwen3.5-2B | 0.32s | 0.51s | 25.1 |
| ollama:Granite 4-3B | 0.36s | 0.47s | 18.5 |
| ollama:Qwen3-4B | 0.51s | 0.65s | 15.5 |
| ollama:Llama 3.2-3B | 0.53s | 0.61s | 19.1 |
| ollama:Ministral-3 3B | 0.59s | 0.73s | 19.5 |
| ollama:Nemotron-3 Nano 4B | 1.02s | 1.56s | 15.6 |
| ollama:Qwen3.5-2B | 1.03s | 1.31s | 22.2 |
Still a work in progress, especially around barge-in during TTS playback.
Repo: https://github.com/aschweig/jetson-orin-kian
There are also some qualitative benchmarks and more detail in the PDF.
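The TTFT and tok/s columns above can be reproduced with a small timing helper around any streaming generation loop (e.g., llama-cpp-python with `stream=True`). This is a sketch, not the OP's actual harness (which is in the repo); `stream_stats` and `fake_stream` are hypothetical names.

```python
import time

def stream_stats(token_stream):
    """Measure time-to-first-token and decode rate for any token iterator.

    Timing starts when this function is called, so pass in a lazy
    generator (e.g. llama-cpp-python's stream=True chunks).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first is None:
            first = now  # first token arrived: this gap is the TTFT
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("nan")
    # Decode rate over the tokens after the first, as is conventional.
    tok_s = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft, tok_s

# Quick demo with a fake stream that emits a token every 10 ms:
def fake_stream(n=20, delay=0.01):
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tok_s = stream_stats(fake_stream())
```

In practice you would run this over many prompts and report the mean and p95 of the collected TTFT values, as in the table above.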
bitplenty@reddit
Did you only test llama.cpp? I'm curious whether you have ever gotten vLLM working for any of these? For me it always fails in practice.
Forward_Fox1466@reddit (OP)
ollama and llama.cpp via llama-cpp-python -- so far.
No_Fee_2726@reddit
I have messed around with these for edge projects and it is definitely a balancing act. Memory bandwidth is the bottleneck far more than compute. If you stick to smaller quantized models and keep the context window tight, you can get surprisingly usable token generation rates. It is a fun challenge to optimize for, but honestly, if you are just trying to get chat working you will spend more time fighting system memory usage than actually running models.
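The bandwidth-bound point can be sanity-checked with a back-of-envelope estimate. The numbers below are assumptions, not from the thread: the Orin Nano 8GB's LPDDR5 is roughly 68 GB/s, a ~2B-parameter model at ~4.5 bits/weight occupies ~1.1 GB, and each generated token streams the full weight set once.

```python
# Back-of-envelope decode-speed ceiling for a memory-bandwidth-bound model.
bandwidth_gb_s = 68.0        # assumed Orin Nano 8GB memory bandwidth
params_b = 2.0               # billions of parameters (e.g. a 2B model)
bytes_per_param = 4.5 / 8    # ~4.5 bits/weight for a 4-bit quant with overhead
weights_gb = params_b * bytes_per_param   # ~1.1 GB of weights per token
max_tok_s = bandwidth_gb_s / weights_gb   # theoretical ceiling, ignores KV cache
```

This gives a ceiling of roughly 60 tok/s, so the ~25 tok/s observed for the 2B models in the table is on the order of 40% of the theoretical bandwidth limit, consistent with generation being memory-bound rather than compute-bound.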
Forward_Fox1466@reddit (OP)
For experimenting and benchmarking I had to drop the page cache between model loads to accommodate larger models: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`. I also tried to increase the amount of contiguous memory by tweaking sysctl, e.g., raising `vm.min_free_kbytes` and `vm.vfs_cache_pressure`, which I think may help in some cases. Overall, for a single model load, ollama seemed more reliable but slower than llama-cpp-python.
No_Fee_2726@reddit
That's a different perspective tbh, I hadn't given this a single thought.
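The VM tweaks the OP mentions can be persisted as a sysctl config fragment instead of set ad hoc. This is a sketch only: the filename is hypothetical and the values are illustrative, not tuned for the Orin Nano; `echo 3 > /proc/sys/vm/drop_caches` remains a one-shot command run between model loads, not a persistent setting. Apply with `sudo sysctl --system`.

```
# /etc/sysctl.d/99-jetson-llm.conf  (hypothetical filename; example values)

# Keep a larger reserve of free pages so big contiguous allocations
# (e.g. mmap'ing a multi-GB GGUF) are more likely to succeed.
vm.min_free_kbytes = 65536

# Reclaim dentry/inode caches more aggressively than the default (100).
vm.vfs_cache_pressure = 200
```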