Jetson Orin Nano 8GB -- model speed benchmarks
Posted by Forward_Fox1466@reddit | LocalLLaMA | 5 comments
I’ve been building a fully local voice assistant on the Orin Nano 8GB.
These benchmarks may be of interest to others working with small language models on constrained hardware:
| Engine | Mean TTFT | p95 TTFT | tok/s |
|---|---|---|---|
| llamacpp:Granite 3.3-2B | 0.09s | 0.20s | 25.4 |
| llamacpp:Granite 4.0 Micro IQ4 | 0.10s | 0.22s | 24.3 |
| llamacpp:Granite 4.0 Micro | 0.11s | 0.23s | 18.9 |
| llamacpp:Granite 4.0 H-Micro | 0.13s | 0.32s | 17.6 |
| llamacpp:Qwen3-4B | 0.17s | 0.30s | 15.1 |
| ollama:Granite 3.3-2B | 0.23s | 0.33s | 25.8 |
| llamacpp:Qwen3.5-2B | 0.32s | 0.51s | 25.1 |
| ollama:Granite 4-3B | 0.36s | 0.47s | 18.5 |
| ollama:Qwen3-4B | 0.51s | 0.65s | 15.5 |
| ollama:Llama 3.2-3B | 0.53s | 0.61s | 19.1 |
| ollama:Ministral-3 3B | 0.59s | 0.73s | 19.5 |
| ollama:Nemotron-3 Nano 4B | 1.02s | 1.56s | 15.6 |
| ollama:Qwen3.5-2B | 1.03s | 1.31s | 22.2 |
Still a work in progress, especially around barge-in during TTS playback.
Repo: https://github.com/aschweig/jetson-orin-kian
There are also some qualitative benchmarks and more detail in the PDF.
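The TTFT and tok/s columns above can be reproduced with a small timing helper around any streaming generation loop (e.g., llama-cpp-python with `stream=True`). This is a sketch, not the OP's actual harness (which is in the repo); `stream_stats` and `fake_stream` are hypothetical names.

```python
import time

def stream_stats(token_stream):
    """Measure time-to-first-token and decode rate for any token iterator.

    Timing starts when this function is called, so pass in a lazy
    generator (e.g. llama-cpp-python's stream=True chunks).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first is None:
            first = now  # first token arrived: this gap is the TTFT
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("nan")
    # Decode rate over the tokens after the first, as is conventional.
    tok_s = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft, tok_s

# Quick demo with a fake stream that emits a token every 10 ms:
def fake_stream(n=20, delay=0.01):
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tok_s = stream_stats(fake_stream())
```

In practice you would run this over many prompts and report the mean and p95 of the collected TTFT values, as in the table above.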
bitplenty@reddit
Did you only test llama.cpp? I'm curious whether you have ever gotten vLLM working for any of these? For me it always fails in practice.
Forward_Fox1466@reddit (OP)
ollama and llama.cpp via llama-cpp-python -- so far.
No_Fee_2726@reddit
I have messed around with these for edge projects and it is definitely a balancing act. Memory bandwidth is the bottleneck far more than compute. If you stick to smaller quantized models and keep the context window tight, you can get surprisingly usable token generation rates. It is a fun challenge to optimize for, but honestly, if you are just trying to get chat working you will spend more time fighting system memory usage than actually running models.
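The bandwidth-bound point can be sanity-checked with a back-of-envelope estimate. The numbers below are assumptions, not from the thread: the Orin Nano 8GB's LPDDR5 is roughly 68 GB/s, a ~2B-parameter model at ~4.5 bits/weight occupies ~1.1 GB, and each generated token streams the full weight set once.

```python
# Back-of-envelope decode-speed ceiling for a memory-bandwidth-bound model.
bandwidth_gb_s = 68.0        # assumed Orin Nano 8GB memory bandwidth
params_b = 2.0               # billions of parameters (e.g. a 2B model)
bytes_per_param = 4.5 / 8    # ~4.5 bits/weight for a 4-bit quant with overhead
weights_gb = params_b * bytes_per_param   # ~1.1 GB of weights per token
max_tok_s = bandwidth_gb_s / weights_gb   # theoretical ceiling, ignores KV cache
```

This gives a ceiling of roughly 60 tok/s, so the ~25 tok/s observed for the 2B models in the table is on the order of 40% of the theoretical bandwidth limit, consistent with generation being memory-bound rather than compute-bound.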
Forward_Fox1466@reddit (OP)
For experimenting and benchmarking I had to drop the page cache between model loads to accommodate larger models: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`. I also tried to increase the amount of contiguous memory by tweaking sysctl, e.g., raising `vm.min_free_kbytes` and `vm.vfs_cache_pressure`, which I think may help in some cases. Overall, for a single model load, ollama seemed more reliable but slower than llama-cpp-python.
No_Fee_2726@reddit
That's a different perspective tbh, I hadn't given this a single thought.
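The VM tweaks the OP mentions can be persisted as a sysctl config fragment instead of set ad hoc. This is a sketch only: the filename is hypothetical and the values are illustrative, not tuned for the Orin Nano; `echo 3 > /proc/sys/vm/drop_caches` remains a one-shot command run between model loads, not a persistent setting. Apply with `sudo sysctl --system`.

```
# /etc/sysctl.d/99-jetson-llm.conf  (hypothetical filename; example values)

# Keep a larger reserve of free pages so big contiguous allocations
# (e.g. mmap'ing a multi-GB GGUF) are more likely to succeed.
vm.min_free_kbytes = 65536

# Reclaim dentry/inode caches more aggressively than the default (100).
vm.vfs_cache_pressure = 200
```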