Getting a feel for how fast X tokens/second really is.
Posted by MikeNonect@reddit | LocalLLaMA | View on Reddit | 120 comments
I love following all your adventures with local LLM setups. Quality and size of the models are important, but so is performance. Numbers don't really convey the experienced speed well, however.
If someone claims they run Qwen 3.6-27B at 21 tokens/second, how fast is that? Is 10 tokens/second unusable? I find these numbers objective but meaningless.
I built a script that helps me get a subjective feel for these objective numbers.
It supports text, code and reasoning + code.
Serprotease@reddit
10 tk/s is slow, but I'd argue it represents the bottom edge of usable for thinking models. It's a bit painful for multi-agent stuff/coding but OK-ish for chat.
Below that, you are mostly at the "I'm just happy to be able to run this model on this hardware" level, not the "I can use it to actually do stuff" level.
20-30 tk/s is about the same as you'll see from SOTA models over API. It's quite good.
More than that (90+) and you are at the "don't really need to bother with batch calls" level. It's basically instant.
But in any case, that's only half the equation. Especially running locally, there are ways to speed up token generation (like MTP), and even old hardware can get decent results, but prompt processing is a lot harder to speed up without just buying a better, more expensive GPU.
For example, glm4.7 at 50/25 (prompt processing / token generation, M2 Ultra) is basically unusable despite the 25 tk/s token generation.
The same model at 500/15 (dual GB10) is workable, even if on the slow side.
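To make that concrete, here's a quick back-of-the-envelope sketch. The 8k-token prompt and 1k-token reply are made-up numbers, just to illustrate how prompt processing dominates once the context gets long:

```python
# Rough wait time = prompt_tokens / pp_rate + output_tokens / tg_rate
# (illustrative numbers only; real prompts and replies vary a lot).
prompt_tokens, output_tokens = 8000, 1000

for label, pp_rate, tg_rate in [("M2 Ultra (50/25)", 50, 25), ("dual GB10 (500/15)", 500, 15)]:
    total_s = prompt_tokens / pp_rate + output_tokens / tg_rate
    print(f"{label}: ~{total_s / 60:.1f} min until the reply is finished")
```

With a long prompt, the setup with faster prompt processing finishes sooner even though its generation speed is lower.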
BringTea_666@reddit
>10 tk/s is slow, but I'd argue it represents the bottom edge of usable for thinking models.
No. It's borderline usable for non-thinking mode. For thinking mode, where you sometimes have hundreds to thousands of tokens before the answer, it's unusable: you'll be waiting 5, 10, 20 minutes for an answer.
IrisColt@reddit
I regularly wait 22 minutes to get math problems solved locally, and it's totally worth it.
RevolutionaryLime758@reddit
Learn to do math?
IrisColt@reddit
By the way, today I saw an AI solve locally in literally one minute the same 20-minute problem that was straight-up impossible for the best frontier model in April 2024.
RevolutionaryLime758@reddit
What does this have to do with the fact that you personally can’t do math?
IrisColt@reddit
Actually, I can. I was the one who designed and solved these problems... about 20 years ago, believe it or not, heh
RevolutionaryLime758@reddit
Why would I? You’re trying to tell me how good you think LLMs are at math for no reason like an extreme sperg.
IrisColt@reddit
I feel honored, thanks!
Fugguy@reddit
You did say you like the taste of irony (with a side helping of shredded truth).
Is there any truth to this?
Not trying to be the guy on reddit that says prove it. More like the guy that’s asking for the sauce. The tasty sauce. (I find higher level math quite interesting and am in school for such things right now.) Would love to see what problems these are.
But if linking your work/academic paper/etc would be too revealing for your reddit anonymity, a DM would be appreciated just as well.
IrisColt@reddit
AI is getting incredibly good at math and logical proofs, and I've seen it firsthand. If you're going to be ironic (and I like to be ironic too), at least base it on a shred of truth instead of being totally detached from reality, heh
Klutzy-Snow8016@reddit
Are people really sitting there staring at the screen waiting for the response to come in? You guys gotta learn how to juggle tasks. Or just surf the internet on the side, listen to a podcast, read, do something.
BringTea_666@reddit
>Are people really sitting there staring at the screen waiting for the response to come in?
Yes. If the answer comes in seconds instead of 20 minutes, you don't have to switch to another task while you wait.
a_beautiful_rhind@reddit
Yes, I'm chatting in a back and forth. More than a minute for a reply is too long.
Plus the reply might be meh and then you have to go again.
Ok_Hope_4007@reddit
This. I would also say that it sometimes encourages people to send higher-quality, better-planned requests, which I think is a good thing.
Front_Eagle739@reddit
Right? I have fast agentic models and slow "I'll go put the kettle on and it'll be done later" models.
I don't really give up on a model until it drops below 5 tok/s or so.
Serprotease@reddit
I mean, I make do with this kind of speed for chat/RAG/web search. It's about 3-4 minutes for a long reply including reasoning, but because I read it as it streams, it's still fine.
If you don’t stream it, then yes, it’s too slow.
But I rarely see more than 2-3k tokens for a reply. Unless it's code, of course, but then I'm fully OK with waiting, because it means it's doing work, so I'll be doing something else in the meantime.
soshulmedia@reddit
... which can still be useful if you just communicate asynchronously with your LLM, like through email.
Heck, I think even ~1 tok/s can be okay for a thinking model if running overnight on a couple of e.g. test prompts.
miversen33@reddit
Prompt processing is by far my biggest bottleneck right now. I'd love to see some improvement in that area. I can push 40-60ish t/s on Gemma 4 MoE, but prompt processing brings the entire thing to a crawl.
YetAnotherAnonymoose@reddit
50 is really usable, but 150 feels blazing
hlacik@reddit
I find 20 tok/sec comparable to what ChatGPT or Claude usually gives you. Anything less feels slow to me.
I'm running Qwen3.6 35B at 20-25 tok/sec right now and I'm happy with it.
iezhy@reddit
No, it's not.
My opencode benchmark (building the same small app to a specification) takes 10-15x longer with Qwen3.5-35B locally at 25 tok/s, compared to calling the OpenAI API.
xtekno-id@reddit
Wow, just noticed that 20 tps is actually quite fast 👍🏻
pmarsh@reddit
Great idea! I can now also benchmark myself... Anyone want to share what they feel their tok/sec is?
thatcoolredditor@reddit
Awesome project thanks
LosEagle@reddit
As someone with 16 GB of VRAM who has been running local LLMs for 2.5 years, since before MoE was a thing and when quants were lobotomizing, I learned to get used to 3.10 t/s being usable for some tasks :]
Such_Advantage_6949@reddit
This is the nicest token visualizer I've seen. Well done.
Maleficent-Ad5999@reddit
I tried visualizing 1 t/s and it was... painful
psylenced@reddit
It's still faster than I type!
Such_Advantage_6949@reddit
Prompt processing is even more painful
-p-e-w-@reddit
That’s awesome!
This sub needs a community showcase where such projects are permanently listed so they don’t disappear into obscurity after 3 days.
MikeNonect@reddit (OP)
I agree. Ideally, when talking about the token speed they get on their local hardware, community members should be able to share an easy link: https://mikeveerman.github.io/tokenspeed/?rate=11.7&mode=think&think=6
Ok_Substance2327@reddit
Hm, pretty cool, but I just give it test tasks to complete and observe how fast it feels, judging quality at the same time.
MikeNonect@reddit (OP)
Sure, if you can run the model locally, you know how fast it runs on your hardware.
But if someone claims to be able to run Gemma 4 at 11.7 tokens/second, how slow is that actually? Well, this slow: https://mikeveerman.github.io/tokenspeed/?rate=11.7&mode=think&think=6
SmartCustard9944@reddit
I feel like with current local models 60-100 is the sweet spot.
Faster and you don’t have a chance to catch potential thinking mistakes and such.
Or maybe it’s just a cope because I don’t have a $50k workstation sitting in my office.
sremes@reddit
Does anyone actually read the thinking traces as they come out?
SmartCustard9944@reddit
I do; a bit of a micromanager while I'm assessing the quality of the models.
Orolol@reddit
The more speed you have, the better it is for agentic workloads. For example, when you do some deep research, you need a lot of parallel agents reading, analyzing, and writing down the important information from each document. The faster you are, the more documents you can retrieve.
alphapussycat@reddit
Why would it be better for agentic? Isn't it the opposite? For non-agentic workloads, like Claude Code, having high token generation is important.
Orolol@reddit
For non-agentic use there's always a limit on workload speed: you. If you want to read the output, you'll become the bottleneck quite quickly. For agentic workloads you usually don't read the individual tool calls.
jazir55@reddit
For vibe coding I would argue it's essential. I have caught so many architectural mistakes and bad design decisions, as well as cases where they just go completely in the wrong direction, decide to delete something, refactor it in a bad way, etc. I'm a complete novice at coding: I can somewhat interpret it, but I can't write it myself. I've been learning a lot just by reading the generated code and the reasoning traces. I wouldn't be able to vibe code even a moderately complex project unless I was scanning the outputs constantly.
alphapussycat@reddit
But for agentic workloads you could just start a bunch of tasks, work on your own stuff, and check back in when they're done.
ArtfulGenie69@reddit
You don't need $50k for that kind of speed, even on the 122B Qwen. With four 3090s split over two computers on 2.5G Ethernet, using RPC in llama.cpp and no MTP, I get 800 t/s prefill and 55 t/s generation. With MTP I could hit 100 t/s. Funnily enough, all the hardware was significantly cheaper a year ago, but you could get it done for about $5k today, I would hope.
autisticit@reddit
It's not a cope at all. I totally agree with you. We totally don't need faster speed.
soshulmedia@reddit
/s
MikeNonect@reddit (OP)
Thanks for all the great feedback, everyone! I've shipped several of the features you suggested:
* Natural text: I've merged a PR replacing the lorem ipsum in text mode with a more natural Wikipedia article.
* Agent mode: Simulates an agentic workflow with alternating tool calls and code generation.
* Think length slider: When in think mode, you can now control how many reasoning sentences the model "thinks" before generating code.
* Custom text/code: You can now paste or upload your own text or code and stream it at any speed.
* Token counter: A live count of tokens generated, displayed in the footer.
* Share links: The rate and mode are encoded in the URL, so you can link directly to e.g. "what 10 tok/s looks like in code mode." There is also a share button for this.
Try it out: https://mikeveerman.github.io/tokenspeed/
Successful_Plant2759@reddit
This is useful because tokens/sec only becomes meaningful when paired with task shape. For chat, 10-15 can feel acceptable. For long code diffs or reasoning-heavy output it feels slow because you wait through big preambles. For autocomplete, even high throughput can feel bad if time-to-first-token is high. A separate TTFT slider or display would make the simulator even more practical.
JayPSec@reddit
Very cool!
This is absolutely the kind of stuff we need here. It brings some intuition to an overcrowded number arena. Well done!
FatheredPuma81@reddit
Something to improve this: I think you should add a Think + Output mode with a customizable think-token length. Qwen especially can feel very, very slow at times, because it will sometimes spend 1,000 tokens thinking and other times 30,000 tokens, depending on the input. At 150 t/s the former is more than usable while the latter is... pain.
Also maybe allow us to resize the text window? It's hard to properly get a sense of speed for 500t/s+ with it being so small.
MikeNonect@reddit (OP)
Both good points of feedback! I'll look into it.
metalvendetta@reddit
Great job! Is it limited to local LLM setups, or can it be integrated as an MCP into Claude, Codex, etc.?
MikeNonect@reddit (OP)
It's a simulator. It shows an approximation of what it would feel like to run a model at e.g. 1.4 tokens/sec.
lnris@reddit
What about input tokens too? The user pastes his prompt and sees how much time it will take the model to read it.
MikeNonect@reddit (OP)
I like this idea!
darkoromanov@reddit
Thanks, that's very useful
Enough-Astronaut9278@reddit
For GUI agent stuff, latency matters more than throughput imo. If the model takes 3 seconds to decide where to click next, the whole thing feels broken. At ~70 tok/s on Apple silicon with a 4B quantized VLM, each step is under a second, which is juuust fast enough to not be painful. Still not great, but usable.
No-Upstairs-4031@reddit
Thank you! This is the best visualization I've ever seen.
InvestmentBiker@reddit
Local models are underrated for this.
Not because they replace frontier models, but because a lot of daily workflow tasks don’t need massive cloud inference...
For small repetitive tasks, local + private + cheap may actually be the better direction.
Fringolicious@reddit
This is a brilliant idea, it's hard to get a real idea of what usable looks like and this makes it super obvious
MikeNonect@reddit (OP)
Yes, that said: it's a simulation and there are probably some naive assumptions in the code.
SmartCustard9944@reddit
Next step: it would be interesting to have an agentic simulation that shows PP speed instead. Maybe something as simple as read file, generate code, read file, and so on.
IrisColt@reddit
P-PP speed?
Imaginary-Unit-3267@reddit
Some guys have a faster PP than others.
(it means prompt processing if you're not aware)
IrisColt@reddit
Understood. I was slightly off track for a moment. :)
techdevjp@reddit
Yeah, including the time to first token delay would be a helpful addition too. 10tok/sec might be quite usable in some situations but not if it takes 20sec to get the first token out.
Fringolicious@reddit
Yeah I mean, it's not going to show you exactly in detail what's going on but if I just want a ballpark "How fast is this compared to my reading speed?" it does a great job
Maybe some people might find it useful to be able to upload a sample file containing text / code and have that regurgitated back?
admajic@reddit
30 is way faster than you can read; 40 to 50 is better. Running Qwen3.6 35B at an average of 150, 200 max, is where it's at...
New_Zone5490@reddit
This made me realize I was getting ~0.3 tps when I tried Qwen3.6-27B on my current laptop.
My laptop is an RTX 5070 (mobile version with 8 GB VRAM) + 32 GB RAM + Fedora Linux.
I can't wait to get new hardware.
Dazzling_Equipment_9@reddit
This is great, I love this kind of simple yet practical thing.
LagOps91@reddit
What tokenizer are you using here? Code seems strangely slow in comparison.
MikeNonect@reddit (OP)
It's a simulator, not a real tokenizer. Code feels slower because you're not getting one-word tokens that often.
It's an approximation, of course, so if you have suggestions to improve it, please review the code: https://github.com/MikeVeerman/tokenspeed/blob/master/index.html . All help is welcome.
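For anyone wondering what that approximation looks like, the core idea boils down to something like this (a simplified Python sketch of the concept, not the actual code in index.html):

```python
# Stream pre-written text at a fixed tokens/second rate, using a crude
# characters-per-token heuristic. Code-like text gets fewer characters per
# token, so the same rate covers less text and feels slower.
import sys
import time

def stream(text: str, rate: float, chars_per_token: float = 4.0) -> None:
    """Print `text` in token-sized chunks, pacing output to roughly `rate` tokens/s."""
    chunk = max(1, int(chars_per_token))
    delay = 1.0 / rate  # seconds per simulated token
    for i in range(0, len(text), chunk):
        sys.stdout.write(text[i:i + chunk])
        sys.stdout.flush()
        time.sleep(delay)

stream("def hello():\n    print('hi')\n", rate=10, chars_per_token=3.0)
```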
LagOps91@reddit
Well, why not use a real tokenizer then? There should be some lightweight stuff you can use on the web. There are also other websites that show how a string of text would tokenize.
Code actually generates quite quickly, typically, so the simulation is off for sure.
MikeNonect@reddit (OP)
It's an approximation. If you feel you can do better, feel free to open a pull request.
LagOps91@reddit
I am aware that it's an approximation, but why do you not just use an actual tokenizer?
Either way, I will not do a PR for this. Please look into how to add an actual tokenizer yourself.
c_pardue@reddit
Just let his thing not tokenize properly and move on.
Express_Quail_1493@reddit
AMAZING. this thread needs more of the "feels" of things
cleversmoke@reddit
Awesome and thank you! I'm at 25 tok/s and it's very usable. I cannot wait for MTP for ~40-50 tok/s and an upgraded GPU for 60-80 tok/s! The dream set up for me.
DaMan123456@reddit
Love it
white_reaper002@reddit
I think if you're resource-limited, multiple smaller dedicated local models work better, since they are incredibly fast, plus you can find custom-made models on Hugging Face, like crow-9b made from qwen3.5-9b. I get around, idk, 100 tokens/s or more, and it's fast: like instantly getting a report on an error log.
Mickenfox@reddit
I swear I saw a website exactly like this a long time ago.
natermer@reddit
10 t/s is going to feel slow. Like you are watching somebody typing stuff out.
21 t/s is more like conversational. It is kinda how fast you'd expect a computer to be spitting out text for you to read.
10 t/s would be fine for batch or autonomous agent use as long as you are not in a hurry to get stuff done and you don't have to be there to interact with it.
20 t/s is fast enough that it is more of a "conversational" mode. It can go faster than you can likely "read with understanding".
It wouldn't be great if you are trying to design something interactive. Like you press a button and expect something to happen. Especially if you are using a model that has "reasoning mode" enabled.
Once you get up to 80 t/s or 100 t/s then it starts getting into the more "instantaneous" realm. Not quite there, but getting up there.
Due-Advantage-9777@reddit
It could be useful to have a token counter too for what has been generated!
It feels a tad faster than reality for the thinking, for instance, as there can be "harder" tokens generated while thinking, such as code or validation marks/crosses etc., which I suppose take a bit longer to generate.
Equivalent-Costumes@reddit
I'm confused. Each model has its own tokenization algorithm, so they're not all the same, are they? Also, it feels a bit slow. Did you simply do "1 character = 1 token"? I mean, to be fair, people making claims on the Internet about token generation speed probably count tokens that way as well.
MikeNonect@reddit (OP)
It's an approximation for sure, but it's not one character / one word = token.
The cool thing about open source is that you can read how it works and then improve it once you are less confused. https://github.com/MikeVeerman/tokenspeed/blob/master/index.html
Happy to see your PR.
letsgoiowa@reddit
I'm fine with 5 tokens a second because I'll just alt tab and get back to doing something else and come back when it's done
dtdisapointingresult@reddit
Your Think + Code tab is very unrealistic.
To simulate the most popular local model, Qwen, it should be 3k tokens of thinking followed by one function.
FishermanTiny8224@reddit
Pretty cool thanks for sharing!
FastDecode1@reddit
This is a great idea.
Suggestion: add "lines of code per second/minute/hour" as metrics to the code section. Could be useful for ballpark estimates of task length (or not, given how ambiguous of a unit "line of code" is).
sirusxx@reddit
Just a brilliant idea
iamapizza@reddit
The text should be navy seal copypasta on repeat
AustinM731@reddit
I have seen a few of these over the past few years. But this is by far the best one that I have seen.
Far-Review-9369@reddit
Simple, but sweet! Thanks for sharing
stddealer@reddit
Comparing tokens/s across different models is also a somewhat flawed metric because every model family has its own tokenizer, and depending on the tokenizer, the same sentence might have a very different token count. Maybe counting words/s would be better, but that also depends on things like language.
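A quick way to see this, as a rough illustration: the two encodings below ship with the tiktoken package (local models like Qwen or Gemma use their own vocabularies again), and the same sentence already gives different token counts:

```python
# Same sentence, different tokenizers, different token counts.
import tiktoken

sentence = "Comparing tokens per second across models is a somewhat flawed metric."

for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(sentence)))
```

So "N tokens/second" on one model family isn't exactly the same amount of text per second on another.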
ThePixelHunter@reddit
Love the UI, thank you
Prestigious_Thing797@reddit
Really puts into perspective how slow manual coding was before these models. I probably would hit 2 or 3 tokens/s in short bursts on a good day. Planning and other tasks would be faster, of course, but still.
Even at 2 tokens/s, if you can run it all the time in a good agentic loop, it can get some real work done.
aguspiza@reddit
For thinking models you would need 100-200tk/s to be productive steering the LLM... for non-thinking ones just 30-40 tk/s is enough. If you prepare the work properly with a faster one and let the slow LLM go on autopilot (with proper filesystem and network controls), even 20 tk/s is enough.
j_osb@reddit
Most good cloud models don’t even break 100t/s, what are you on about?
Anything above 30 is fine, even for thinking models.
aguspiza@reddit
Most cloud models do not even show you the thinking process... what are you talking about?
TrainingTwo1118@reddit
? DeepSeek shows you the reasoning process, same for Claude.
aguspiza@reddit
No, not always. https://www.reddit.com/r/ClaudeAI/comments/1rtibjo/claude_code_now_hides_its_reasoning_where_is_it/
TrainingTwo1118@reddit
Thanks for the link, wasn't aware of that!
wllmsaccnt@reddit
You don't need cloud-model speed to be productive locally (though it does feel nicer). I might tend to agree with you about Qwen3.6 models, though; I swear they've been trained as if the thinking tokens are the whole point.
Have you tried using thinking budgets? I wonder if those would help.
aguspiza@reddit
I use Claude Pro and when thinking it does like 250 tk/s (I do not see the thinking tokens, that number is my guess just looking at the token count going up 😄)
aguspiza@reddit
https://openrouter.ai/inclusionai/ring-2.6-1t:free
This week's free model: Ring 2.6 *1T* (63B active):
* Tokens per second: ~95.8
* Token count: 515
* Cost: $0
* Duration: 5.4 s
Serprotease@reddit
Throughput is 25 tps? Are you sure your data doesn't also include prompt processing?
aguspiza@reddit
It is free... you can test it yourself. In my test, what I saw was way faster than 25 tk/s.
aguspiza@reddit
Even an RTX 5090 32GB can do 600 tk/s with Gemma 4 26B (with speculative decoding).
https://www.reddit.com/r/LocalLLaMA/comments/1t796qe/gemma_4_26b_hits_600_toks_on_one_rtx_5090/
DifficultDog8435@reddit
10 t/s can be totally fine for short replies, but miserable if you’re waiting on a big code explanation or a reasoning-heavy answer. 20+ t/s usually starts feeling usable/interactive, but even that depends on the model. A smarter 27B at 15 t/s can feel better than a weaker small model at 40 t/s if it needs fewer retries.
blackashi@reddit
Yoooooo. Nice
caetydid@reddit
You are addressing an important point. For me, 100-200 t/s is very comfortable, 50-100 is OK, and 20-50 starts being too slow.
HavenTerminal_com@reddit
genuinely had no idea 10 t/s would feel that slow
Samurai_zero@reddit
Best post of the week. Best local tool of the month, so far.
ComplexType568@reddit
I SUPPORT. I can only dream of a 2k t/s model that ISN'T a 4000-parameter model.
Mordred500@reddit
This is great, really puts things into perspective, thanks for sharing!
TechExpert2910@reddit
awesome :)
SaltAddictedMan@reddit
This is great, nice work
MikeNonect@reddit (OP)
There is also a Python version because this subreddit is about running things locally, after all: https://github.com/MikeVeerman/tokenspeed
Alarming-Ad8154@reddit
Brilliant!
pantalooniedoon@reddit
That’s excellent