Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something?
Posted by wbulot@reddit | LocalLLaMA | 73 comments
I read this sub every day and I keep seeing benchmarks and discussions focused almost entirely on tokens/s generation speed. Prompt processing speed barely gets mentioned.
From my own experience running a bunch of different models on different GPUs for all kinds of tasks, the prefill stage is usually the part that actually feels slow. Once generation starts, even “only” 15 t/s is perfectly usable for me. The wait while the model chews through the prompt is what eats most of the time.
Seeing all the hype around MTP lately kind of reinforces that feeling. If generation speed improvements don’t really move the needle on total wall-clock time for typical use cases, why is everyone laser-focused on it?
For example, with Qwen 27B Q6 I’m getting ~15 t/s generation with my current setup (which feels fine no matter what I’m doing) but only ~200 t/s on prefill. I spend way more time staring at prompt processing than I do waiting for the actual reply to finish. Even with prompt caching.
Am I misunderstanding something about how most people use these models? Curious what others are seeing.
pfn0@reddit
when chatting, most of the prefill is cached by prefix, so re-processing doesn't end up costing anything. it mostly matters when you're doing agentic work and processing tons of data. so for many people who aren't using it for coding or data processing, token generation speed dominates the concern.
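a toy sketch of what "cached by prefix" buys you (the helper and token IDs here are made up for illustration, not any real server's API): only tokens past the longest shared prefix need a fresh prefill pass.

```python
# Toy illustration of prefix caching: only tokens after the longest
# shared prefix need a fresh prefill pass. Token IDs are stand-ins;
# a real server compares actual tokenized prompts.

def tokens_to_prefill(cached: list[int], prompt: list[int]) -> int:
    shared = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        shared += 1
    return len(prompt) - shared

history = [1, 2, 3, 4, 5]           # previous turns, already in the KV cache
next_turn = history + [6, 7, 8]     # chat appends, so the prefix stays intact
print(tokens_to_prefill(history, next_turn))  # 3: only the new message

edited = [1, 2, 99, 4, 5, 6]        # a harness rewriting earlier context
print(tokens_to_prefill(history, edited))     # 4: everything after the edit
```

which is why harnesses that rewrite earlier context (timestamps in the system prompt, compaction, reordered tool defs) hurt so much: one changed token early on invalidates everything after it.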
wbulot@reddit (OP)
When I’m just chatting, neither prefill nor generation speed is really an issue. Context stays pretty low, and generation only needs to keep up with reading speed.
It’s the agentic workflows that really expose the need for better optimization. I feel like prefill performance hasn’t kept up with the long contexts we actually use in coding/agentic tasks, while generation speed is far less of a problem. Of course, I might be missing a lot of use cases that require high t/s.
Badger-Purple@reddit
90% of sub posters don’t even run their own local models. It’s a group of highly knowledgeable folks with great tips and tricks, and a gaggle of newbs and fanbois who have never stepped beyond openrouter.
You’re absolutely right about prefill. Case in point: Macs. I just ran a simple prompt in minimax m2.5 ==> 45 tps!! woahh… Except, try giving it 10,000 tokens (agent sys prompt) and see how long it takes to prefill…
pfn0@reddit
my understanding is that mlx aggressively does prefix caching, e.g. using lmcache to reduce recalculating prefill. agents and harnesses that break the prefix are generally hated in these environments.
Badger-Purple@reddit
tbh you can put lipstick on sarah palin and she still is sarah palin. There is a limit to what compute units in mac without matrix multiplication cores can do.
odragora@reddit
10k tokens system prompt is generally considered something completely non-viable for local models, especially on the hardware available to most enthusiasts.
Pi, for example, has a system prompt of around 1,000 tokens.
kevin_1994@reddit
Not true at all. With dedicated GPUs you can get 1500+ pp/s which chews through 10k in only a couple seconds. I use qwen 3.5 9b q8xl as a subagent for long documents, codebase exploration, etc and it has 4000 pp/s.
Prefill needs tensor cores, and Macs, Strix Halo, etc. just don't have enough of them.
gtrak@reddit
This is not an issue at all on 24GB vram. I can run agentic workflows in opencode exclusively on my 4090 with 160k context at q8 and still use it to render my desktop. Results are good.
odragora@reddit
I would say 24GB is far above the average local-model enthusiast.
gtrak@reddit
You can run an MoE model on less, or a 27b with turboquant on 16.
odragora@reddit
MoE 27b and dense 27b are two very, very different things.
gtrak@reddit
I didn't mean to imply that. But you can run 35b-a3b with a large context on less vram and get acceptable speeds. I'm not doing that because 27b is a different league.
odragora@reddit
Well, then we are in a situation where a VRAM-limited user faces a choice between model league, context size, and a 10k system prompt.
In my opinion it's always much better to prioritize model quality and the available context window over a 10k-token system prompt.
gtrak@reddit
I see a tradeoff in my own time if I try to nickel and dime context, and at 160k I can get decently scoped tasks done without thinking about it too hard. 128k was a little too tight and hit compaction too often. Yes, a 10k system prompt is wasteful, but it might take me some time to replace the parts of it that are useful and I'm habituated to. When I tried Pi, I could get it to do stuff, but it's optimizing for something different. I don't want to be locked into a TUI all day. I have long-running agents in the background while I investigate the next piece of work.
afd8856@reddit
what's the model, params, settings and performance? I have the same setup, 24GB, really struggling to balance context, model size and performance
gtrak@reddit
I get 40 token/s. I'm switching out the model all the time, but this variant is what I have today. Skip mmproj for more savings, I just download the gguf directly or you can run with --no-mmproj. I have HW acceleration turned off in my desktop browsers, slack etc to free up more vram.
kaisurniwurer@reddit
you can easily fit 64k context with gemma4 31B.
10k for instruction and logic data is nothing. It being a system prompt doesn't mean much for the "viability" of the system.
odragora@reddit
Except Gemma 4 31B doesn't even fit on 16GB VRAM GPUs with anything above Q3, which already leaves very little space for context, and most people using Huggingface have 8GB.
kaisurniwurer@reddit
A 10k-token system prompt is absolutely viable for local models. And no one considers a 10k prompt something non-viable.
For 8GB VRAM, Gemma4 26B-A4B or Qwen 3.6 35B-A3B is very much capable of the same, using hybrid inference at decent speed. Again, nothing outlandish about a 10k prompt.
odragora@reddit
At the very least the creators of the most successful / performant agents designed for local models, such as Pi and agents based on Pi like little-coder, do consider 10k system prompts non-viable, have 1k system prompts, and claim that 10k system prompts work okay with cloud models and do not with local ones.
You are talking about MoE models. They are significantly behind similar-size dense models in capabilities and reliability.
Imaginary-Unit-3267@reddit
Why are people who don't use local AI on the local AI subreddit? There should probably be a way of vetting that too, alongside the bots.
arcanemachined@reddit
Dude you're on EternalSeptember.com, we're all idiots here.
gtrak@reddit
Oh I'm not reading all that. I constantly run subagents and longer threads. If I want to read it, I have it create an artifact when it's done. The faster the better.
Tormeister@reddit
15 t/s TG indeed is fine for conversation but it is infuriatingly slow for coding.
When you get 5000 - 10000 tok/s PP, and the entire context is already cached, and you're adding 500 - 20k tokens per message round, all you care about is TG.
It's all about the use case combined with the available hardware.
FullOf_Bad_Ideas@reddit
Less than ideal but I think 15 t/s TG is fine for coding - usually most of the time is spent on prefill, repo research and planning anyway. I mean I use it this way, around 300 - 600 t/s PP and 18-30 t/s TG. It could be faster, but it's not infuriatingly slow for me. I run Qwen 3.5 397B locally.
Temporary-Sector-947@reddit
I second this.
Prefill is a real bottleneck in real scenarios.
nomorebuttsplz@reddit
Using GLM 5.1 as the main planner and coder on a Mac M3U, and qwen 3.6 35b for explore and other smaller agents, I find prompt processing time is similar to token gen time when developing small apps in Python and Xcode.
The smaller LLM with 1k t/s pp time will tell the large one which sections of code to look at within individual files, making the slow pp of glm 5.1 mostly a non issue.
yes_i_tried_google@reddit
I get the feeling your prompt cache isn’t working effectively then. With opencode, for example, I found a load of old open bugs where something changes every prompt and ends up breaking caching.
Below I put my results after getting slots working properly.
https://www.reddit.com/r/LocalLLaMA/s/KeVFgnISEE
wbulot@reddit (OP)
I regularly check my cache hit rate and it does seem to be working fine. I’m not sure how many people here actually work on large codebases with local LLMs, but in agentic workflows the harness usually has to ingest 50k+ tokens of context before it can even begin doing anything. So even with a working cache, you’re still waiting for those 50k tokens to be processed.
That’s why the tokens-processed vs tokens-generated ratio is so heavily skewed in agentic use cases. For me, that’s exactly why prompt processing speed feels 10x more important than generation speed.
yes_i_tried_google@reddit
In my test using slots it takes <5 seconds to ingest a 100k prompt; not sure why it would be different for you if you’re hitting cache
wbulot@reddit (OP)
Because it has nothing to do with cache.
Your harness reads one file (let’s say 5k tokens). Then it decides it needs another file, so now you have to process 5k more tokens on top of what’s already in context. Cache will skip the previous tokens, sure, but you still have to process all the new ones.
Meanwhile the model only outputs something short like “read this file” — just a handful of tokens generated — while you just burned thousands of tokens on the prompt side.
In real agentic work on an actual codebase, this keeps happening: the model reads file after file, steadily pushing context up to 20k, 30k, or even 50k tokens. The ratio is completely lopsided. At the end of the day you’re mostly waiting for the model to finish processing the prompt, not waiting for it to generate the next reply.
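To make the ratio concrete, here's a toy tally using my own numbers; the per-turn sizes (5k-token files, ~20-token tool replies) are just illustrative assumptions:

```python
# Toy model of an agentic loop: total wall-clock time on each side.
# Rates match my setup (200 t/s prefill, 15 t/s generation); file and
# reply sizes are assumptions for illustration.

PP_RATE, TG_RATE = 200, 15            # tokens/s
FILE_TOKENS, REPLY_TOKENS = 5_000, 20

prefill_s = gen_s = 0.0
for _ in range(10):                   # the agent reads 10 files back to back
    prefill_s += FILE_TOKENS / PP_RATE    # cache skips old tokens, not new ones
    gen_s += REPLY_TOKENS / TG_RATE       # "read this file" is a handful of tokens

print(f"prefill: {prefill_s:.0f}s, generation: {gen_s:.0f}s")
# prefill: 250s, generation: 13s
```

Even a 10x generation speedup would save about 12 seconds here; a 10x prefill speedup saves almost four minutes.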
nomorebuttsplz@reddit
The way I get around this is by having a smaller sub-agent explore the files and give the main model a sense of which part it needs to edit. Then it doesn’t have to read the whole thing.
ItilityMSP@reddit
Change the way you code and make it more modular, so models don't have to ingest the whole codebase to make a change; focus on good modular software development with clear seams. If you're doing enterprise codebases, get them to pay for the hardware or a subscription plan. If you're writing your own code, change the way you and your models code: don't let them create files with 20 functions and 40 subfunctions, but one file per function and one per subfunction. Create a data architecture and naming standards, and make your agents use them. Yes, it means you need a clear idea of what you want the software to do at each layer (interface, object, process, database), but spend that time upfront and agents don't need to know everything.
Imaginary-Unit-3267@reddit
Exactly right. And in other words: actually bother engineering software yourself, and use your head for more than a neck ornament. Sadly a lot of vibe coders don't bother with this important step.
SLxTnT@reddit
Are you offloading to CPU? Those are the prompt processing speeds I get when I do.
u23043@reddit
Yes, but prompt processing is usually orders of magnitude faster than token generation. If you have a workload where 99.9% of tokens are input tokens this might matter, but in reality both matter and token generation is often the bottleneck (at least for reasoning models)
Valuable-Run2129@reddit
Even if cache works fine, all harnesses break cache at compaction, and then it’s 5 minutes of waiting. Or just big files in tool outputs. It’s unbearably slow at 200 tokens per second.
Several-Tax31@reddit
Thanks for this.
mild_geese@reddit
Both are important for agentic stuff
silentus8378@reddit
This is why I still don't think local AI is as viable as the hype indicates. For my average use I'd really need an enterprise-grade GPU for local AI, but that's too expensive 😞
kaisurniwurer@reddit
In my experience, PP and TG are 50/50 time-wise on a 3090, but I'm quite optimised for a specific case. Not terrible, but it does need some work to make it happen.
In recent SWA models, processing is noticeably longer too, compared to Mistral for example, so model choice also matters.
OddDesigner9784@reddit
In most conversations you're starting from a small amount of context and working up. The prompt gets cached as you go in regular system RAM, but thinking can be the real slowdown: it's how fast you can execute a task.
kaisurniwurer@reddit
For conversations that's true. For more convoluted workflows, you usually don't have as much static context.
AeroelasticCowboy@reddit
Totally agree. My biggest use case for LLMs is home automation, to replace what a Google Home or Alexa does. For this to work, any automation ask requires the prompt to be populated with live states of every sensor in the home along with current date/time, etc. So those tend to be 6,000-9,000 token prompts just to turn on a lightbulb, with an extremely short response, so literally all that matters is prefill speed. Ideally PP around 3,000/s or better provides a fluid experience with voice control.
ThePixelHunter@reddit
Try to take advantage of prompt caching. If you organize the prompt so that entities which tend to have static states (lightbulbs on, etc.) are towards the beginning of the prompt, while more variable states (the time, the temperature, etc.) are towards the end, you'll only need to recalculate context from the first modified token onwards.
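A minimal sketch of the idea (the helper and entity names are hypothetical, not Home Assistant's actual API):

```python
# Keep the cached prefix long: stable entity states first, volatile
# values (time, temperature) last, so only the short tail is recomputed
# after the first changed token. Names here are made up for illustration.

def build_prompt(stable_entities: dict[str, str], volatile_state: dict[str, str]) -> str:
    stable = "\n".join(f"{k}: {v}" for k, v in sorted(stable_entities.items()))
    churn = "\n".join(f"{k}: {v}" for k, v in volatile_state.items())
    return f"{stable}\n{churn}"  # long stable prefix, short changing tail

prompt = build_prompt(
    {"light.kitchen": "off", "lock.front_door": "locked"},  # rarely changes
    {"time": "18:42", "sensor.outdoor_temp": "3C"},         # changes every call
)
```

Sorting the stable block matters too: a deterministic order keeps the prefix byte-identical between calls.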
I also use an LLM for my Home Assistant setup, and while I don't know how to change what order the tools get inserted into the prompt, it's worth looking into.
CarelessSpark@reddit
If you have custom dynamic info in your system prompt then that can be optimized, but HA injects a bunch of extra stuff on top of your system prompt. It's particularly bad at caching when switching between multiple clients (such as multiple voice satellites), but there's work underway to improve it.
2025.5.0 also added filtering to GetLiveContext so it can check for specific devices or areas instead of dumping the entire home state into context.
AeroelasticCowboy@reddit
I'm not sure you can; in fact, I think time might be one of the first ones. My testing showed me, more or less, that you need the language processing portion of the pipeline to happen in 3 seconds or less, or the system feels clunky and you get annoyed at the wait. 2 seconds is even better. Occasionally I get responses in under 2 with my R9700 and qwen 3.6 35a3b at Q5K, but the sense model is much slower, 6+ seconds.
corruptbytes@reddit
i think prefill is definitely important; tool parsing is also pretty important, and cache management too
lots of things i’ve seen be more of a pain than tok/s
i have gotten decent results from omlx ssd cache
hoping things like PFlash are proven out
corruptbytes@reddit
i ended up vibe coding PFlash for rapid mlx, and then i saw the OG implementation couldn’t really PFlash tooling bc of JSON structure, so i vibe coded just a simple mining tooling thing. prefill is looking not bad!! we will see after running benchmarks overnight
akumaburn@reddit
If you're doing agentic coding, 15 tokens a second is extremely slow. To put that into perspective: agentic coding plans can typically do 100 tok/s, and even then it can feel slow.
power97992@reddit
Dude, I use the API, and I don’t usually get more than 60 t/s of decoding. >100 t/s is usually for small models or special chips… Even
TheRealMasonMac@reddit
Fireworks has some weird magic to get both insane prefill and token speeds. Prefill is practically instant, and even with big models generation is ~100 TPS with the turbo models.
power97992@reddit
Prefill should feel instant, since a B200 has 20 petaflops of FP4; that's around 60 thousand tok/s of prefill for DeepSeek v4 pro
Imaginary-Unit-3267@reddit
Okay, I must be missing something - I have a weak computer and get 20-24 t/s, but this feels... fine? Yeah, there's a lot of waiting around, but I use that time to work on stuff myself, since the machine is a helper and not a replacement for my own work. Is this not how everyone does it? Tell agent what to do, work in parallel on another aspect of the project while you wait, repeat?
akumaburn@reddit
If you want to go full hands-off, you may be dealing with a lot of slop later unless you use an orchestrator that also does QA, something like my AutoIdeator: https://github.com/akumaburn/AutoIdeator
Monkey_1505@reddit
Well, take for example ROCm for AMD cards: prefill is substantially faster than with Vulkan (I usually get 500+ t/s even on a potato mobile card), but generation speed is slower.
Fit_Split_9933@reddit
You're right. I've always thought that the real bottleneck for local LLMs is the prefill stage. During PP the GPU is already running at full capacity, so unlike TG there aren't many techniques you can apply to improve speed. I saw a technique called PFlash before, but it comes at the cost of reduced accuracy.
silentsnake@reddit
Precisely, that’s why so many people keep shitting on the DGX Spark's batch=1 TPS without considering how the Blackwell in it chews through prefill tokens and unlocks batch > 1. Prefill speed is arguably more important for agentic workflows. On Macs/Strix Halo boxes, when the agent starts exploring your codebase by reading and grepping around, you’ll see it choke waiting for prefill (even with prefix caching on). So what if you can generate tokens twice or thrice as fast when you're reading 10x slower? The whole workflow is just slower! To make things worse, on Strix Halo/Macs the prefill drop-off rate is much steeper than Blackwell's as context grows long.
Puzzleheaded_Base302@reddit
OP is absolutely right.
when people test a new LLM, they try to type in as few words as possible to get as much output as possible. That way, the PP rate does not matter. The TG rate is something people can actually feel.
A lot of people pay a lot of money to buy a large-RAM Mac. The TG rate seems OK, not too terrible, but the PP rate is just not there.
Worse, people spend a lot of effort getting large models to run with CPU, integrated GPU, NPU, etc. They quote TG rates somewhere on par with human reading speed. But man, that 2 tps PP rate kills you when you try any real agentic work: 20 min to get the first token on a small 9B dense model.
unjustifiably_angry@reddit
Like 1 in 50 people here actually use LLMs in any practical capacity, the rest fire up llama.cpp, type "hello", and seal clap
gtrak@reddit
Running qwen 3.6 27b on a 4090, with a prefill of 1k-2k tokens/s and generation speed of 30-40. I only feel the prefill when there's a cache problem and I'm at ~100k context. Token generation dominates my time for sure.
GrungeWerX@reddit
You’ve actually brought up a great point, something that has been bugging me a lot working with my own agent.
Ell2509@reddit
It is an issue I experience: several models in llama.cpp reprocess the full context every turn.
In models where this isn't the case, it's less impactful. But gpt-oss 120b and qwen3.6 27b both have this issue, for example.
abnormal_human@reddit
You're not missing anything. In real world applications, prefill and cache reads/write dynamics are the main thing you optimize for and where most of the costs live.
However, if you're a recreational user having casual chats with LM studio or whatever, your prompts are short, your one conversation is always cached, and what's left to obsess over is t/s.
I think of t/s as a "good enough" thing. Around 50, things feel about as fluid as ChatGPT and other products people are familiar with, and people won't be too offended. If you're literally reading the output it can be slower, maybe even down to 15-20, but usually it's an agent doing work--generating code, making tool calls, etc. And for that, 50 is OK and 100 starts to feel fast.
Badger-Purple@reddit
I think if your prefill is 1-2k minimum, 25 feels OK for an agent. They really don’t talk much, and turn by turn a couple of phrases feels natural at that speed. So with that in mind, 50 tps == the current speed of Alibaba’s own server for qwen-plus, per Artificial Analysis == “cloud fast”. 100 is ripping fast.
rpkarma@reddit
Depends on your machine. I can get 1000+ effective prefill on my spark, but generation can always be made better
farkinga@reddit
This was an "aha" moment for me a few months ago, as well. Yes, I agree with you. I am willing to tolerate 15 t/s generation as long as I can get over 1000 t/s prompt processing.
Perhaps my workload is similar to yours; but yes, ingesting files is a big part of it and ... well, I went pretty deep on Qwen3.6 35b because I was seeing prompt processing speeds like 3000 and 4000 t/s. And it was just so good that I was almost willing to overlook the numerous ways it would mess up during generation.
Today, however, I'm running dense models and I am willing to accept slower speeds as long as the quality is better. Even so, it's all about that prompt processing. I'm still grinding to improve that part.
_TeflonGr_@reddit
Yes it is, and I'm tired of pretending it's not. Though people may not realize it, because they don't fill the cache much, or when they do it's in agentic flows where it's not that apparent, or they're not actively watching the process. Also, the bottleneck is not so much compute as memory speed, or transfer speed for multi-GPU systems.
leonbollerup@reddit
pre-fill is only a problem if you constantly have to load the model.. if you, like me, have the model loaded all the time.. pre-fill is not a problem
ortegaalfredo@reddit
For chatbots prefill is almost non-existent, as you don't have a lot of chat history to process. But for coding agents the context fills up very quickly, and that's where you need at least 500 tok/s prefill.
Also, token generation speed is not really that important if you turn off thinking, but then you're losing most of the model's intelligence.
ikkiho@reddit
Two things worth separating cleanly:
(1) Prefill and decode are bottlenecked by different physical resources. Prefill is compute-bound, since it's a big matmul over the full prompt, so it scales with TFLOPs. Decode is memory-bandwidth-bound, since each new token reads the entire KV cache from HBM, so it scales with HBM GB/s and shrinks the more KV you hold in flight. A single tok/s number can't predict either. That's also why a 4090 looks great on prefill but a Mac Studio with very wide unified memory bandwidth feels disproportionately good on decode for big models.
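A back-of-the-envelope roofline sketch of both ceilings, with assumed hardware numbers rather than measurements:

```python
# Rough roofline estimate: prefill is compute-bound, decode is
# bandwidth-bound. All hardware numbers below are assumptions for
# illustration, roughly 4090-class.

params_b = 27           # model size in billions (e.g. a 27B dense model)
bytes_per_param = 0.85  # rough Q6-ish quantized footprint

tflops = 165            # assumed usable compute, TFLOPs
hbm_gbs = 1000          # assumed memory bandwidth, GB/s

# Prefill: ~2 FLOPs per parameter per token, limited by compute.
prefill_tps = (tflops * 1e12) / (2 * params_b * 1e9)
# Decode: each token re-reads the weights (plus KV), limited by bandwidth.
decode_tps = (hbm_gbs * 1e9) / (params_b * 1e9 * bytes_per_param)

print(f"prefill ceiling ~{prefill_tps:,.0f} t/s, decode ceiling ~{decode_tps:.0f} t/s")
# prefill ceiling ~3,056 t/s, decode ceiling ~44 t/s
```

Those ceilings line up with what people report in this thread: thousands of t/s prefill but a few dozen t/s decode on the same card.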
(2) There are two SLOs that get conflated: time-to-first-token vs tokens-per-second. Chat-style benchmarks (short prompts, streaming output) hide TTFT because the first token comes back fast, so TPS dominates how it feels. The crossover where TTFT starts dominating is somewhere around 2 to 4K prompt tokens depending on hardware. Long context (codebase summaries, RAG over big docs, agentic loops with growing tool histories) lives past that crossover, and that's exactly where you sit watching prefill chew the prompt. Public benchmark culture skews short-prompt because the suites were originally built for chat.
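To see the crossover in numbers (a minimal sketch; the rates are illustrative assumptions):

```python
# Total wall-clock for one request: TTFT (prefill) plus decode time.
def wall_clock(prompt_toks: int, out_toks: int, pp: float, tg: float) -> float:
    return prompt_toks / pp + out_toks / tg   # seconds

# Chat-style: short prompt, TPS dominates the feel.
print(wall_clock(500, 400, pp=2000, tg=30))    # 0.25s prefill + 13.3s decode
# Agentic long-context at OP's 200 t/s prefill: TTFT dominates.
print(wall_clock(50_000, 400, pp=200, tg=15))  # 250s prefill + 26.7s decode
```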
Two specifics on your numbers. 200 t/s prefill on Qwen 27B Q6 is suspiciously low for any reasonable 4090-class GPU. Modern kernels typically push 1500 to 3000 t/s prefill at that size. If you're seeing 200, you probably have layers offloaded to CPU, an undersized prefill batch, or a runtime that's not using the right matmul kernel. Worth profiling with nsight or just checking layer offload counts. Also "with prompt caching" can mean three different things (HTTP-level prefix cache, KV reuse across slots in llama.cpp, OS-side weight caching) and only one of them helps your workflow. Worth checking which one is actually firing.
Speculative decoding flips this further. Once draft models give you 2 to 4x decode speedup, prefill becomes an even larger fraction of wall-clock. Your intuition is roughly where the open-weights serving stack is heading.
ttkciar@reddit
It really depends on what you are doing.
If you're mostly doing batched inference, most of your overall time is spent on generation, not prompt processing, and since you're not staring at the screen waiting for a response, it matters not at all.
Ledeste@reddit
Based on the hype around the hyper-specific NVFP4 implementation, I don't think this is something people will forget. But indeed, with proper caching it is much less of an issue.
I think it will be a big one soon, though, once proper agentic tools arrive, as the large amount of context switching tends to make the cache less effective. Also, with more models at 1T context length, new usage will come; right now they're pretty limited.