Does AMD's "infinity cache" even matter for dense model inference?
Posted by boutell@reddit | LocalLLaMA | View on Reddit | 14 comments
AMD has nailed the SEO/AEO for this query in Google:
7900 xtx memory bandwidth
I get back this response:
The AMD Radeon RX 7900 XTX features 24GB of GDDR6 memory with a maximum bandwidth of 960 GB/s. It uses a 384-bit memory interface with memory speeds of 20 Gbps. Thanks to its 96MB of Infinity Cache, AMD claims an effective memory bandwidth of up to 3500 GB/s.
Is there any validity to this for inference, particularly with dense models like Qwen 27b, or not so much? Obviously I'm not taking 3500 seriously, but I wonder if it matters at all.
This is my intuition: the cache is basically useless with 27b, because the whole point of a dense model is that all 27 billion parameters get to have their say on every single token. So every cache lookup is a miss.
Am I correct? If so I can probably just scale the memory bandwidth number against benchmarks from other cards to know what to expect from this card.
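Back-of-the-envelope version of the scaling I have in mind (purely illustrative; it assumes decode is entirely bandwidth-bound and ignores KV cache reads and other overhead, and the quant size is approximate):

# every generated token has to stream the full set of weights from VRAM once
params = 27e9                          # 27B dense model
bits_per_weight = 4.25                 # roughly IQ4_XS
model_bytes = params * bits_per_weight / 8
vram_bandwidth = 960e9                 # 7900 XTX spec, bytes/s
print(vram_bandwidth / model_bytes)    # ~67 t/s theoretical ceiling; real decode lands well below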
To be clear, I'm not slamming AMD here, the cache claim could make some sense for gaming and other workloads. Or not!
(I don't own a 7900 XTX and nobody's renting them online, otherwise I'd just benchmark it.)
Thanks!
Acu17y@reddit
The secret to getting beyond 960 GB/s on AMD, and therefore towards 3.5 TB/s, is to keep the data needed for a given computation inside the Infinity Cache as much as possible, instead of constantly fetching it from VRAM. In llama.cpp this is controlled with the ubatch size parameter (-ub / --ubatch-size).
If ubatch is too large, the intermediate data for the computation exceeds the 96 MB of cache, the GPU has to spill to VRAM, and performance is limited to 960 GB/s.
If ubatch is too small, you don't saturate the XTX's compute units.
Try a value of 128 for dense and 64 for MoE.
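A rough way to see where that threshold might sit (the layer widths below are placeholders, not the model's real config, and this counts only activations, not weight tiles):

# working set of one FFN matmul's input and output activations, fp16 assumed
hidden, ffn = 5120, 25600              # placeholder widths for a ~27B dense model
def working_set_mb(ubatch):
    return (ubatch * hidden + ubatch * ffn) * 2 / 1e6
print(working_set_mb(128))             # ~7.9 MB, comfortably inside the 96 MB cache
print(working_set_mb(2048))            # ~126 MB, spills past the Infinity Cache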
666666thats6sixes@reddit
I tried this with 27b and found no benefit; peak pp performance is with both batch and ubatch >= 2048. Either the cache brings no benefit, or the per-batch call overhead outweighs the gains from cache reuse.
Full runs: https://pastebox.io/paste/93rvtERb5msj
666666thats6sixes@reddit
I have a 7900xtx. I'm using 27B Q8 asynchronously so speed is not a factor, it runs at a few t/s partially offloaded to DDR4 and I just collect the results later. I'm assuming you're interested in a smaller quant.
Let me know what numbers you're after, or better yet link me to the exact gguf and the llama-bench incantation and I'll run it.
boutell@reddit (OP)
Oh hell yes thank you!
The link is:
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/blob/9e3417c2ce78c6214c8be9cb7a8b0927b1be2c8b/Qwen3.6-27B-IQ4_XS.gguf
The bench command:
llama-bench -r 3 -m ~/models/unsloth/Qwen3.6-27B-IQ4_XS.gguf -p 100000 -n 30000 -fa 1 -n 0 -ctk q8_0 -ctv q8_0
boutell@reddit (OP)
Actually, this stays under 128K context, which might matter? Not sure how close to the ceiling we'll be VRAM-wise.
llama-bench -r 3 -m ~/models/unsloth/Qwen3.6-27B-IQ4_XS.gguf -p 100000 -n 28000 -fa 1 -n 0 -ctk q8_0 -ctv q8_0
666666thats6sixes@reddit
It's running; looks like it will be a while :-)
I did some quick tests to see if --fit is succeeding and yes, the 128k fits into VRAM and runs at very nice speeds (35 t/s decode initially).
I added a few more lengths to get a better view of the behavior around the 128k mark. This is currently running:
Is the extra -n 0 meant to be there?
666666thats6sixes@reddit
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
tagging /u/boutell
boutell@reddit (OP)
Thank you!
This is good news. A bit better than I would have predicted from scaling my Mac numbers based on memory bandwidth alone.
Really appreciate it.
Now I can rent a cloud card of similar performance for a real-world test of 27b in our household (one CEO-oper with long dev experience, one design-oper who also has dev and dev leadership experience, so we both thrive on AI dev tools).
boutell@reddit (OP)
Sorry no, -n 0 would give us no data on token generation time, my bad for leaving it in there
Simple_Library_2700@reddit
Nvidia cards also have cache
dsanft@reddit
Cache isn't useless. Otherwise why have it?
No, it's not going to speed up decode, but a bigger cache will help with GEMM (prefill), where you benefit from tile reuse.
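Hand-wavy way to see it (illustrative numbers, nothing measured):

# arithmetic intensity of applying one weight matrix to a batch of tokens:
# each weight byte loaded from VRAM gets reused once per token in the batch
def flops_per_byte(batch, bits_per_weight=4.25):
    return 2 * batch / (bits_per_weight / 8)   # 2 FLOPs (mul + add) per weight per token
print(flops_per_byte(1))      # ~3.8: decode, hopelessly memory bound
print(flops_per_byte(2048))   # ~7710: prefill, compute bound, so tile reuse in cache pays off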
boutell@reddit (OP)
It's a graphics card, not an LLM-only card, so I was considering the possibility that it might be useless for this but useful for gaming.
Faster pre-fill sounds nice.
Double_Cause4609@reddit
It's kind of hard to explain because this involves a lot of things that you have to analyze information-theoretically, but the short answer:
Cache only matters up to a point for modern language models and inference backends.
That is, you basically need enough for the algorithms to run as expected, and everything after that is a marginal increase for the most part.
Longer answer: think about how modern artificial neural networks are computed. The main bandwidth cost (the one relevant here) is the linear layers. A linear layer is basically rows of numbers arranged in blocks. One must load a block, do an operation on it, drop it, and load the next. This is fundamentally bandwidth limited, and it's limited by the global (slowest) memory, which in the case of a GPU is VRAM.
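Schematically, the access pattern being described (a toy numpy version, not an actual GPU kernel):

import numpy as np
# y = W @ x computed block by block: each block of W is touched exactly once per token,
# so the traffic per token is "read all of W", and no cache smaller than W changes that
W = np.random.rand(1024, 1024).astype(np.float16)   # stand-in for one linear layer
x = np.random.rand(1024).astype(np.float16)
y = np.empty(1024, dtype=np.float16)
for i in range(0, 1024, 128):
    y[i:i+128] = W[i:i+128] @ x                      # load block, multiply, move on, no reuse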
Now, there are other architectures. There are architectures where you could load a portion of the weights into cache and operate on them multiple times to refine the answer locally (it doesn't work at this fine-grained a level, but work like looped transformers points in this direction, and you could also frame it as a denoising operation like a diffusion model, etc). In that case, while your total tokens per second would still be bound by the outer loop, your inner loop(s) could make the model perform like a bigger model for effectively free (in the sense that the inner loops running in cache would be extremely cheap by comparison).
The main issue is that nobody has trained a model on those principles, and it's hard to segment the network so that the operation specifically exploits larger cache sizes like that.
In theory there's nothing stopping it, though.
DeltaSqueezer@reddit
If your model fits fully into the cache, then it could be useful. Otherwise, given that most models we are interested in are multiple gigabytes in size, you'll still hit the normal memory bandwidth limit, since most of the model will not live in the cache.
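Quick sanity check on the sizes involved (approximate quantized model size assumed):

cache_bytes = 96e6                 # Infinity Cache
model_bytes = 14.3e9               # ~27B params at ~4.25 bits/weight
print(cache_bytes / model_bytes)   # ~0.007: under 1% of the weights could ever sit in cache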