Does AMD's "infinity cache" even matter for dense model inference?

Posted by boutell@reddit | LocalLLaMA

AMD has nailed the SEO/AEO for this query in Google:

7900 xtx memory bandwidth

I get back this response:

The AMD Radeon RX 7900 XTX features 24GB of GDDR6 memory with a maximum bandwidth of 960 GB/s. It uses a 384-bit memory interface with memory speeds of 20 Gbps. Thanks to its 96MB of Infinity Cache, AMD claims an effective memory bandwidth of up to 3500 GB/s.

Is there any validity to this for inference, particularly with dense models like Qwen 27b, or not so much? Obviously I'm not taking the 3500 GB/s figure seriously, but I wonder if the cache matters at all.

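This is my intuition: the cache is basically useless with 27b, because the whole point of a dense model is that all 27 billion parameters get to have their say on every single token. The weights are far bigger than 96MB, so by the time a given weight is needed again for the next token it has long since been evicted. So essentially every weight read is a cache miss.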
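To put a rough number on that intuition, here is a minimal sketch that compares the 96MB cache to the size of the weights. The bits-per-weight values are illustrative assumptions (real quantization formats carry some overhead), and the helper names are made up for this example. Note that the FP16 case would not fit in 24GB at all; it is only there for scale.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


def max_cache_fraction(cache_bytes: float, model_bytes: float) -> float:
    """Upper bound on the share of the weights that can sit in cache at once."""
    return min(1.0, cache_bytes / model_bytes)


INFINITY_CACHE_BYTES = 96 * 1024**2  # 96MB, from the quoted spec
N_PARAMS = 27e9                      # 27 billion parameters

# Bits per weight are assumptions for illustration, not measured file sizes.
for label, bits in [("FP16", 16.0), ("8-bit", 8.0), ("4-bit", 4.0)]:
    size = weight_bytes(N_PARAMS, bits)
    frac = max_cache_fraction(INFINITY_CACHE_BYTES, size)
    print(f"{label}: {size / 1e9:.1f} GB of weights, cache holds at most {frac:.2%}")
```

Even at 4 bits per weight the cache can hold under 1% of the weights, and with a streaming access pattern the actual hit rate on weights is likely lower than that bound. This says nothing about smaller, reused data such as activations, which may still benefit.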

Am I correct? If so I can probably just scale the memory bandwidth number against benchmarks from other cards to know what to expect from this card.
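A rough sketch of that scaling idea, assuming token generation is purely memory-bandwidth-bound and every weight is read once per token. The 960 GB/s figure comes from the quoted spec; the reference card's numbers below are hypothetical placeholders, not real benchmark results, and the function names are invented for this example.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s if each token requires reading all weights once."""
    return bandwidth_gb_s / model_gb


def scale_from_reference(ref_tok_s: float, ref_bandwidth_gb_s: float,
                         target_bandwidth_gb_s: float) -> float:
    """Scale a measured tokens/s figure linearly by the memory bandwidth ratio."""
    return ref_tok_s * target_bandwidth_gb_s / ref_bandwidth_gb_s


XTX_BANDWIDTH_GB_S = 960.0  # from the quoted spec
MODEL_GB = 13.5             # assumption: 27B parameters at roughly 4 bits per weight

ceiling = decode_ceiling_tok_s(XTX_BANDWIDTH_GB_S, MODEL_GB)
print(f"Bandwidth-bound ceiling: {ceiling:.1f} tok/s")

# HYPOTHETICAL reference measurement: substitute a real benchmark from another card.
REF_TOK_S = 30.0
REF_BANDWIDTH_GB_S = 1000.0
estimate = scale_from_reference(REF_TOK_S, REF_BANDWIDTH_GB_S, XTX_BANDWIDTH_GB_S)
print(f"Scaled estimate: {estimate:.1f} tok/s")
```

The ceiling is an optimistic bound; measured throughput will land below it because of KV cache reads, kernel efficiency, and differences in the software stack between vendors. Scaling from a real benchmark on another card folds some of that in, but only to the extent that the two cards run equally well-tuned kernels.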

To be clear, I'm not slamming AMD here; the cache claim could make some sense for gaming and other workloads. Or not!

(I don't own a 7900 XTX and nobody's renting them online, otherwise I'd just benchmark it.)

Thanks!