SambaNova and Intel Announce Blueprint for Heterogeneous Inference: GPUs For Prefill, SambaNova RDUs for Decode, and Intel® Xeon® 6 CPUs for Agentic Tools
Posted by Primary_Olive_5444@reddit | hardware | View on Reddit | 4 comments
https://sambanova.ai/press/sambanova-announces-collaboration-with-intel-on-ai-solution
SambaNova announcement:
In this new design:
- GPUs handle the highly parallel prefill phase, turning long prompts into key‑value caches efficiently.
- SambaNova RDUs sit alongside Xeon 6 as the dedicated inference fabric for high‑throughput, low‑latency decode, ensuring that once the CPUs have set up the work, tokens are generated quickly and efficiently.
- Xeon 6 is the host CPU and system control plane, responsible for agentic task coordination, workload distribution, tool and API execution, and system‑level behavior, while also serving as the action CPU that compiles and executes code and validates results.
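The division of labor in those three bullets can be sketched in miniature. Everything below is a stand-in written for illustration; none of these classes correspond to a real SambaNova or Intel API.

```python
# Toy sketch of the disaggregated pipeline: the GPU runs prefill
# (prompt -> KV cache), the RDU runs decode (KV cache -> tokens),
# and the host CPU acts as the control plane that ties them together.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    prompt: str
    entries: list = field(default_factory=list)


class GPU:
    def prefill(self, prompt: str) -> KVCache:
        # Highly parallel pass over the whole prompt at once.
        return KVCache(prompt=prompt, entries=prompt.split())


class RDU:
    def decode(self, cache: KVCache, max_tokens: int) -> list:
        # Sequential, latency-sensitive token generation.
        return [f"tok{i}" for i in range(max_tokens)]


class XeonHost:
    """Control plane: routes work between devices, would also run tools."""

    def __init__(self, gpu: GPU, rdu: RDU):
        self.gpu, self.rdu = gpu, rdu

    def run(self, prompt: str, max_tokens: int = 4) -> list:
        cache = self.gpu.prefill(prompt)            # phase 1 on the GPU
        return self.rdu.decode(cache, max_tokens)   # phase 2 on the RDU


host = XeonHost(GPU(), RDU())
print(host.run("summarize this long prompt", max_tokens=3))
```

The point of the split is that prefill is throughput-bound and batch-friendly while decode is latency-bound, so each lands on the hardware suited to it.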
It seems like an RDU is built for faster data movement (loading and unloading, relative to GPU hardware) during inference.
For a given inference task, you load all the relevant expert models for that task/prompt into DDR memory first, then fast-swap them through the different phases until the task completes.
Phase 1: use model A, which is best for this part of the workload
Phase 2: load model B (which is good for the next part) and move out A (maybe start preparing the load of C in the meantime?)
Phase 3: move out B and load model C
Is this how it works roughly?
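If it does work that way, the phase scheme above would amount to a double-buffered loop: run the current expert while prefetching the next one in the background. A minimal sketch, with the thread pool and all names being illustrative assumptions rather than any real RDU API:

```python
# Run model A while B loads, then B while C loads, and so on.
from concurrent.futures import ThreadPoolExecutor

MODELS_ON_DISK = {"A": "weights-A", "B": "weights-B", "C": "weights-C"}


def load_into_ddr(name: str) -> str:
    # Stand-in for copying an expert's weights into DDR staging memory.
    return MODELS_ON_DISK[name]


def run_phase(weights: str, phase: int) -> str:
    # Stand-in for executing one phase of the task with the loaded expert.
    return f"phase{phase}:{weights}"


def run_task(phases=("A", "B", "C")) -> list:
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        current = load_into_ddr(phases[0])
        for i, name in enumerate(phases):
            # Start prefetching the next expert while the current one runs.
            nxt = pool.submit(load_into_ddr, phases[i + 1]) if i + 1 < len(phases) else None
            results.append(run_phase(current, i + 1))
            if nxt is not None:
                current = nxt.result()  # swap: next expert replaces current
    return results


print(run_task())
# -> ['phase1:weights-A', 'phase2:weights-B', 'phase3:weights-C']
```

The overlap of compute and transfer is what would hide the swap latency, assuming the load of B finishes before phase 1 does.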
aibruhh@reddit
Oooo super interesting, Nvidia is not ready for this! I did a breakdown here:
https://open.substack.com/pub/digdeeptech/p/intels-ai-comeback-is-a-bet-that?r=z7nbh&utm_medium=ios
Primary_Olive_5444@reddit (OP)
Thanks for sharing the link.
SambaNova's RDU -> is meant for reconfiguring how data moves across the compute units (in CPU parlance, these are just the execution ports that do the math).
So would you say that the value proposition comes from:
Users -> provide a prompt for the task
Based on the prompt -> load the X models best tuned to deliver optimal or cost-efficient results for the task
So it has more AGUs (address generation units) to compute source and destination addresses across all three tiers of memory: DDR5 -> HBM -> SRAM (register files).
And to support more of those operations it has more dedicated read/write ports to handle data movement, with higher bi-directional bandwidth.
But that got me thinking about the signal and power line routing.
If the RDU is handling such a big mesh-like interconnect design, the clock speeds have got to be lower, right?
Helpdesk_Guy@reddit
For anyone wondering what that new term RDU being thrown around means: it supposedly stands for Reconfigurable Dataflow Unit, pretty much what a GPU already is…
In essence, a fancy new word for what we would've called an advanced ASIC back in the day.
blueredscreen@reddit
SambaNova. Oh, that one again.