DGX Spark reproducing the benchmarks by NVIDIA for training
Posted by khoka_x9@reddit | LocalLLaMA | View on Reddit | 2 comments
Anyone tried to repro the benchmark numbers for fine-tuning with DGX Spark? Overall the number says Llama 3.2 3B fine-tuning peak tokens/s is ~80k. That is roughly 8000/(2048*8) ~= 5 steps/second.
In reality when I ran the llama3.2 3b fine-tune from here: https://build.nvidia.com/spark/pytorch-fine-tune
python Llama3_3B_full_finetuning.py

I got around ~0.5 steps/second:
============================================================
TRAINING COMPLETED
Training runtime: 106.51 seconds
Samples per second: 4.69
Steps per second: 0.59
Train loss: 1.0989
Which is roughly ~8k tokens/second. Any idea what the reason for this discrepancy is, or am I misinterpreting the NVIDIA benchmark?
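For anyone checking the arithmetic, here's a minimal sketch of the conversion I'm doing, assuming (as in my math above) a sequence length of 2048 and a global batch size of 8 per step. The function name and the batch geometry are my own assumptions, not anything from NVIDIA's benchmark page:

```python
# Hedged sketch: convert a tokens/sec throughput figure into the
# steps/sec it implies, given an assumed batch geometry.
SEQ_LEN = 2048       # assumed tokens per sample
GLOBAL_BATCH = 8     # assumed samples per optimizer step
tokens_per_step = SEQ_LEN * GLOBAL_BATCH  # 16384 tokens processed per step

def steps_per_sec(tokens_per_sec: float) -> float:
    """Steps/sec implied by a tokens/sec rate at this batch geometry."""
    return tokens_per_sec / tokens_per_step

# NVIDIA's quoted peak of 80k tokens/sec would imply ~4.88 steps/sec:
print(steps_per_sec(80_000))

# Going the other way, my measured 0.59 steps/sec corresponds to
# roughly 0.59 * 16384 ≈ 9.7k tokens/sec:
print(0.59 * tokens_per_step)
```

The discrepancy question is then just: peak ~4.88 steps/sec implied, vs ~0.59 steps/sec measured.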
United-Manner-7@reddit
Your interpretation and your math are both off.
First, the division you did is incorrect.
You wrote:
8000 / (2048 * 8) ≈ 5 steps/sec
But the actual result is:
8000 / 16384 ≈ 0.488 steps/sec
So the arithmetic alone is wrong.
Second, NVIDIA’s number is 80,000 tokens/sec peak, not 8,000.
You used the wrong value and then misinterpreted it.
Third, the “peak tokens/sec” in NVIDIA’s table is a synthetic forward-only throughput metric.
It excludes real training overhead such as backprop, optimizer step, dataloading, logging, and gradient accumulation.
You cannot directly convert that peak throughput to real training steps/sec.
Because of this, comparing NVIDIA's peak number to your actual PyTorch fine-tuning run will always make the real run look far slower than it "should" be.
khoka_x9@reddit (OP)
My bad, I meant 80k tokens in the calculation; I missed a zero. It should be 80000 / (2048*8) ≈ 4.9.