DGX Spark reproducing the benchmarks by NVIDIA for training
Posted by khoka_x9@reddit | LocalLLaMA | View on Reddit | 2 comments
Anyone tried to repro the benchmark numbers for fine-tuning with DGX Spark? Overall the number says Llama 3.2 3B fine-tuning peak tokens/s is ~80k. That is roughly 8000/(2048*8) ~= 5 steps/second.
In reality when I ran the llama3.2 3b fine-tune from here: https://build.nvidia.com/spark/pytorch-fine-tune
python Llama3_3B_full_finetuning.py

I got around ~0.5 steps/second:
============================================================
TRAINING COMPLETED
Training runtime: 106.51 seconds
Samples per second: 4.69
Steps per second: 0.59
Train loss: 1.0989
Which is roughly ~8k tokens/second. Any idea what the reason for this discrepancy is, or am I misinterpreting the NVIDIA benchmark?
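For anyone checking the arithmetic, here's a minimal sketch of the conversion I'm doing, assuming (as in my math above) a sequence length of 2048 and a global batch size of 8 per step. The function name and the batch geometry are my own assumptions, not anything from NVIDIA's benchmark page:

```python
# Hedged sketch: convert a tokens/sec throughput figure into the
# steps/sec it implies, given an assumed batch geometry.
SEQ_LEN = 2048       # assumed tokens per sample
GLOBAL_BATCH = 8     # assumed samples per optimizer step
tokens_per_step = SEQ_LEN * GLOBAL_BATCH  # 16384 tokens processed per step

def steps_per_sec(tokens_per_sec: float) -> float:
    """Steps/sec implied by a tokens/sec rate at this batch geometry."""
    return tokens_per_sec / tokens_per_step

# NVIDIA's quoted peak of 80k tokens/sec would imply ~4.88 steps/sec:
print(steps_per_sec(80_000))

# Going the other way, my measured 0.59 steps/sec corresponds to
# roughly 0.59 * 16384 ≈ 9.7k tokens/sec:
print(0.59 * tokens_per_step)
```

The discrepancy question is then just: peak ~4.88 steps/sec implied, vs ~0.59 steps/sec measured.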
United-Manner-7@reddit
Your interpretation and your math are both off.
First, the division you did is incorrect.
You wrote:
8000 / (2048 * 8) ≈ 5 steps/sec
But the actual result is:
8000 / 16384 ≈ 0.488 steps/sec
So the arithmetic alone is wrong.
Second, NVIDIA’s number is 80,000 tokens/sec peak, not 8,000.
You used the wrong value and then misinterpreted it.
Third, the “peak tokens/sec” in NVIDIA’s table is a synthetic forward-only throughput metric.
It excludes real training overhead such as backprop, optimizer step, dataloading, logging, and gradient accumulation.
You cannot directly convert that peak throughput to real training steps/sec.
Because of this, comparing NVIDIA's peak number to your actual PyTorch fine-tuning run will always make the real run look far slower than it "should" be.
khoka_x9@reddit (OP)
My bad, I meant 80k tokens in the calculation; I missed a zero. It should be 80000 / (2048*8) ≈ 4.9.