DGX Spark reproducing the benchmarks by NVIDIA for training

Posted by khoka_x9@reddit | LocalLLaMA | View on Reddit | 2 comments

Has anyone tried to repro the benchmark numbers for fine-tuning with DGX Spark? Overall the numbers say Llama 3.2 3B fine-tuning peaks at ~80k tokens/s. That is roughly 80,000/(2048×8) ≈ 5 steps/second.

In reality when I ran the llama3.2 3b fine-tune from here: https://build.nvidia.com/spark/pytorch-fine-tune

python Llama3_3B_full_finetuning.py

I got around ~0.59 steps/second:

============================================================
TRAINING COMPLETED
Training runtime: 106.51 seconds
Samples per second: 4.69
Steps per second: 0.59
Train loss: 1.0989

Which is roughly ~9.7k tokens/second (0.59 steps/s × 2048 tokens × 8 batch), about 8× below the claim. Any idea what explains this discrepancy, or am I misinterpreting the NVIDIA benchmark?
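For reference, the conversion between the two units can be sketched as below. The sequence length (2048) and per-step batch size (8) are assumptions taken from the arithmetic in this post, not values confirmed by NVIDIA's benchmark page:

```python
# Sanity-check sketch: convert between tokens/s and steps/s for the
# Llama 3.2 3B fine-tune. SEQ_LEN and BATCH_SIZE are assumed values
# from this post, not confirmed by NVIDIA.
SEQ_LEN = 2048
BATCH_SIZE = 8
TOKENS_PER_STEP = SEQ_LEN * BATCH_SIZE  # 16384 tokens processed per step

# NVIDIA's claimed peak throughput -> implied steps/s
claimed_tokens_per_s = 80_000
claimed_steps_per_s = claimed_tokens_per_s / TOKENS_PER_STEP
print(f"claimed:  {claimed_steps_per_s:.2f} steps/s")    # ~4.88

# Measured steps/s from the training log above -> implied tokens/s
measured_steps_per_s = 0.59
measured_tokens_per_s = measured_steps_per_s * TOKENS_PER_STEP
print(f"measured: {measured_tokens_per_s:.0f} tokens/s")  # ~9667
```

Under these assumptions, the measured run lands around 9.7k tokens/s against the claimed 80k, roughly an 8× gap.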