Info: NVIDIA CUDA 13.3 landed

[-]

LinkSea8324@reddit

▶ New Features
▶ Enabled memory-parsimonious tiling for FP64 emulated matrix multiplications. This improvement ensures that the workspace memory budget no longer exceeds 8 GB.
▶ Added support for CUDA Green contexts.
▶ Improved FP4 matrix multiplication performance on Blackwell Ultra GPUs by a geometric mean of 5% across a wide range of problems, with up to 7% speedup for some small problems.
▶ Improved TF32 matrix multiplication performance on Blackwell and Blackwell Ultra GPUs by a geometric mean of 27% across a wide range of problems and layouts, with up to 3.5x speedup for some small problems.
▶ Improved TF32 TN matrix multiplication performance on Hopper GPUs by a geometric mean of 11% across a wide range of problems, with up to 40% speedup for some small problems.
▶ Improved SYMV performance with TMA-based acceleration for Hopper, Blackwell, and Blackwell Ultra kernels.
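
For anyone reading the "geometric mean of N%" figures: the usual reading is that per-problem speedup ratios (new throughput divided by old) are averaged multiplicatively, so one big outlier such as a 3.5x small problem does not dominate the headline number. A rough sketch of that arithmetic follows. The function name and the sample ratios are made up for illustration and are not NVIDIA's benchmark data or methodology.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-problem speedup ratios (new / old throughput)."""
    # Averaging the logs and exponentiating is equivalent to taking the
    # n-th root of the product, but avoids overflow on long lists.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical ratios, NOT measured numbers: most problems barely move,
# one small problem gets a large boost.
ratios = [1.0, 1.1, 1.2, 3.5]

gm = geomean_speedup(ratios)
am = sum(ratios) / len(ratios)
print(f"geometric mean speedup: {(gm - 1) * 100:.0f}%")
print(f"arithmetic mean speedup: {(am - 1) * 100:.0f}%")
```

The geometric mean always comes out at or below the arithmetic mean, which is why it is the more conservative way to summarize a spread of speedups.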

[-]

dtdisapointingresult@reddit

I'm no expert on this but it seems to me none of these help 99.9% of users on this sub?

FP4 improvements are for Blackwell Ultra, which none of us have, and TF32 is some weird type none of us use.

Is there any benefit for your average Blackwell consumer GPU user?
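
For context on what TF32 is: it keeps float32's sign bit and 8 exponent bits but only 10 of the 23 mantissa bits, so it has float32's range with roughly half-precision accuracy. Below is a minimal pure-Python sketch that imitates this by clearing the low 13 mantissa bits of a float32. The helper names `to_tf32` and `dot_tf32` are made up for illustration, and plain truncation is a simplifying assumption; real hardware may round differently.

```python
import struct


def to_tf32(x: float) -> float:
    """Approximate TF32 by truncating a float32 to its top 10 mantissa bits."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep sign (1 bit), exponent (8 bits), and the top 10 mantissa bits;
    # clear the remaining 13 low mantissa bits (assumption: truncation).
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_tf32(a, b):
    """Dot product with TF32-truncated inputs and full-precision accumulation."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))


print(to_tf32(3.14159265))  # 3.140625
print(dot_tf32([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

Small integers and simple fractions survive unchanged, while values like pi lose everything past about three decimal digits.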

[-]

Dany0@reddit

Helllll yeah gotta go test this. CUDA updates historically gave much bigger speedups than they have lately; if this is confirmed it'd be lovely

[-]

nmrk@reddit

Oh nice! Drivers and CUDA updated automatically in Proxmox, running fine.

[-]

ilintar@reddit

Yeah, the bug from 13.2 is finally fixed.

[-]

Thireus@reddit

Did they solve the iq*_s quantization issues?

[-]

parrot42@reddit (OP)

I tried `test-backend-ops test -o MUL_MAT_ID -b CUDA0` with b9357 and CUDA 13.3. Now there are no iq errors anymore!

[-]

parrot42@reddit (OP)

I have no idea, but it works. I am stress-testing it by installing supabase for honcho for hermes using opencode and qwen and it is doing good.

[-]

a_beautiful_rhind@reddit

Nothing for my 3090s in it, most likely.

[-]

a_beautiful_rhind@reddit

Probably no speedups for older gpu on this update.

[-]

Freonr2@reddit

torchao have bf16 stochastic rounding on sm12x yet?

[-]

kivaougu@reddit

Hopefully this has had better QA than 13.2

[-]

vladlearns@reddit

they definitely need to hire more good QAs

[-]

lowlifecat@reddit

Thank you. Anything good in the update? I mean, any update is a good update, but is there a *good* update?

[-]

Since I don't understand the release notes, I told opencode/qwen to run nvidia-smi and read the notes. It told me that cuBLAS is 5% faster, TF32 is 27% faster on Blackwell, and that it could unlock tile-based rendering once implemented in llama.cpp.
So I think it is a good update, but what do I know?

[-]

parrot42@reddit (OP)

Just downloaded and installed CUDA 13.3 with driver 610.43.02
Much smoother setup under trixie with a backported 7.0 kernel than 12.2.1
Recompiled llama.cpp and everything seems to work (but I just tested with 5 messages to opencode).

[-]