TurboQuant VS LM Studio Llama3.3 70b Q4_K_M
Posted by TimSawyer25@reddit | LocalLLaMA | View on Reddit | 9 comments
I did a quick and dirty test at 16k context and it was pretty interesting.
Running on dual 3090s.
Context VRAM: Turbo 1.8 GB -- LM 5.4 GB
Turbo -- LM
12 fact recall: 8 / 8 -- 8 / 8
Instruction discipline: 1 rule violation -- 0 violations
Mid prompt recall trap: 5 / 5 -- 5 / 5
A1 to A20 item recall: 6 / 6 -- 6 / 6
Archive Loaded stress: 15 / 20 -- 20 / 20
Vault Sealed heavy distraction: 19 / 20 -- 20 / 20
Deep Vault Sealed near limit: 26 / 26 -- 26 / 26
Objective recall total: 79 / 85 -- 85 / 85
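For anyone who wants to re-check the totals, the rows above can be re-added in a few lines of Python. This is only a sketch of the arithmetic: the scores are copied from the list above, and the instruction discipline row is left out because it counts rule violations rather than recalled items (which is consistent with the total being out of 85).

```python
# (turbo_correct, lm_correct, possible) for each objective test, copied from the post.
results = {
    "12 fact recall": (8, 8, 8),
    "Mid prompt recall trap": (5, 5, 5),
    "A1 to A20 item recall": (6, 6, 6),
    "Archive Loaded stress": (15, 20, 20),
    "Vault Sealed heavy distraction": (19, 20, 20),
    "Deep Vault Sealed near limit": (26, 26, 26),
}


def totals(rows):
    """Sum the Turbo score, the LM score, and the possible points across all rows."""
    turbo = sum(t for t, _, _ in rows.values())
    lm = sum(l for _, l, _ in rows.values())
    possible = sum(n for _, _, n in rows.values())
    return turbo, lm, possible


turbo, lm, possible = totals(results)
print(f"Turbo: {turbo}/{possible} ({turbo / possible:.1%})")  # Turbo: 79/85 (92.9%)
print(f"LM:    {lm}/{possible} ({lm / possible:.1%})")        # LM:    85/85 (100.0%)
```

All six of the missed items come from the two heaviest distraction tests (5 in Archive Loaded, 1 in Vault Sealed).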
So LM did win, but Turbo did very well considering.
Tok/s was a tad slower with turboquant.
TTFT didn't change.
Super cool tech, though I didn't check to see how large I could get the context. For head to head testing I couldn't fit more than 16k on the dual 3090s with LM, so I stopped there.
I think it's a fair trade off depending on your use case.
Anyone playing around with turboquant and seeing similar results?
FerLuisxd@reddit
Wait so turboquant can be enabled easy in lmstudio?
TimSawyer25@reddit (OP)
I wasn't using turboquant in lmstudio, I was comparing it to the same model loaded in lmstudio.
I haven't messed with it at all since, so I don't know what people have managed to do with it. If it's even remotely possible to get them working together, I'm sure someone in the homelab scene has made it happen.
Hi-Angel@reddit
I am confused, so what exactly are you comparing to LM Studio Llama? From what I understand, turboquant is an algorithm for more efficient weight packing. Some version of Llama may as well be compressed by turboquant, so I don't quite understand the topic title.
TimSawyer25@reddit (OP)
TurboQuant doesn’t alter the model weights. It compresses the KV cache/context memory. I used the same Llama 3.3 70B Q4_K_M GGUF model in both tests. LM Studio for the baseline run, and the llama.cpp TurboQuant fork launched from PowerShell for the TurboQuant run.
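A rough back-of-the-envelope sketch of why KV cache precision matters at 16k context. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are the published Llama 3 70B config as I understand it. Assumptions: 16k means 16384 tokens, the baseline cache is f16, and 3.5 bits per element is an idealized figure that ignores any per-block scales or other metadata a real format would carry.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_element):
    """Bytes needed to store keys and values for n_ctx tokens."""
    # The factor of 2 covers one K vector and one V vector per layer and KV head.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * bits_per_element / 8


# Assumed Llama 3 70B attention shape (grouped-query attention with 8 KV heads).
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)
N_CTX = 16 * 1024  # assumption: "16k" means 16384 tokens

f16 = kv_cache_bytes(N_CTX, bits_per_element=16, **LLAMA_70B)
low = kv_cache_bytes(N_CTX, bits_per_element=3.5, **LLAMA_70B)

print(f"f16 cache:     {f16 / 1e9:.2f} GB")  # f16 cache:     5.37 GB
print(f"3.5-bit cache: {low / 1e9:.2f} GB")  # 3.5-bit cache: 1.17 GB
```

The f16 estimate lines up with the 5.4 GB measured for LM. The idealized 3.5-bit estimate comes in under the 1.8 GB measured for Turbo; the gap may be format metadata or other buffers counted in the measurement, but that is a guess.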
LevitySolution@reddit
From the FAQ: Is the zero-loss claim real?
At 3.5 bits, the paper reports quality neutrality on long-context benchmarks. At 2.5 bits there is a small drop on harder edge cases. You didn't mention if you had 2.5 or 3.5 bit, but if they are correct it would imply you had 2.5 bit compression.
TimSawyer25@reddit (OP)
I was running turbo3 (-ctk turbo3 -ctv turbo3), which (as I understand it) is more or less 3.5-bit, not 2.5. I'm assuming sharding across the two cards may have played a role in my result. I'm going to hit it again with a smaller model on a single card and see how it behaves. I'll share what I find. Thanks for the FAQ, I hadn't seen it.
LevitySolution@reddit
Apparently some models are OK with compression of V but not K, I think it is.
TimSawyer25@reddit (OP)
I just saw Alex Ziskind's video and he said thetom told him to use Q8 on K and Turbo on V. Appears that changed the game. Cool that he seems to show nearly the same performance issues with turbo3 on both K and V that I saw in my quick and dirty test here.
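A small sketch of how the two cache setups differ on the command line, written as a Python helper that builds the argument list. Assumptions, none of them confirmed in this thread: the binary is called `llama-server`, "Q8" means the `q8_0` cache type, and `model.gguf` is a placeholder path. `-m`, `-c`, `-ctk` and `-ctv` are standard llama.cpp options; `turbo3` is the fork's cache type mentioned above.

```python
import shlex


def llama_cpp_args(model_path, n_ctx, cache_type_k, cache_type_v,
                   binary="llama-server"):
    """Build an argument list for a llama.cpp launch (binary name is an assumption)."""
    return [
        binary,
        "-m", model_path,       # placeholder model path
        "-c", str(n_ctx),       # context size in tokens
        "-ctk", cache_type_k,   # KV cache type for keys
        "-ctv", cache_type_v,   # KV cache type for values
    ]


# Setup used in the original test: turbo3 on both K and V.
both_turbo = llama_cpp_args("model.gguf", 16384, "turbo3", "turbo3")

# Mixed setup from the video: 8-bit keys, turbo3 values ("q8_0" is my reading of "Q8").
mixed = llama_cpp_args("model.gguf", 16384, "q8_0", "turbo3")

print(shlex.join(both_turbo))
print(shlex.join(mixed))
```

Keeping K at higher precision costs more VRAM than turbo3 on both, so the mixed setup trades back some of the memory savings for quality.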
fragment_me@reddit
I ran some KLD tests and it was worse than Q4_0. So it makes no sense to me. I think the implementation was not accurate but this is all foreign to me so I’m just speculating.
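For readers unfamiliar with KLD tests: the idea is to compare the next-token probabilities of a baseline run against the same model with a compressed cache, and average the KL divergence over token positions. Below is a minimal sketch of that calculation, not the tool that was actually used. The logits are made-up toy numbers, not measurements.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero in q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kld(base_logits, test_logits):
    """Average per-position KL divergence between baseline and test logits."""
    klds = [kl_divergence(softmax(b), softmax(t))
            for b, t in zip(base_logits, test_logits)]
    return sum(klds) / len(klds)


# Toy example: two token positions over a three-token vocabulary.
base = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
noisy = [[1.8, 1.1, 0.2], [0.6, 2.2, 0.5]]

print(mean_kld(base, base))   # identical distributions give 0.0
print(mean_kld(base, noisy))  # small positive value; larger means more drift
```

A lower mean KLD means the compressed run stays closer to the baseline, so "worse than Q4_0" would mean a higher value than a Q4_0 cache gives on the same text.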