TurboQuant vs LM Studio: Llama 3.3 70B Q4_K_M

Posted by TimSawyer25@reddit | LocalLLaMA | 9 comments

I did a quick and dirty test at 16k context and the results were pretty interesting.

Running on dual 3090s.

Context VRAM: Turbo 1.8 GB -- LM 5.4 GB

Scores (Turbo -- LM):
12 fact recall: 8 / 8 -- 8 / 8

Instruction discipline: 1 rule violation -- 0 violations

Mid prompt recall trap: 5 / 5 -- 5 / 5

A1 to A20 item recall: 6 / 6 -- 6 / 6

Archive Loaded stress: 15 / 20 -- 20 / 20

Vault Sealed heavy distraction: 19 / 20 -- 20 / 20

Deep Vault Sealed near limit: 26 / 26 -- 26 / 26

Objective recall total: 79 / 85 -- 85 / 85
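
For anyone who wants to re-check the totals, here is a minimal Python sketch that sums the per-test scores listed above. The numbers are copied straight from the post; the variable names are just for illustration. Instruction discipline is left out because it is reported as a violation count, not a recall score.

```python
# Per-test objective recall scores from the post: (turbo_correct, lm_correct, max_score)
scores = {
    "12 fact recall": (8, 8, 8),
    "Mid prompt recall trap": (5, 5, 5),
    "A1 to A20 item recall": (6, 6, 6),
    "Archive Loaded stress": (15, 20, 20),
    "Vault Sealed heavy distraction": (19, 20, 20),
    "Deep Vault Sealed near limit": (26, 26, 26),
}

turbo_total = sum(turbo for turbo, _, _ in scores.values())
lm_total = sum(lm for _, lm, _ in scores.values())
max_total = sum(n for _, _, n in scores.values())

turbo_pct = 100 * turbo_total / max_total
lm_pct = 100 * lm_total / max_total

# Tests where the TurboQuant run dropped items, with the number of misses
turbo_misses = {name: n - turbo for name, (turbo, _, n) in scores.items() if turbo < n}

print(f"Turbo: {turbo_total}/{max_total} ({turbo_pct:.1f}%)")
print(f"LM:    {lm_total}/{max_total} ({lm_pct:.1f}%)")
print("Turbo misses:", turbo_misses)
```

All six of the missed items come from the two 20-item stress tests; the other four tests are perfect on both sides.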

So LM did win, but Turbo did very well considering it used a third of the context VRAM.

Tok/s was a tad slower with turboquant.

TTFT didn't change.

Super cool tech, though I didn't check to see how large I could get the context. For head-to-head testing I couldn't fit more than 16k on the dual 3090s with LM, so I stopped there.

I think it's a fair trade-off depending on your use case.

Anyone playing around with turboquant and seeing similar results?

FerLuisxd@reddit

Wait, so turboquant can be enabled easily in lmstudio?

TimSawyer25@reddit (OP)

I wasn't using turboquant in lmstudio; I was comparing it to the same model loaded in lmstudio.

I haven't messed with it at all since, so I don't know what people have managed to do with it. If it's even remotely possible to get them working together, I'm sure someone in the homelab scene has made it happen.

Hi-Angel@reddit

I am confused, so what exactly are you comparing to LMStudio Llama? From what I understand, turboquant is an algorithm for more efficient weight packing. Some version of Llama may as well be compressed by turboquant, so I don't quite understand the topic title.

TimSawyer25@reddit (OP)

TurboQuant doesn’t alter the model weights. It compresses the KV cache/context memory. I used the same Llama 3.3 70B Q4_K_M GGUF model in both tests. LM Studio for the baseline run, and the llama.cpp TurboQuant fork launched from PowerShell for the TurboQuant run.
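
As a rough sanity check on the context VRAM numbers, here is a back-of-the-envelope sketch of KV cache size. It assumes the standard Llama 3 70B attention layout (80 layers, 8 KV heads, head dimension 128), an unquantized f16 cache for the baseline, and a flat 3.5 bits per value for the compressed cache. None of those figures come from the post itself, and the function name is made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Rough KV cache size in bytes.

    Each token stores one K and one V vector of n_kv_heads * head_dim
    values per layer. Ignores per-block scale metadata and compute buffers.
    """
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K + V
    return values_per_token * n_ctx * bits_per_value / 8

# Assumed Llama 3 70B attention layout (not stated in the thread)
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)
N_CTX = 16 * 1024  # the 16k context used in the test

f16_bytes = kv_cache_bytes(**LLAMA_70B, n_ctx=N_CTX, bits_per_value=16)
compressed_bytes = kv_cache_bytes(**LLAMA_70B, n_ctx=N_CTX, bits_per_value=3.5)

print(f"f16 cache:     {f16_bytes / 1e9:.2f} GB")         # 5.37 GB
print(f"3.5-bit cache: {compressed_bytes / 1e9:.2f} GB")  # 1.17 GB
```

The f16 estimate lands close to the 5.4 GB reported for LM Studio. The 3.5-bit estimate comes out below the reported 1.8 GB, which could be down to quantization metadata, extra buffers, or how the VRAM was measured; the post does not give enough detail to say.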

LevitySolution@reddit

From the FAQ: Is the zero-loss claim real?

At 3.5 bits, the paper reports quality neutrality on long-context benchmarks. At 2.5 bits there is a small drop on harder edge cases. You didn't mention if you had 2.5 or 3.5 bit, but if they are correct it would imply you had 2.5 bit compression.

TimSawyer25@reddit (OP)

I was running turbo3 (-ctk turbo3 -ctv turbo3), which, as I understand it, is more or less 3.5 bit, not 2.5. I'm assuming sharding may have played a role in my result. I'm going to hit it again with a smaller model on a single card and see how it behaves. I'll share what I find. Thanks for the FAQ, I hadn't seen it.

LevitySolution@reddit

Apparently some models are OK with compression of V but not K, I think it is.

TimSawyer25@reddit (OP)

I just saw Alex Ziskind's video and he said thetom told him Q8 on K and Turbo on V. Apparently that changed the game. Cool that he seems to show nearly the same performance issues with turbo3 on both K and V that I saw in my quick and dirty test here.
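
To make the two cache setups from this thread concrete, here is a small Python sketch that builds the launch command for each. `-m`, `-c`, `-ctk` and `-ctv` are regular llama.cpp options and `q8_0` is a regular cache type; `turbo3` is only available in the TurboQuant fork mentioned above. The `llama-server` binary name and the model filename are assumptions, so adjust them to whatever your build and files are called.

```python
import shlex

def llama_cmd(model_path, n_ctx, cache_k, cache_v, binary="llama-server"):
    """Build an argv list for a llama.cpp-style launcher.

    binary is an assumed executable name; the fork's may differ.
    """
    return [
        binary,
        "-m", model_path,   # GGUF model file
        "-c", str(n_ctx),   # context length in tokens
        "-ctk", cache_k,    # K cache type
        "-ctv", cache_v,    # V cache type
    ]

MODEL = "Llama-3.3-70B-Q4_K_M.gguf"  # hypothetical filename

# Setup used in the original test: turbo3 on both K and V
turbo_both = llama_cmd(MODEL, 16384, "turbo3", "turbo3")

# Setup suggested in the video: Q8 on K, turbo3 on V
q8_k_turbo_v = llama_cmd(MODEL, 16384, "q8_0", "turbo3")

print(shlex.join(turbo_both))
print(shlex.join(q8_k_turbo_v))
```

Keeping the argv as a list means it can be passed directly to `subprocess.run` without shell quoting issues.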

fragment_me@reddit

I ran some KLD tests and it was worse than Q4_0, so it makes no sense to me. I think the implementation was not accurate, but this is all foreign to me so I’m just speculating.
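
For readers unfamiliar with KLD: it measures how far the next-token distribution of a quantized run drifts from a reference run, with 0 meaning identical. The sketch below shows the basic calculation for a single token position using made-up logits. It is only an illustration of the metric, not the tooling or data used for the tests mentioned above.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(P || Q) in nats, where P is the reference distribution
    (e.g. f16 KV cache) and Q is the distribution under test
    (e.g. a quantized KV cache)."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up logits over a 4-token vocabulary, for illustration only
reference = [2.0, 1.0, 0.1, -1.0]
slightly_off = [1.8, 1.1, 0.2, -0.9]
badly_off = [0.0, 2.0, 1.0, 0.5]

kld_same = kl_divergence(reference, reference)
kld_small = kl_divergence(reference, slightly_off)
kld_large = kl_divergence(reference, badly_off)

print(kld_same, kld_small, kld_large)
```

In practice the value is averaged over many token positions of a test text, and a lower mean KLD means the quantized run stays closer to the reference.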