Made a CLI to run LLMs with TurboQuant with a one-click setup (open-source)
Posted by Osprey6767@reddit | LocalLLaMA | 2 comments
Hey everyone,
I'm a junior dev with a 3090 and I've been running local models for a while. Llama.cpp still hasn't shipped official TurboQuant support, but TurboQuant has been working great for me: I got a Q4 version of Qwen3.5-27B running at max context on my 3090 at 40 tps. I tested a ton of models in LM Studio using regular llama.cpp (glm-4.7-flash, gemma-4, etc.), and Qwen3.5-27B was the best I found. According to benchmarks from artificialanalysis.ai, Gemma scores significantly lower than Qwen3.5-27B, so I don't recommend it. Note that I used a distilled Opus version from https://huggingface.co/Jackrong/Qwopus3.5-27B-v3-GGUF rather than the native Qwen3.5-27B. The model remembers everything in context and beats many cloud endpoints.
I built a simple CLI tool so anyone can test GGUF models from Hugging Face with TurboQuant. It bundles the compiled engine (exe + DLLs, including the CUDA runtime), so you don't need CMake or Visual Studio: just git clone, run setup.bat, and you're done. I'd add Mac support if enough people want it.
It auto-calculates VRAM usage before loading a model (showing whether it fits in your GPU or spills into system RAM), saves presets so you don't have to retype paths every time, and hosts a local endpoint so you can connect it to agentic coding tools. It's Apache 2.0 licensed, Windows-only, and uses TurboQuant (turbo2/3/4).
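The VRAM pre-check described above can be approximated as weights (GGUF file size) plus a KV-cache estimate plus fixed overhead. This is a hypothetical sketch, not turbo-cli's actual code; the function names, the default 24 GiB budget, and all model dimensions are my assumptions.

```python
def estimate_vram_gib(gguf_size_gib: float,
                      n_layers: int,
                      n_kv_heads: int,
                      head_dim: int,
                      ctx_len: int,
                      kv_bytes_per_elt: float = 2.0,
                      overhead_gib: float = 1.0) -> float:
    """Rough VRAM estimate: weights (file size) + KV cache + fixed overhead."""
    # KV cache: 2 tensors (K and V) per layer, per KV head, per position.
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes_per_elt
    return gguf_size_gib + kv_bytes / (1024 ** 3) + overhead_gib


def fits(required_gib: float, vram_gib: float = 24.0) -> bool:
    # A 3090 has 24 GiB; anything beyond that spills into system RAM.
    return required_gib <= vram_gib
```

For example, a 16 GiB Q4 file with a placeholder config (48 layers, 8 KV heads, head dim 128) at 8k fp16 context comes out to 18.5 GiB, which fits on a 3090.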
Here's the repo: https://github.com/md-exitcode0/turbo-cli
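Local servers like this usually expose an OpenAI-compatible chat completions route, which is what lets agentic coding tools plug in. Assuming turbo-cli follows that convention (I haven't verified the exact path or port; both are placeholders here), connecting to it looks roughly like:

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, model: str = "local"):
    """Build an OpenAI-style chat completion request (payload shape is an assumption)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    # Passing data= makes this a POST request.
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


# With the server running locally (port is hypothetical):
# req = build_chat_request("http://127.0.0.1:8080", "Explain KV caching briefly.")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```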
If this saves you from build hell, a star is appreciated :)
DM me if any questions.
stddealer@reddit
I don't think mainline llama.cpp will ever support TurboQuant. The maintainers called the PRs trying to implement it "slop", which is rude to the people who worked on it, but understandable: it isn't measurably better than llama.cpp's current KV quantization methods, it would add a lot of complexity (especially if it has to be supported across all backends), and it's extremely overhyped.
Osprey6767@reddit (OP)
Honestly, I tested it a ton and it's far better than the regular KV quantization methods. I don't think it's overhyped, because the quality of the context is lossless while I can fit not just 10k context on my 3090 but the full 262k. And about the lossless part, I'm totally serious.
This is the first time I've had a local model that can reasonably code with full context. I won't call TurboQuant trash. It's not overhyped: it delivers, and it's a big step toward usable, affordable local AI.
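For scale, back-of-envelope KV-cache arithmetic (the layer count and head dims here are placeholder assumptions, not Qwen3.5-27B's real config) shows why cutting the cache to roughly 4 bits per element can be the difference between ~10k context and the full 262k on a 24 GiB card:

```python
def kv_cache_gib(ctx_len: int,
                 n_layers: int = 48,
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_elt: float = 2.0) -> float:
    """KV cache size in GiB: K and V tensors per layer, per KV head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt / (1024 ** 3)


fp16_full = kv_cache_gib(262_144)                     # fp16 cache at 262k context
q4_full = kv_cache_gib(262_144, bytes_per_elt=0.5)    # ~4-bit cache, same context
```

With these assumed dimensions, the fp16 cache alone at 262k is 48 GiB (far over a 3090's 24 GiB before weights are even counted), while a ~4-bit cache needs 12 GiB, leaving room for a Q4 model.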