fragment_me

In Q8_0 weight quantization, why can't we just skip blocks of 32 that have very large outliers?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 18 comments

fragment_me@reddit (OP)

I believe the imatrix files they're using weight values by activation importance, so that the final scales are tailored to the more important values. But the less "important" values then suffer from quantization. Imagine just skipping them from quantization completely: you'd get the best of both worlds with a very small increase in size (1% probably).
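To make the idea concrete, here is a minimal sketch of Q8_0-style block quantization in plain Python. It assumes the usual ggml layout (32 int8 values plus one fp16 scale, 34 bytes per block) and a 16-bit fallback of 64 bytes per skipped block. It ignores the imatrix and whatever per-block flag a real format would need, and the helper names and test values are made up for illustration.

```python
import random

BLOCK = 32  # Q8_0 quantizes weights in blocks of 32


def q8_0_roundtrip(block):
    """Quantize one block to int8 with a single shared scale, then dequantize."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [max(-127, min(127, round(x / scale))) for x in block]
    return [q * scale for q in quants]


def rms_error(original, restored):
    """Root-mean-square error between a block and its dequantized copy."""
    return (sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)) ** 0.5


def size_overhead(skip_fraction, q8_bytes=34, raw_bytes=64):
    """Relative size increase if skip_fraction of blocks stay in 16-bit."""
    return skip_fraction * (raw_bytes - q8_bytes) / q8_bytes


random.seed(0)
normal_block = [random.gauss(0.0, 0.02) for _ in range(BLOCK)]  # hypothetical weights
outlier_block = normal_block[:-1] + [2.0]  # one large outlier stretches the scale

err_normal = rms_error(normal_block, q8_0_roundtrip(normal_block))
err_outlier = rms_error(outlier_block, q8_0_roundtrip(outlier_block))

print(f"RMS error, normal block:  {err_normal:.6f}")
print(f"RMS error, outlier block: {err_outlier:.6f}")
print(f"Size overhead if 1% of blocks are skipped: {size_overhead(0.01):.2%}")
```

The outlier forces a much coarser scale on the other 31 values in its block, which is where the extra error comes from. Under these assumptions, leaving 1% of blocks in 16-bit costs a bit under 1% in size, which is in line with the guess above.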

In Q8_0 weight quantization, why can't we just skip blocks of 32 that have very large outliers?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 18 comments

fragment_me@reddit (OP)

I appreciate the positive energy! I tried but ran into some roadblocks, which I believe were related to code paths that are highly optimized for specific formats. On paper this idea seems so simple and could provide some good quality for low quant models. Ultimately, I think I will need to not vibe it so I can fully understand the issues.

I trusted random person on this subreddit and bought 3080 20gb made of chinesium

Posted by SwimmerJazzlike@reddit | LocalLLaMA | View on Reddit | 243 comments

fragment_me@reddit

Something-something ask and you shall receive: https://nvidia.custhelp.com/app/answers/detail/a_id/5165/~/nvidia-resizable-bar-firmware-update-tool

I trusted random person on this subreddit and bought 3080 20gb made of chinesium

Posted by SwimmerJazzlike@reddit | LocalLLaMA | View on Reddit | 243 comments

fragment_me@reddit

You can upgrade the firmware for ReBAR through an NVIDIA utility that works in Linux and Windows. I just passed the GPUs to a Windows VM since it was easier and then moved them back to the Ubuntu VM. The hardware says it supports ReBAR, but I had issues enabling it in Ubuntu since my actual host server doesn't support it. I have a newer server I'll be moving the cards into later this week to fully test it.

I trusted random person on this subreddit and bought 3080 20gb made of chinesium

Posted by SwimmerJazzlike@reddit | LocalLLaMA | View on Reddit | 243 comments

fragment_me@reddit

Hmm, I thought I responded but can't find it. Anyway, I'm the redditor you trusted! I just purchased 2 more to bring me up to 120GB VRAM *salute* https://ebay.us/iAXbPQ

Family member just passed away this morning , need a distraction. Any good 1b models you can suggest for layla ??

Posted by Opening-Ad6258@reddit | LocalLLaMA | View on Reddit | 14 comments

Has anyone experimented with stabilizing low quant models with lower temp and top p?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 12 comments

fragment_me@reddit (OP)

I was thinking about running the BF16 version with a fixed seed and then running the quantized version with the same seed and prompt to see how much the output differs. I think after several tests you could find some optimal parameters. It may also help to see benchmarks at the same top p for the native and quantized models to understand how much they diverge.
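One rough way to score such a same-seed comparison is to check how far the two outputs agree before they first diverge. The sketch below assumes you already have both generations as text. It splits on whitespace instead of using the model's tokenizer, and the function name and sample strings are made up for illustration.

```python
def common_prefix_len(tokens_a, tokens_b):
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        n += 1
    return n


# Hypothetical outputs from the same prompt and seed.
bf16_output = "The cat sat on the mat".split()
quant_output = "The cat sat in the hat".split()

shared = common_prefix_len(bf16_output, quant_output)
print(f"Outputs agree for the first {shared} of {len(bf16_output)} tokens")
```

Repeating this over several prompts at different temp and top p settings would show which settings keep the quantized model on the BF16 path the longest.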

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 24 comments

fragment_me@reddit (OP)

That's a good point. It makes sense that BF16 would have slower PP since it's compute bound. I guess llama.cpp just doesn't focus much on BF16 because it's less widely used. That's unfortunate because it's the native precision for a lot of the great local models. Also, I finally have enough VRAM to run Qwen 3.6 27B BF16, and I tested the PP speeds, and it's unusable for me: about 260 pp tok/s with 2x 3090 and 2x 3080. Token gen is ~41 tok/s with MTP set to 4. It's unfortunate that llama.cpp does better with MTP but worse with PP, whereas vLLM is the opposite. So basically I'll stick to vLLM since I now have 4 GPUs. I was also able to confirm that it produced quite a bit fewer reasoning tokens, so this quant seems to be working very well. I was worried that the reduced thinking was due to the model being lobotomized, but that doesn't seem to be the case. More thinking seems to trend with lower/worse quants.

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 24 comments

fragment_me@reddit (OP)

I believe it's just due to Unsloth quantizing attn_qkv to Q8_0 while this recipe keeps it at BF16. So this quant might be bigger. Try a few prompts with the same seed value and let me know how it goes.

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Posted by FantasticNature7590@reddit | LocalLLaMA | View on Reddit | 24 comments

8GB 2017 MacBook Air breaks record with Quantum Processor help on tuning a 30B Qwen MoE model - Quantum 15,489% boost!

Posted by Overall-Importance54@reddit | LocalLLaMA | View on Reddit | 57 comments

Qwen3.6-27B Quantization Benchmark

Posted by bobaburger@reddit | LocalLLaMA | View on Reddit | 73 comments

Info: Nvidia Cuda 13.3 landed

Posted by parrot42@reddit | LocalLLaMA | View on Reddit | 47 comments

SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More

Posted by CuriousPlatypus1881@reddit | LocalLLaMA | View on Reddit | 41 comments

fragment_me@reddit

I am disappointed to see local models missing from this. We already know Gemini, ChatGPT, Claude, and DeepSeek are good. We want to know how well local models do because many of them seem to be benchmaxed, and it's hard to discern their true level.

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable)

Posted by Gailenstorm@reddit | LocalLLaMA | View on Reddit | 44 comments

Next year we're getting 0.5T model from Grok

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 200 comments

Tinygrad: Hacked 4090 driver to enable P2P

Posted by mrdevlar@reddit | LocalLLaMA | View on Reddit | 17 comments

AMD BC-250 and the search for Cheap Compute

Posted by dugganmania@reddit | LocalLLaMA | View on Reddit | 41 comments

I guess 4 units wasn’t enough.

Posted by Simple_Library_2700@reddit | LocalLLaMA | View on Reddit | 35 comments

Heretic has been served a legal notice by Meta, Inc.

Posted by -p-e-w-@reddit | LocalLLaMA | View on Reddit | 349 comments

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

fragment_me@reddit (OP)

This is not the same one I got. Mine has a sleek black case with 1 blower-style fan. It's made more to sit in a server chassis, although it's not very loud. Internally it's probably very similar (in terms of the new VRAM chips).

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

fragment_me@reddit (OP)

I didn't see any NVLink when I set them up. TBH I wouldn't worry much about it unless you plan to do training. Inference is pretty fast over PCIe. It's even faster if you have a newer mobo that supports ReBAR, as you can download modified NVIDIA drivers to enable P2P communication between the cards.

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences?

Posted by Borkato@reddit | LocalLLaMA | View on Reddit | 93 comments

fragment_me@reddit

Gemma is very sensitive to model and KV cache quant. Qwen not so much. I think Qwen Q6 is great, but I personally use Q8 and above. I would never touch Gemma 4 at anything below Q8.

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

fragment_me@reddit (OP)

Surprisingly not high. The fan style does seem to be for servers, but I have them outside of my Dell R730 with a PCIe riser cable, and I cut holes in the top of the case. I didn't notice them much. I definitely wouldn't trust having them in the same room AND outside the case. If these were in a 2U server, they wouldn't be too loud.

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

fragment_me@reddit (OP)

I would give it a shot. In terms of per GB cost, it's roughly the same, but the 4080 is newer, will be supported longer, and may be a bit faster on PP. I went the eBay route just because it's easier.

I hope that someday we will have a 124B Gemma.

Posted by cgs019283@reddit | LocalLLaMA | View on Reddit | 77 comments

fragment_me@reddit

Are you on OpenWebUI? If so, you may need to set native tool calling as it's not a default. Also, there seems to be some bug in OpenWebUI where after 10+ tool calls it just stops. I've been able to repeat this using llama-server and vLLM as the backend. It's a shame vLLM doesn't bundle a webUI like llama-cpp does.

Developers who use local AI - Q4_0 vs Q8_0 KV quant?

Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 89 comments

fragment_me@reddit

If you're using F16 KV cache, you might as well try BF16 (if your hardware can handle it). See here for some KLD benchmarks for Qwen3.5: https://techstat.net/qwen3-5-27b-q8-kv-cache-benchmarks-bf16-vs-f16-vs-q8_0/
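For anyone unfamiliar with KLD as a metric: it compares the next-token probability distribution of a reference run against the one produced with a quantized model or cache, and lower is better. Below is a minimal sketch of the per-token calculation. The logits are made-up numbers, not values from the linked benchmarks.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference distribution."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one token position over a 3-token vocabulary.
reference = softmax([2.0, 1.0, 0.1])  # e.g. full-precision KV cache
quantized = softmax([1.8, 1.2, 0.1])  # e.g. quantized KV cache

kld = kl_divergence(reference, quantized)
print(f"KLD for this position: {kld:.6f}")
```

A benchmark would average this over many token positions, so a single position like this only illustrates the formula.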

Developers who use local AI - Q4_0 vs Q8_0 KV quant?

Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 89 comments

fragment_me@reddit

I'm learning from various benchmarks that it depends quite a bit on the model. Q8 *usually* is pretty good but degrades at long context. I wouldn't go lower. I personally stick to the native KV cache precision now. For Qwen, that's actually BF16, not F16, which is the default in llama.cpp. If you really want to go lower, reduce V but keep K higher, e.g. K as BF16 and V as Q8_0.
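To get a feel for what the mixed K/V setup saves, here is a back-of-the-envelope sketch. It assumes 16 bits per value for F16/BF16 and 8.5 bits per value for Q8_0 (32 int8 values plus an fp16 scale per block). The layer count, KV head count, and head dimension are hypothetical placeholders, not the dimensions of any particular Qwen or Gemma model.

```python
# Approximate bits stored per cached value for each cache type.
BITS = {"f16": 16.0, "bf16": 16.0, "q8_0": 8.5}


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Bytes of KV cache needed per token of context."""
    values = n_layers * n_kv_heads * head_dim  # values per token for K (same for V)
    return values * (BITS[k_type] + BITS[v_type]) / 8


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

full = kv_bytes_per_token(**shape, k_type="bf16", v_type="bf16")
mixed = kv_bytes_per_token(**shape, k_type="bf16", v_type="q8_0")

print(f"K=BF16, V=BF16: {full:.0f} bytes/token")
print(f"K=BF16, V=Q8_0: {mixed:.0f} bytes/token ({mixed / full:.1%} of full)")
```

With these placeholder dimensions, quantizing only V to Q8_0 brings the cache down to roughly three quarters of the all-BF16 size while leaving K untouched.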

"Elias Thorne" is what eight different LLMs name a lighthouse keeper. He's also selling cancer treatment advice on Amazon

Posted by prescorn@reddit | LocalLLaMA | View on Reddit | 54 comments

fragment_me@reddit

That's comical, and just reinforces how much I hate reading anything LLM generated. Ironically, I end up seeing this text so much because one of my favorite quick and dirty benchmarks for throughput is to tell a model to write me a 2000 word short story. In fact, I just did it now because I made some MTP setting change. Sure enough, here's the title - **The Last Clockmaker of New Veridia.**

MTP support merged into llama.cpp

Posted by tacticaltweaker@reddit | LocalLLaMA | View on Reddit | 108 comments

fragment_me@reddit

There are many benchmarks already out that show TQ4 performing just under Q4, and vLLM just released their benchmarks on it too. Any KV cache at 4 bits (whether TQ or Q) suffers as the context window grows. When TQ KLD performs better than Q8, then I will jump on the bandwagon. I really want it to succeed, but so far it hasn't. The only positive thing that came out of it is that the Hadamard rotation got attention and was implemented in llama main and IK llama.

MTP support merged into llama.cpp

Posted by tacticaltweaker@reddit | LocalLLaMA | View on Reddit | 108 comments

fragment_me@reddit

You are on some custom fork because TurboQuant is not available in llama.cpp. I highly suggest you stick to Q8 KV cache if strapped for memory. Q4 is better than TQ4, yet 4-bit KV cache just doesn't maintain accuracy. Further, KV cache effects are more pronounced at a larger context window. You're better off switching to Q8 and halving the cache. Or just switch to a Q5 model if you need more cache. Finally, you need the --spec-type parameter for MTP when you switch to the main repo and branch.
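The "switch to Q8 and halve the cache" suggestion roughly works out in memory terms. A quick sketch, assuming the ggml block layouts for Q4_0 (18 bytes per 32 values) and Q8_0 (34 bytes per 32 values); the context size is a hypothetical example.

```python
# Bytes per 32-value block: a 2-byte fp16 scale plus the packed quants.
BLOCK_BYTES = {"q4_0": 2 + 16, "q8_0": 2 + 32}


def bits_per_value(cache_type):
    """Effective bits stored per cached value, including the block scale."""
    return BLOCK_BYTES[cache_type] * 8 / 32


def context_at_equal_memory(ctx_tokens, from_type, to_type):
    """Context length that fits in the same memory after changing cache type."""
    return int(ctx_tokens * bits_per_value(from_type) / bits_per_value(to_type))


q4_ctx = 65536  # hypothetical context size with a Q4_0 cache
q8_ctx = context_at_equal_memory(q4_ctx, "q4_0", "q8_0")

print(f"q4_0: {bits_per_value('q4_0')} bits/value, q8_0: {bits_per_value('q8_0')} bits/value")
print(f"Same memory holds about {q8_ctx} tokens at q8_0 ({q8_ctx / q4_ctx:.0%} of the q4_0 context)")
```

So at equal memory, a Q8_0 cache holds slightly more than half the context of a Q4_0 cache.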

MTP PR Merged!!!

Posted by Valuable_Touch5670@reddit | LocalLLaMA | View on Reddit | 101 comments

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 24 comments

fragment_me@reddit (OP)

In this context, "thinks less" means that the total number of tokens generated during reasoning was lower in every test I ran, as shown in the examples.