fragment_me

In Q8_0 weight quantization, why can't we just skip blocks of 32 that have very large outliers?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

fragment_me@reddit (OP)

Insightful, thank you.

Reply

fragment_me@reddit (OP)

I believe the imatrix files they're using weight values by activation importance, so the final scales are better tailored to the more important values. But the less "important" values still suffer from quantization. Imagine just skipping them from quantization completely: you'd get the best of both worlds with a very small increase in size (probably 1%).
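To make the outlier problem concrete, here is a minimal sketch of Q8_0-style block quantization (blocks of 32, one fp16 scale per block, signed 8-bit values). It is not the llama.cpp implementation. The `size_increase` helper is hypothetical: it assumes a skipped block is stored as 32 raw 16-bit values (64 bytes instead of 34) and ignores whatever per-block flag would be needed to mark it.

```python
import struct

BLOCK = 32  # Q8_0 groups weights into blocks of 32


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0_block(block):
    """Quantize one block into (fp16 scale, list of ints in [-127, 127])."""
    amax = max(abs(v) for v in block)
    d = to_f16(amax / 127.0) if amax > 0 else 0.0
    q = [max(-127, min(127, round(v / d))) if d else 0 for v in block]
    return d, q


def dequantize_q8_0_block(d, q):
    """Reconstruct approximate weights from the scale and the integers."""
    return [d * qi for qi in q]


def max_abs_error(block):
    """Largest reconstruction error inside one block."""
    d, q = quantize_q8_0_block(block)
    return max(abs(a - b) for a, b in zip(block, dequantize_q8_0_block(d, q)))


def size_increase(frac_skipped, quant_bytes=34, raw_bytes=64):
    """Relative size growth if a fraction of blocks stays at 16 bits (assumption)."""
    return frac_skipped * (raw_bytes - quant_bytes) / quant_bytes


# A well-behaved block, and the same block with one large outlier.
normal = [0.01 * i for i in range(-16, 16)]
outlier = list(normal)
outlier[0] = 8.0

err_normal = max_abs_error(normal)
err_outlier = max_abs_error(outlier)
```

The single outlier stretches the shared scale, so every other value in that block is rounded far more coarsely. Under the stated storage assumption, leaving roughly 1.1% of blocks unquantized costs about 1% in size.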

Reply

fragment_me@reddit (OP)

I appreciate the positive energy! I tried but ran into some roadblocks, which I believe were related to the code paths that are highly optimized for certain values. On paper this idea seems so simple and could provide some good quality for low quant models. Ultimately, I think I will need to not vibe it so I can fully understand the issues.

Reply

I trusted random person on this subreddit and bought 3080 20gb made of chinesium

Posted by SwimmerJazzlike@reddit | LocalLLaMA | View on Reddit | 243 comments

[-]

fragment_me@reddit

Something-something ask and you shall receive [https://nvidia.custhelp.com/app/answers/detail/a\_id/5165/\~/nvidia-resizable-bar-firmware-update-tool](https://nvidia.custhelp.com/app/answers/detail/a_id/5165/~/nvidia-resizable-bar-firmware-update-tool)

Reply

fragment_me@reddit

You can upgrade the firmware for rebar through an nvidia utility that works in linux and windows. I just passed the GPUs to a windows VM since it was easier and then moved them back to the ubuntu VM. The hardware says rebar support but I had issues with enabling it in Ubuntu since my actual host server doesn't support it. I have a newer server I'll be moving the cards into later this week to fully test it.

Reply

fragment_me@reddit

Huh? I never posted a picture of a desk

Reply

fragment_me@reddit

Hmm I thought I responded but can't find it. Anyway I'm the redditor you trusted! I just purchased 2 more to bring me up to 120GB vram \*salute\* [https://ebay.us/iAXbPQ](https://ebay.us/iAXbPQ)

Reply

fragment_me@reddit

Glad you listened to me, the internet stranger. I too want to buy more.

Reply

Family member just passed away this morning , need a distraction. Any good 1b models you can suggest for layla ??

Posted by Opening-Ad6258@reddit | LocalLLaMA | View on Reddit | 14 comments

[-]

fragment_me@reddit

What do you mean didn't work? They work

Reply

fragment_me@reddit

Try the free models on OpenRouter and let us know how it goes, and sorry for your loss.

Reply

Has anyone experimented with stabilizing low quant models with lower temp and top p?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 12 comments

[-]

fragment_me@reddit (OP)

Are you okay? Is everything alright at home?

Reply

fragment_me@reddit (OP)

I was thinking about trying the BF16 version with a seed and then trying the quantized version with the same seed and prompt to see how much it differs. I think after several tests you could find some optimal parameters. It may help to see some same top p benchmarks between the native and quant to understand how much it diverges.
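A minimal sketch of how that comparison could be scored, assuming you already have the two outputs as token ID lists from the same seed and prompt. The function names are hypothetical and nothing here calls an inference engine.

```python
def first_divergence(ref_tokens, test_tokens):
    """Index of the first differing position, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    if len(ref_tokens) != len(test_tokens):
        # One sequence is a strict prefix of the other.
        return min(len(ref_tokens), len(test_tokens))
    return None


def match_rate(ref_tokens, test_tokens):
    """Fraction of positions holding the same token, relative to the longer sequence."""
    longest = max(len(ref_tokens), len(test_tokens))
    if longest == 0:
        return 1.0
    same = sum(a == b for a, b in zip(ref_tokens, test_tokens))
    return same / longest


# Hypothetical token IDs standing in for the BF16 and quantized outputs.
bf16_out = [11, 42, 7, 99, 3]
quant_out = [11, 42, 8, 99, 3]
```

Once the two runs diverge, each model is conditioning on different text, so the position of the first divergence is probably the more meaningful number; the match rate after that point is only a rough signal.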

Reply

fragment_me@reddit (OP)

I never use these for being creative anyway but I can see how that could be detrimental to it.

Reply

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 24 comments

[-]

fragment_me@reddit (OP)

**Update 1: Tested BF16 Qwen3.6 27B and saw that, with the same questions, it was thinking only 2k fewer tokens than Q8\_0 and UD Q8 K XL.**

Reply

fragment_me@reddit (OP)

I was wrong! I used the wrong values for BF16. It's thinking similarly, about 2k fewer tokens than Q8\_0 and UD Q8 K XL.

Reply

fragment_me@reddit (OP)

That's a good point. It makes sense that BF16 would have slower PP since it's compute bound. I guess llama-cpp just doesn't focus much on BF16 because it's less adopted. That's unfortunate, because it's the native format for a lot of the great local models. Also, I finally have enough VRAM to run Qwen 3.6 27B BF16, and I tested the PP speeds, and it's unusable for me. It's \~260 pp tok/s with 2x 3090 and 2x 3080. Token gen is \~41 tok/s with MTP set to 4. It's unfortunate that llama-cpp does better with MTP but worse with PP, whereas vLLM is the opposite. So basically I'll stick to vLLM since I now have 4 GPUs. I was also able to confirm that it produced quite a bit fewer reasoning tokens, so it seems that this quant is working very well. I was worried that the reduced thinking was due to it being lobotomized, but that doesn't seem to be the case. More thinking seems to trend with lower/worse quants.

Reply

fragment_me@reddit (OP)

I believe it's just due to Unsloth quanting attn\_qkv to Q8\_0 while this keeps it at BF16. So this quant might be bigger. Try a few prompts with the same seed value and let me know how it goes.

Reply

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Posted by FantasticNature7590@reddit | LocalLLaMA | View on Reddit | 24 comments

[-]

fragment_me@reddit

The text is white. Just put it in Google Sheets or Microsoft Excel. Not everything needs to be formatted by AI.

Reply

8GB 2017 MacBook Air breaks record with Quantum Processor help on tuning a 30B Qwen MoE model - Quantum 15,489% boost!

Posted by Overall-Importance54@reddit | LocalLLaMA | View on Reddit | 57 comments

[-]

fragment_me@reddit

Uhhhh some of these parameters are not really great. Like reducing the number of experts.

Reply

Qwen3.6-27B Quantization Benchmark

Posted by bobaburger@reddit | LocalLLaMA | View on Reddit | 73 comments

[-]

fragment_me@reddit

What are the noise margins on these? Some of them don't exactly make sense, e.g. lower KLD but worse top P.
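For what it's worth, the two metrics can disagree even without noise. The sketch below assumes "top P" here means the share of positions where the quant picks the same top token as the reference. The three-token distributions are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def same_top_token(p, q):
    """True if both distributions rank the same token first."""
    top_p = max(range(len(p)), key=p.__getitem__)
    top_q = max(range(len(q)), key=q.__getitem__)
    return top_p == top_q


# Hypothetical next-token distributions at a single position.
reference = [0.50, 0.49, 0.01]   # near tie between the first two tokens
quant_a = [0.49, 0.50, 0.01]     # tiny shift, but the top token flips
quant_b = [0.80, 0.19, 0.01]     # large shift, top token unchanged

kld_a = kl_divergence(reference, quant_a)
kld_b = kl_divergence(reference, quant_b)
```

Here `quant_a` has a far lower KLD than `quant_b` yet loses the top-token match, so a quant can look better on one metric and worse on the other whenever the reference has many near-tie positions.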

Reply

Info: Nvidia Cuda 13.3 landed

Posted by parrot42@reddit | LocalLLaMA | View on Reddit | 47 comments

[-]

fragment_me@reddit

Isn't that what I said?

Reply

fragment_me@reddit

Sweet rig but I don't think 5090 is Blackwell Ultra. I think it's just Blackwell.

Reply

SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More

Posted by CuriousPlatypus1881@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

fragment_me@reddit

I am disappointed to see local models missing from this. We already know Gemini, ChatGPT, Claude, and DeepSeek are good. We want to know how well local models do because many of them seem to be benchmaxed, and it's hard to discern their true level.

Reply

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

fragment_me@reddit (OP)

Probably.

Reply

fragment_me@reddit (OP)

I lower them to 265 watts and I don't notice a difference in pp t/s or tg t/s.

Reply

NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable)

Posted by Gailenstorm@reddit | LocalLLaMA | View on Reddit | 44 comments

[-]

fragment_me@reddit

Very cool

Reply

Next year we're getting 0.5T model from Grok

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 200 comments

[-]

fragment_me@reddit

He also said he’d have a man on the moon already

Reply

Tinygrad: Hacked 4090 driver to enable P2P

Posted by mrdevlar@reddit | LocalLLaMA | View on Reddit | 17 comments

[-]

fragment_me@reddit

Man you were way wrong, lol.

Reply

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

fragment_me@reddit (OP)

Just double checked and they do have nvlink but I don’t have the bridge

Reply

AMD BC-250 and the search for Cheap Compute

Posted by dugganmania@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

fragment_me@reddit

You are such a dog, I love it.

Reply

I guess 4 units wasn’t enough.

Posted by Simple_Library_2700@reddit | LocalLLaMA | View on Reddit | 35 comments

[-]

fragment_me@reddit

Cut a hole in the case and just run a PCIE extender. That's what I do with my 2U server.

Reply

Heretic has been served a legal notice by Meta, Inc.

Posted by -p-e-w-@reddit | LocalLLaMA | View on Reddit | 349 comments

[-]

fragment_me@reddit

*Salutes aggressively*

Reply

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

fragment_me@reddit (OP)

This is not the same one I got. Mine has a sleek black case with one blower-style fan, made more for a server chassis, although it's not very loud. Internally it's probably very similar (in terms of the new VRAM chips).

Reply

fragment_me@reddit (OP)

I know, crazy. I bought my 3090 used like 4-5 years ago for $800. What a time.

Reply

fragment_me@reddit (OP)

I took a look and didn't see any that were cheap if you buy a single card. You'd have to buy 3+ to get any kind of deal.

Reply

fragment_me@reddit (OP)

I didn't see any NVLink when I set them up. TBH I wouldn't worry much about it unless you plan to do training. Inference is pretty fast over PCIE. It's even faster if you have a newer mobo that supports rebar, as you can download modified NVIDIA drivers to enable P2P communication between the cards.

Reply

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences?

Posted by Borkato@reddit | LocalLLaMA | View on Reddit | 93 comments

[-]

fragment_me@reddit

Gemma is very sensitive to model and KV cache quant. Qwen not so much. I think Qwen Q6 is great, but I personally use Q8 and above. I would never touch Gemma 4 without being Q8 or better.

Reply

Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

fragment_me@reddit (OP)

Surprisingly not high. The fan style does seem to be for servers, but I have them outside of my DELL R730 with a PCIE riser cable and I cut holes in the top of the case. I didn't notice them much. I definitely won't trust having them in the same room AND outside the case. If these were in a 2U server, they wouldn't be too loud.

Reply

fragment_me@reddit (OP)

I would give it a shot. In terms of per GB cost, it's roughly the same, but the 4080 is newer, will be supported longer, and may be a bit faster on PP. I went the ebay route just because it's easier.

Reply

I hope that someday we will have a 124B Gemma.

Posted by cgs019283@reddit | LocalLLaMA | View on Reddit | 77 comments

[-]

fragment_me@reddit

Are you on OpenWebUI? If so, you may need to set native tool calling as it's not a default. Also, there seems to be some bug in OpenWebUI where after 10+ tool calls it just stops. I've been able to repeat this using llama-server and vLLM as the backend. It's a shame vLLM doesn't bundle a webUI like llama-cpp does.

Reply

Developers who use local AI - Q4_0 vs Q8_0 KV quant?

Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 89 comments

[-]

fragment_me@reddit

If you're using F16 KV cache, you might as well try BF16 (if your hardware can handle it). See here for some KLD benchmarks for Qwen3.5: [https://techstat.net/qwen3-5-27b-q8-kv-cache-benchmarks-bf16-vs-f16-vs-q8\_0/](https://techstat.net/qwen3-5-27b-q8-kv-cache-benchmarks-bf16-vs-f16-vs-q8_0/)
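A minimal sketch of why the two 16-bit formats are not interchangeable: F16 spends more bits on precision, BF16 on range. It uses only the standard `struct` module. The helper names are hypothetical, and `round_to_bf16` is only meant for finite values that fit in float32.

```python
import struct


def round_to_f16(x):
    """Round to the nearest IEEE half-precision value (5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x):
    """Round to the nearest bfloat16 value (8 exponent, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_in_f16(x):
    """False when the value is too large for half precision."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


small = 1.001     # F16 keeps more of this value than BF16 does
large = 70000.0   # beyond the F16 maximum of 65504, fine for BF16
```

Near 1.0 the F16 result is closer to the input, but the large value cannot be stored in F16 at all while BF16 holds it approximately. That range gap is the usual argument for keeping a BF16-native model's cache in BF16.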

Reply

fragment_me@reddit

It depends on the model quite a bit, I'm learning from various benchmarks. Q8 *usually* is pretty good but degrades at long context. I wouldn't go lower. I personally stick to native KV cache quant now. For Qwen, that's actually BF16, not F16, which is the default in llama.cpp. If you really want to go lower, reduce V but keep K higher, e.g. K as BF16 and V as Q8\_0.
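A rough sketch of what the K/V split saves. The byte costs follow the ggml block layouts (Q8_0 is 34 bytes per 32 values, Q4_0 is 18 bytes per 32). The model dimensions below are hypothetical round numbers, not Qwen's real configuration, and the formula assumes a plain attention cache in every layer.

```python
# Average bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "bf16": 2.0,
    "q8_0": 34 / 32,  # 2-byte scale + 32 int8 values per block of 32
    "q4_0": 18 / 32,  # 2-byte scale + 16 bytes of 4-bit values per block of 32
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k, type_v):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    elements_per_token = n_layers * n_kv_heads * head_dim
    per_element = BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    return n_ctx * elements_per_token * per_element


# Hypothetical model shape, chosen so the numbers come out round.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)

GIB = 2**30
full_bf16 = kv_cache_bytes(**shape, type_k="bf16", type_v="bf16") / GIB
mixed = kv_cache_bytes(**shape, type_k="bf16", type_v="q8_0") / GIB
```

With these made-up dimensions the mixed setup needs about 3.06 GiB instead of 4 GiB. In llama.cpp the split is, as far as I know, selected with `--cache-type-k` and `--cache-type-v`.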

Reply

"Elias Thorne" is what eight different LLMs name a lighthouse keeper. He's also selling cancer treatment advice on Amazon

Posted by prescorn@reddit | LocalLLaMA | View on Reddit | 54 comments

[-]

fragment_me@reddit

That's comical, and just reinforces how much I hate reading anything LLM generated. Ironically, I end up seeing this text so much because one of my favorite quick and dirty benchmarks for throughput is to tell a model to write me a 2000 word short story. In fact, I just did it now because I made some MTP setting change. Sure enough, here's the title - **The Last Clockmaker of New Veridia.**

Reply

fragment_me@reddit

I've noticed this lighthouse keeper thing on Qwen so much it's annoying. Nice read.

Reply

MTP support merged into llama.cpp

Posted by tacticaltweaker@reddit | LocalLLaMA | View on Reddit | 108 comments

[-]

fragment_me@reddit

There are many benchmarks already out that show TQ4 performing just under Q4, and vLLM just released their benchmarks on it too. Any KV cache at 4 bits (whether TQ or Q) suffers as the context window grows. When TQ KLD performs better than Q8, then I will jump on the bandwagon. I really want it to succeed, but so far it hasn't. The only positive thing that came out of it is that the Hadamard rotation got attention and was implemented in llama main and IK llama.
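For anyone wondering what the rotation buys you, here is a minimal sketch of a normalized Walsh-Hadamard transform applied to one block with a single outlier. This is a textbook version for illustration, not the code in llama.cpp or ik_llama, and the input values are made up.

```python
import math


def hadamard_rotate(vec):
    """Normalized fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * scale for v in out]


# Hypothetical block: small values in [-0.1, 0.1] plus one large outlier.
block = [((i * 7) % 11 - 5) / 50 for i in range(32)]
block[5] = 10.0

rotated = hadamard_rotate(block)
restored = hadamard_rotate(rotated)  # the normalized transform is its own inverse

absmax_before = max(abs(v) for v in block)
absmax_after = max(abs(v) for v in rotated)
```

The outlier's energy gets spread across all 32 positions, so the largest magnitude drops sharply and an absmax-based scale no longer crushes the small values. Because the transform is orthogonal, applying it again recovers the original block.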

Reply

fragment_me@reddit

No

Reply

fragment_me@reddit

You are on some custom fork because turbo quant is not available in llama.cpp. I highly suggest you stick to Q8 KV cache if strapped for memory. Q4 is better than TQ4, yet 4-bit KV cache just doesn't maintain accuracy. Further, KV cache effects are more pronounced at a higher context window. You're better off switching to Q8 and halving the cache, or just switching to a Q5 model if you need more cache. Finally, you need the --spec-type parameter for MTP when you switch to the main repo and branch.
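The "Q8 and halve the cache" trade can be checked with quick arithmetic. This sketch assumes the standard ggml block layouts (a 2-byte scale per block of 32) and treats cache memory as proportional to bits per element times context length.

```python
def bits_per_element(block_bytes, block_size=32):
    """Average storage cost per cached value, including the per-block scale."""
    return block_bytes * 8 / block_size


Q4_0_BITS = bits_per_element(18)  # 2-byte scale + 16 bytes of 4-bit values
Q8_0_BITS = bits_per_element(34)  # 2-byte scale + 32 bytes of 8-bit values


def relative_cache_cost(bits, context_fraction):
    """Memory cost relative to a 1-bit cache at full context (arbitrary unit)."""
    return bits * context_fraction


q4_full_context = relative_cache_cost(Q4_0_BITS, 1.0)
q8_half_context = relative_cache_cost(Q8_0_BITS, 0.5)
```

Q4_0 works out to 4.5 bits per value and Q8_0 to 8.5, so Q8_0 at half the context uses slightly less memory than Q4_0 at full context.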

Reply

fragment_me@reddit

No

Reply

MTP PR Merged!!!

Posted by Valuable_Touch5670@reddit | LocalLLaMA | View on Reddit | 101 comments

[-]

fragment_me@reddit

One thing I noticed is I could drop each 3090 down to 200 Watts and still get a speed up in token generation compared to no MTP.

Reply

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 24 comments

[-]

fragment_me@reddit (OP)

The definition in this context is that the total number of tokens generated during reasoning is lower in every test I ran, as shown in the examples.

Reply