FiLo420blazeit

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 40 comments

FiLo420blazeit@reddit

Solid work. Does the checkpoint store an actual KV cache snapshot, or just enough metadata to find the longest reusable prefix on the next request? Curious how it handles mid-conversation edits vs pure tail-trimming, since opencode's rewrites aren't always append-only.
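To make the question concrete, here is a minimal sketch of the "longest reusable prefix" idea. It is not llama.cpp's implementation: the function names, the token-ID lists, and the idea of checkpoints stored as token positions are all assumptions for illustration. It shows why a mid-conversation edit costs more than tail-trimming: the reusable prefix ends at the first token that differs, so only checkpoints at or before that point can be restored.

```python
from typing import Sequence


def common_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Number of leading tokens shared by the cached and incoming prompts."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def best_checkpoint(checkpoint_positions: Sequence[int], reusable: int) -> int:
    """Latest (hypothetical) checkpoint position that lies inside the reusable prefix.

    Returns 0 when no checkpoint qualifies, meaning the prompt is reprocessed
    from the start.
    """
    eligible = [p for p in checkpoint_positions if p <= reusable]
    return max(eligible, default=0)


cached = [1, 2, 3, 4, 5, 6]      # tokens from the previous request
checkpoints = [2, 4, 6]          # assumed snapshot positions

trimmed = [1, 2, 3, 4, 5]        # tail-trim: last token dropped
edited = [1, 2, 3, 9, 5, 6, 7]   # mid-conversation edit at index 3

trim_restore = best_checkpoint(checkpoints, common_prefix_len(cached, trimmed))
edit_restore = best_checkpoint(checkpoints, common_prefix_len(cached, edited))
```

Under these assumptions the trimmed request can restore from position 4, while the edited one falls back to position 2 and has to reprocess everything after it.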

85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics

Posted by nathandreamfast@reddit | LocalLLaMA | View on Reddit | 64 comments

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Posted by 3VITAERC@reddit | LocalLLaMA | View on Reddit | 32 comments

FiLo420blazeit@reddit

Really useful breakdown, thanks for running this. The accept% column is doing all the work here: wherever it hits \~90% (the code prompts) you get a real speedup, and wherever it drops to \~40% (prose) MTP basically just adds overhead. That's expected behavior, but it's nice to see it this cleanly on actual hardware.

The spicy result is the 35B MoE on short story going *backwards* (0.81×). That's the worst case for speculative-style decoding: the base model is already fast (227 tok/s), so a low-acceptance draft can't earn back its own cost and you end up net negative. The dense model never goes below 1.0× because its baseline is slow enough that even a bad draft is roughly free.

Practical takeaway seems to be: enable MTP for structured/code-ish workloads, leave it off for creative/open-ended generation, especially on the MoE. Curious what your draft settings were (number of speculative tokens, any threshold on acceptance)? I wonder if tuning those pulls the prose numbers back above 1.0×, or if low accept% just kills it regardless.
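A rough way to see the trade-off is the standard back-of-the-envelope model for speculative decoding. This is a sketch under simplifying assumptions, not a description of llama.cpp's MTP path: each drafted token is accepted independently with probability `accept`, verifying a batch costs the same as one base forward pass, and each drafted token costs `draft_cost` base passes. The function names and every number below are illustrative, not measurements from the post, and the post's accept% column may be defined differently.

```python
def expected_tokens(accept: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens.

    Geometric series 1 + a + a^2 + ... + a^k: the verifier always yields at
    least one token, plus one more for each consecutive accepted draft token.
    """
    if accept >= 1.0:
        return float(k + 1)
    return (1.0 - accept ** (k + 1)) / (1.0 - accept)


def speedup(accept: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is the cost of drafting one token relative to one base forward
    pass (an assumed value; it is effectively larger when the base model is
    already fast, as with a sparse MoE).
    """
    return expected_tokens(accept, k) / (1.0 + k * draft_cost)


code_cheap_draft = speedup(accept=0.9, k=3, draft_cost=0.1)
prose_cheap_draft = speedup(accept=0.4, k=3, draft_cost=0.1)
prose_costly_draft = speedup(accept=0.4, k=3, draft_cost=0.4)
```

In this model, a low accept rate with a cheap draft still lands slightly above 1.0×, but the same accept rate with a relatively expensive draft drops below 1.0×, which matches the qualitative dense-versus-MoE pattern described above. It also suggests that lowering the number of drafted tokens `k` is the obvious knob to try for prose.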