fragment_me

In Q8_0 weight quantization, why can't we just skip blocks of 32 that have very large outliers?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 18 comments
Has anyone experimented with stabilizing low quant models with lower temp and top p?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 12 comments
Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 24 comments
Decent deal on RTX 3080 20GB on ebay - $30 per GB

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 39 comments
Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 34 comments
IK_LLAMA now supports Qwen3.5 MTP :O

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 37 comments
Has anyone here successfully extended Qwen3.5 or 3.6 context length past 260k?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 11 comments
Interesting new model scoring strong on SWE bench - Multilingual-Multimodal-NLP/IndustrialCoder

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 6 comments
Is there a way to prioritize llama-cpp VRAM allocations to maximize local LLM usage alongside other apps?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 3 comments
Are there any plugins or all-in-one solutions for TTS interfacing with other local models?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 1 comment
Info on performance (accuracy) when context window reaches a certain size?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 2 comments