RipperFox

how would you set up a local llm server for a business of 7 people?

Posted by snowieslilpikachu69@reddit | LocalLLaMA | View on Reddit | 60 comments

RipperFox@reddit

> Cuda 13.2 > llama.cpp

Are you aware of the problems with this combination, e.g. as mentioned here:

- https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/
- https://github.com/ggml-org/llama.cpp/issues/21255
- https://github.com/unslothai/unsloth/issues/4849
- https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/discussions/12
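A quick way to check a machine for the toolkit release flagged in those reports could look like the sketch below. It only parses text in the format that `nvcc --version` prints ("release X.Y"); the names `FLAGGED_RELEASES`, `parse_cuda_release` and `is_flagged` are made up for illustration, and the sample line is hypothetical.

```python
import re
from typing import Optional, Tuple

# Release flagged in the linked reports; adjust if those threads are updated.
FLAGGED_RELEASES = {(13, 2)}


def parse_cuda_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from text such as `nvcc --version` prints."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_flagged(nvcc_output: str) -> bool:
    """True if the toolkit release is one of the flagged ones."""
    return parse_cuda_release(nvcc_output) in FLAGGED_RELEASES


# Hypothetical sample line; the build suffix after "V" is made up.
sample = "Cuda compilation tools, release 13.2, V13.2.0"
print(parse_cuda_release(sample), is_flagged(sample))
```

Feeding it the real output of `nvcc --version` before building llama.cpp would be the obvious use; whether a given build is actually affected is something only the linked issues can answer.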

How do you start your Llama.cpp server?

Posted by Citadel_Employee@reddit | LocalLLaMA | View on Reddit | 40 comments

RipperFox@reddit

Interesting strategy, calling someone a nerd (which I am) in 2026 - ok ok, I'll touch some grass.. I bet I'm even older than you - Casser la croûte and have a nice day! :)

How do you start your Llama.cpp server?

Posted by Citadel_Employee@reddit | LocalLLaMA | View on Reddit | 40 comments

RipperFox@reddit

vLLM can be better on a single 5090 if you don't want to wait until llama.cpp catches up with its experimental forks - and your completely random (Vulkan, seriously?) docker example is clearly badly formatted AI slop that suggests running docker - who uses/needs docker for llama-server router mode anyway?

How do you start your Llama.cpp server?

Posted by Citadel_Employee@reddit | LocalLLaMA | View on Reddit | 40 comments

Hard freakin' decision..Blackwell 96G or Mac Studio 256G

Posted by HyPyke@reddit | LocalLLaMA | View on Reddit | 212 comments

RipperFox@reddit

Did you know that you can ask your local model that kind of question, too? E.g. agentscope-ai_CoPaw-Flash-9B comes up with:

### Model Size vs Inference Speed (At Fixed Long Context Length)

Assuming same 200k token context, similar optimization, and comparable hardware:

| Model | Parameters | Approx. Layers | Relative Gen. Speed* | Realistic Speed-Up vs Larger Model |
|----------|------------|----------------|----------------------|-------------------------------------|
| Small | ~7B | 32 | 1× (baseline) | ~1× faster than mid-sized |
| Mid | ~32B | 40 | ~0.5× | ~2× faster than 671B |
| Large | ~200B | 52 | ~0.28× | ~3.5× faster than 671B |
| **Huge** | **~671B** | **64+** | **~0.14×** | **Baseline slowest** |

\* *Relative generation speed based on FLOPs/layer scaling; not linear due to memory bandwidth limits.*

📌 **Note:** Actual speed depends heavily on batch size, quantization, KV-cache optimizations, and whether compute or memory-bound. These are order-of-magnitude estimates.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G

Posted by HyPyke@reddit | LocalLLaMA | View on Reddit | 212 comments

RipperFox@reddit

> Collapse/condense/compact your context.

Depends on use again - e.g. you'll never find that needle in the haystack this way - guess why modern models go >256k ctx.

Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6

Posted by dionysio211@reddit | LocalLLaMA | View on Reddit | 177 comments

RipperFox@reddit

Tbh Qwen 3.5 9b was great, Copaw-9B (Alibaba's official agentic finetune) was even better and amazingly fast. It was also one of the few multimodal models that was (sometimes) able to read an analog clock correctly..

Why are we actually sampling reasoning and output the same way?

Posted by ReporterWeary9721@reddit | LocalLLaMA | View on Reddit | 21 comments

Forgive my ignorance but how is a 27B model better than 397B?

Posted by No_Conversation9561@reddit | LocalLLaMA | View on Reddit | 286 comments

RipperFox@reddit

> as soon as it realizes that it doesn't have the answers.

Yep - that's a big problem even for leading models. Gemma 4 didn't even believe that it's already 2026, right? Many topics are "worthless" knowledge (like llama.cpp command line parameters - EVERY model f*cks this up without research anyway!). I think it's better to drop detail knowledge (like how to do 6502 ASM for the C64) but improve context and tool usage so that the model can look up what it needs efficiently, e.g. through web search, at least for smaller models.

Forgive my ignorance but how is a 27B model better than 397B?

Posted by No_Conversation9561@reddit | LocalLLaMA | View on Reddit | 286 comments

RipperFox@reddit

> away from creativity

I see this differently: Instead of training models to rely on huge factual knowledge (which can be outdated quickly anyway and compensated for by a simple web search), modern models seem to go for more of an "I know how to help myself and find the answers independently" approach. As long as you have the knowledge elsewhere to look up, all is fine, and I think that's the right direction..

Qwen 3.6 35B crushes Gemma 4 26B on my tests

Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 116 comments

DGX Spark just arrived — planning to run vLLM + local models, looking for advice

Posted by dalemusser@reddit | LocalLLaMA | View on Reddit | 90 comments

RipperFox@reddit

https://spark-arena.com/ even has their own runner: https://github.com/spark-arena/sparkrun, which can use multiple backends like vLLM, llama.cpp, SGLang..

Gemma 4 31B — 4bit is all you need

Posted by tolitius@reddit | LocalLLaMA | View on Reddit | 75 comments

RipperFox@reddit

Do a little experiment - use a fixed seed, run the 23 tests and note the results. Now change only the seed (but still keep it fixed for the whole run) - how much deviation in the test results would you expect from changing the seed alone? If the variation is high, you don't have enough data points, right?
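To get a feeling for how much spread the seed alone can produce, here is a rough sketch. It assumes each of the 23 tests passes independently with the same probability, which a real benchmark will not satisfy exactly; `true_pass_rate` and the function names are made up for illustration, and no real model is involved.

```python
import math
import random
import statistics


def binomial_std_error(pass_rate: float, n_tests: int) -> float:
    """Standard deviation of the observed pass rate over n_tests independent tests."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tests)


def simulate_scores(pass_rate: float, n_tests: int, seeds: range) -> list[float]:
    """Simulate one benchmark run per seed; each test passes with probability pass_rate."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)
        passed = sum(rng.random() < pass_rate for _ in range(n_tests))
        scores.append(passed / n_tests)
    return scores


true_pass_rate = 0.7  # hypothetical value, not taken from any real benchmark
n_tests = 23
scores = simulate_scores(true_pass_rate, n_tests, range(200))
print(f"expected std error: {binomial_std_error(true_pass_rate, n_tests):.3f}")
print(f"simulated std dev:  {statistics.pstdev(scores):.3f}")
print(f"min/max score:      {min(scores):.2f} / {max(scores):.2f}")
```

Under these assumptions the standard error is roughly 0.1, i.e. about two tests either way, so a difference of one or two passed tests between two models says very little.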

openrouter/elephant-alpha is 99% Chinese, likely Qwen 3 Nex

Posted by Winter_Put_6046@reddit | LocalLLaMA | View on Reddit | 5 comments

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

RipperFox@reddit

It always depends on many factors - Elephant failed to edit an 11 KB C++ file correctly and started looping forever. It also got stuck on GitHub commits - it just looped forever checking diffs. Gemma4 31b did that same edit no problem - but it has e.g. the problem that it finishes too early and stops halfway through the job. Elephant also has the problem that it just runs straight into the work and starts blasting (editing, etc.), even if you told it to "JUST LOOK AT IT!" - there are some serious problems to solve :) On the other hand it managed to write extremely complex stuff quickly - if it works. We'll see if they get most of the errors ironed out or not..

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

RipperFox@reddit

It's ofc not as smart as e.g. GLM5.1 and doesn't figure things out as quickly or with as little guidance, but the usual "no, do it THIS way" is enough. It's smarter and codes better than e.g. Gemma4-31b or Qwen3-Coder-Next, and maybe a tad better than Qwen3.5 122b (I was only watching it on C++, gh) - this model likely isn't as universal as Qwen, however. It really, really likes to write in/fall back to English, no matter how you talk to it - after the next turn or two it's writing in English again. A bit strange..

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 203 comments

RipperFox@reddit

Word of Warning:

- OP never even saw the [BF16 weights](https://www.reddit.com/r/LocalLLaMA/comments/1sk2s46/gemma_4_has_a_systemic_attention_failure_heres/ofxqx1a/), nor did he even know about "convert_hf_to_gguf.py".
- OP never "really" [tests](https://www.reddit.com/r/LocalLLaMA/comments/1sk2s46/gemma_4_has_a_systemic_attention_failure_heres/ofwzoaq/) the models (LiveCodeBench & SWE, HLE) - he only does some (flawed) statistical analysis and has the hypothesis that [his modifications](https://www.reddit.com/r/LocalLLaMA/comments/1sk2s46/gemma_4_has_a_systemic_attention_failure_heres/) would "improve" the model somehow - without ANY TESTING thereafter..

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

RipperFox@reddit

Played an hour with it - it's walking, or rather flying, through the llama.cpp source quite effectively and is able to write/implement quite complex functions on its own. Tool calls/edits were sometimes botched, but it quickly restored the corrupted files and then did the edits correctly. I like its style of coding/testing/benchmarking so far. Would be nice if this is an open model we all can run..

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

RipperFox@reddit

You seem to have no clue.. you also "optimized" [Qwen 3.5 35B](https://huggingface.co/LuffyTheFox/FernflowerAI-35B-A3B-KL-ReLU-GGUF/blob/main/extra_info.md) (a MoE model) because you found two "broken" tensors:

> "Found which tensors drifted (by comparing each tensor to its peer group - tensors with the same shape"

"shape" - you mean scale? "peer group" - what? Peers = tensors with similar use frequency and function - how would you identify them without routing data? MoE experts can differ drastically in weight - you likely "optimized" them away and made your model dumber - you didn't measure sh*t (like real MMLU, LiveBench, etc.) but just ran your stupid statistic tool over it.. Waste of time..

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

RipperFox@reddit

> Trees grow differently. Models don't.

Really?: https://www.reddit.com/r/LocalLLaMA/comments/1sivm24/heres_how_my_llms_decoder_block_changed_while/

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

RipperFox@reddit

> KL 2-10x above what I've seen in every healthy model I checked

HOW did you check is the question! You wrote you don't even have the resources to test BF16, so let me guess: You're only playing with already quantized models. Heck, you didn't know "convert-hf-to-gguf.py" or you wouldn't have asked how to get a BF16 GGUF.

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

RipperFox@reddit

> Healthy models keep this distribution within a certain shape. Think of it like a fingerprint.

I think that's at least questionable. It's like saying "healthy trees always have this exact shape" - the form is shaped during growth and through the environment. Minimal variations at the beginning can lead to drastically different shapes in the end, but the tree/model will still be fine - like a tree that grew through a fence..

Your approach of comparing tensors with their peers and generating "some KL divergence chart" is at least unorthodox - usually you would compare divergence between original FP values and the quantized version (there you want to minimize distribution entropy) - but between tensors? What is the point? What do you think you get from this - on what grounds/paper? I'm curious..
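For comparison, the "usual" direction described here, divergence between original FP values and their quantized version, can be sketched in a few lines. This is only a toy under loud assumptions: the weights are synthetic Gaussian numbers standing in for one tensor, `fake_quantize` is plain symmetric round-to-nearest and not any of llama.cpp's actual quantization formats, and all function names are made up for illustration.

```python
import math
import random


def histogram(values, lo, hi, bins):
    """Normalized histogram with a small floor so no bin has zero probability."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    eps = 1e-9
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def fake_quantize(values, bits):
    """Symmetric round-to-nearest quantization, then dequantization (toy scheme)."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(20000)]  # stand-in for one tensor
lo, hi = min(weights), max(weights)
p = histogram(weights, lo, hi, bins=64)

kl_by_bits = {}
for bits in (8, 4):
    q = histogram(fake_quantize(weights, bits), lo, hi, bins=64)
    kl_by_bits[bits] = kl_divergence(p, q)
    print(f"{bits}-bit: KL(original || quantized) = {kl_by_bits[bits]:.4f}")
```

The point of the sketch is only the reference: the quantized tensor is compared against its own original, not against other tensors of the same shape.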

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

RipperFox@reddit

Do you also look at a tree in nature and count the branches on the left and right, just to conclude it's "unbalanced" because the number of branches differs?

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

RipperFox@reddit

If your LLM can't even figure out how to use "convert-hf-to-gguf.py", how did it come up with "Gemma 4 has a systemic attention failure. Here's the proof."?

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

Shipped local LLM-powered SQL generation in a desktop app - Qwen2.5-Coder, fully on-device, with auto self-healing

Posted by Pitiful_Comedian_834@reddit | LocalLLaMA | View on Reddit | 4 comments

RipperFox@reddit

Who needs an extra "SQL Workbench" when you can use hermes, *code, etc. with a local server like llama.cpp (ollama, LM Studio, etc.)?

It looks like there are no plans for smaller GLM models

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 128 comments

Where is MiniMax M2.7?

Posted by lolwutdo@reddit | LocalLLaMA | View on Reddit | 18 comments

GLM 5.1 tops the code arena rankings for open models

Posted by Auralore@reddit | LocalLLaMA | View on Reddit | 146 comments

RipperFox@reddit

I've been experimenting with GLM 5.1 for the last three days, using hermes agent to add features to a llama.cpp fork. I've found it to be incredibly token-efficient in comparison to e.g. Qwen 3.6.

Gemma 4, llama.cpp, tool calls, and tool results - ChatGPT fixed it for me

Posted by TheProgrammer-231@reddit | LocalLLaMA | View on Reddit | 48 comments

Choice for agentic LLM or help optimize Qwen3.5-35B-A3B for 24GB VRAM

Posted by marivesel@reddit | LocalLLaMA | View on Reddit | 24 comments

Research: how do you handle persistent context/memory with local models?

Posted by Mammoth_Resolve4418@reddit | LocalLLaMA | View on Reddit | 2 comments

Get 30K more context using Q8 mmproj with Gemma 4

Posted by Sadman782@reddit | LocalLLaMA | View on Reddit | 18 comments

Harmonic-9B - Two-stage Qwen3.5-9B fine-tune (Stage 2 still training)

Posted by Crampappydime@reddit | LocalLLaMA | View on Reddit | 7 comments

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months

Posted by Secure_Archer_1529@reddit | LocalLLaMA | View on Reddit | 196 comments

What should a new SysAdmin know first?

Posted by drake90001@reddit | sysadmin | View on Reddit | 65 comments

Llama.cpp developers right now

Posted by ML-Future@reddit | LocalLLaMA | View on Reddit | 99 comments

RipperFox@reddit

No work on this in llama.cpp yet? *g* https://www.reddit.com/r/HowToAIAgent/comments/1rvmad3/how_to_get_125_more_performance_from_the_same_gpu/ (https://github.com/MoonshotAI/Attention-Residuals)

Copaw-9B (Qwen3.5 9b, alibaba official agentic finetune) is out

Posted by kironlau@reddit | LocalLLaMA | View on Reddit | 64 comments

RipperFox@reddit

> What role does this data play in quantization?

Subject: Q4 vs. IQ4 Quantization for LLMs – Key Differences & How "I" Works

English Version:

At its core, the difference between Q4 and IQ4 quantization lies in how they handle information loss during compression.

Q4 (4-bit Quantization): This is a straightforward reduction of model weights to 4 bits per weight. It aggressively compresses the model but often leads to noticeable accuracy drops because fine-grained numerical details are discarded.

IQ4 (Information-Preserving Q4): This is a hybrid approach. It still uses 4-bit quantization but incorporates techniques to protect critical information. The "I" stands for "Information" and refers to an additional training phase where the quantized model is fine-tuned using synthetically generated training prompts (e.g., via curriculum learning or a “staircase” method). These prompts are designed to reinforce essential patterns in the quantized weight space, mitigating the quality loss from raw quantization.

In short: Q4 compresses blindly; IQ4 compresses intelligently by retraining with targeted examples to preserve performance.

This answer was generated using agentscope-ai_CoPaw-Flash-9B-Q5_K_M.

Versão em Português:

A diferença fundamental entre Q4 e IQ4 na quantização de LLMs está no manejo da perda de informação durante a compressão.

Q4 (Quantisização de 4 bits): É uma redução direta dos pesos do modelo para 4 bits por peso. Comprime agressivamente, mas frequentemente causa quedas significativas de precisão, pois detalhes numéricos finos são descartados.

IQ4 (Information-Preserving Q4): É uma abordagem híbrida. Mantém a quantização em 4 bits, mas incorpora técnicas para proteger informações críticas. O "I" significa "Informação" e refere-se a uma fase adicional de treinamento onde o modelo quantizado é ajustado usando prompts de treino sintéticos gerados (por exemplo, via aprendizado curricular ou método “escalão”). Esses prompts visam reforçar padrões essenciais no espaço de pesos quantizados, compensando a perda de qualidade da quantização bruta.

Em resumo: Q4 comprime cegamente; IQ4 comprime inteligentemente, retraindo com exemplos direcionados para preservar o desempenho.

Esta resposta foi gerada usando agentscope-ai_CoPaw-Flash-9B-Q5_K_M.

---

I think it's somewhat impressive that you can ask the model to explain, translate and format for reddit :) Generation was ~66 tps on an RTX 3080 10GB.

Copaw-9B (Qwen3.5 9b, alibaba official agentic finetune) is out

Posted by kironlau@reddit | LocalLLaMA | View on Reddit | 64 comments

I'm building a benchmark comparing models for an agentic task. Are there any small models I should be testing that I haven't?

Posted by nickl@reddit | LocalLLaMA | View on Reddit | 40 comments

RipperFox@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1s4l4x4/update_on_general_reasoning_for_local_16gb_m4/

Tried this model but tool calling was abysmal - I couldn't get it to work correctly :(

I'm building a benchmark comparing models for an agentic task. Are there any small models I should be testing that I haven't?

Posted by nickl@reddit | LocalLLaMA | View on Reddit | 40 comments

The Hardest Thing: Building and Running the UNIX Kernel from Original Sources

Posted by MatchingTurret@reddit | linux | View on Reddit | 34 comments

Customer doesn't understand the difference between a HDD and a SSD

Posted by FreaksLP3000@reddit | talesfromtechsupport | View on Reddit | 30 comments

RipperFox@reddit

OP, you didn't get a chance to remote into that server and let the customer actually *show* you their problem? Or to visit them on site? How long have you been doing tech support?