RipperFox

how would you set up a local llm server for a business of 7 people?

Posted by snowieslilpikachu69@reddit | LocalLLaMA | View on Reddit | 60 comments

[-]

RipperFox@reddit

> Cuda 13.2 > llama.cpp

Are you aware of the problems with this combination, e.g. as mentioned here:

https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/
https://github.com/ggml-org/llama.cpp/issues/21255
https://github.com/unslothai/unsloth/issues/4849
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/discussions/12

How do you start your Llama.cpp server?

Posted by Citadel_Employee@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

RipperFox@reddit

Interesting strategy, calling someone a nerd (which I am) in 2026 - ok ok, I'll touch some grass.. I bet I'm even older than you - Casser la croûte and have a nice day! :)

How do you start your Llama.cpp server?

Posted by Citadel_Employee@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

RipperFox@reddit

vLLM can be better on a single 5090 if you don't want to wait until llama.cpp catches up with its experimental forks - and your completely random (Vulkan, seriously?) docker example is clearly badly formatted AI slop suggesting running docker - who uses/needs docker for llama-server router mode anyway?

How do you start your Llama.cpp server?

Posted by Citadel_Employee@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

RipperFox@reddit

Ofc router mode is "nice" - but llama-swap can even switch to vLLM, SGLang, etc. Your AI generated example sucks btw.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G

Posted by HyPyke@reddit | LocalLLaMA | View on Reddit | 212 comments

[-]

RipperFox@reddit

Did you know that you can ask your local model these kinds of questions, too? E.g. agentscope-ai_CoPaw-Flash-9B comes up with:

### Model Size vs Inference Speed (At Fixed Long Context Length)

Assuming same 200k token context, similar optimization, and comparable hardware:

| Model | Parameters | Approx. Layers | Relative Gen. Speed* | Realistic Speed-Up vs Larger Model |
|----------|------------|----------------|----------------------|-------------------------------------|
| Small | ~7B | 32 | 1× (baseline) | ~1× faster than mid-sized |
| Mid | ~32B | 40 | ~0.5× | ~2× faster than 671B |
| Large | ~200B | 52 | ~0.28× | ~3.5× faster than 671B |
| **Huge** | **~671B** | **64+** | **~0.14×** | **Baseline slowest** |

\* *Relative generation speed based on FLOPs/layer scaling; not linear due to memory bandwidth limits.*

📌 **Note:** Actual speed depends heavily on batch size, quantization, KV-cache optimizations, and whether compute or memory-bound. These are order-of-magnitude estimates.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G

Posted by HyPyke@reddit | LocalLLaMA | View on Reddit | 212 comments

[-]

RipperFox@reddit

> Collapse/condense/compact your context.

Depends on use again - e.g. you'll never find that needle in the haystack this way - guess why modern models go >256k ctx.

Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6

Posted by dionysio211@reddit | LocalLLaMA | View on Reddit | 177 comments

[-]

RipperFox@reddit

Tbh Qwen 3.5 9b was great, Copaw-9B (alibaba's official agentic finetune) was even better and amazingly fast. It was also one of the few multimodal models that was (sometimes) able to read an analog clock correctly..

Why are we actually sampling reasoning and output the same way?

Posted by ReporterWeary9721@reddit | LocalLLaMA | View on Reddit | 21 comments

[-]

RipperFox@reddit

If people can test e.g. their Tasers on themselves - AI should do so, too!

Forgive my ignorance but how is a 27B model better than 397B?

Posted by No_Conversation9561@reddit | LocalLLaMA | View on Reddit | 286 comments

[-]

RipperFox@reddit

> as soon as it realizes that it doesn't have the answers.

Yep - that's a big problem even for leading models. Gemma 4 didn't even believe that it's already 2026, right? Many topics are "worthless" knowledge (like llama.cpp command line parameters - EVERY model f*cks this up without research anyway!). I think it's better to drop detail knowledge (like how to do 6502 ASM for the C64) but improve context and tool usage so that the model can look up what it needs efficiently, e.g. through web search, at least for smaller models.

Forgive my ignorance but how is a 27B model better than 397B?

Posted by No_Conversation9561@reddit | LocalLLaMA | View on Reddit | 286 comments

[-]

RipperFox@reddit

> away from creativity

I see this differently: Instead of training models to rely on huge fact knowledge (which can be outdated quickly anyway and compensated for by a simple web search), modern models seem to go for more of an "I know how to help myself and find the answers independently"-like approach. As long as you have the knowledge elsewhere to look up, all is fine, and I think that's the right direction..

Qwen 3.6 35B crushes Gemma 4 26B on my tests

Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

RipperFox@reddit

You likely need 2-3 more runs to validate..

DGX Spark just arrived — planning to run vLLM + local models, looking for advice

Posted by dalemusser@reddit | LocalLLaMA | View on Reddit | 90 comments

[-]

RipperFox@reddit

https://spark-arena.com/ even has their own runner: https://github.com/spark-arena/sparkrun which can use multiple backends like vLLM, llama.cpp, SGLang..

Gemma 4 31B — 4bit is all you need

Posted by tolitius@reddit | LocalLLaMA | View on Reddit | 75 comments

[-]

RipperFox@reddit

Do a little experiment - use a fixed seed and do the 23 tests - note the results. Now only change the seed (but still keep it fixed for the whole run) - how much deviation in test results would you expect only from changing the seed? If the variation is high, you don't have enough data points, right?
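The seed experiment can be sketched roughly like this. It is only an illustration: `run_suite` is a hypothetical stand-in that simulates 23 pass/fail tests with an assumed 70% pass probability, where a real run would call the model with the given fixed seed.

```python
import random
import statistics

def run_suite(seed: int, n_tests: int = 23, pass_prob: float = 0.7) -> int:
    """Hypothetical stand-in for one benchmark run with a fixed seed.

    Each simulated test passes with probability `pass_prob` (an assumption);
    a real implementation would query the model instead.
    """
    rng = random.Random(seed)
    return sum(rng.random() < pass_prob for _ in range(n_tests))

def seed_spread(seeds, n_tests: int = 23):
    """Return (mean, sample standard deviation) of passed tests across seeds."""
    scores = [run_suite(s, n_tests) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

if __name__ == "__main__":
    mean, sd = seed_spread(range(10))
    print(f"passed: mean={mean:.1f} of 23, stdev across seeds={sd:.2f}")
```

If the spread across seeds is of the same order as the difference between two quants or models, 23 tests cannot separate them.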

openrouter/elephant-alpha is 99% Chinese, likely Qwen 3 Nex

Posted by Winter_Put_6046@reddit | LocalLLaMA | View on Reddit | 5 comments

[-]

RipperFox@reddit

Try the exact same prompt with other languages - e.g. tell it to use German papers and write in Spanish?

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

RipperFox@reddit

It always depends on many factors - Elephant failed to edit an 11 kB C++ file correctly and started looping forever. It also got stuck on GitHub commits - just looped forever checking diffs. Gemma4 31b did that same edit no problem - but it has e.g. the problem that it finishes too early and stops halfway through the job. Elephant also has the problem that it just runs straight into the work and starts blasting (editing, etc.), even if you told it to "JUST LOOK AT IT!" - there are some serious problems to solve :) On the other hand it managed to write extremely complex stuff quickly - if it works. We'll see if they get most of the errors ironed out or not..

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

RipperFox@reddit

Maybe it works with an integrated draft model?

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

RipperFox@reddit

It's ofc not as smart as e.g. GLM5.1 and doesn't figure things out as quickly or with as little guidance, but the usual "no, do it THIS way" is enough. It's smarter and codes better than e.g. Gemma4-31b and Qwen3-Coder-Next, and maybe a tad better than Qwen3.5 122b (I was only watching it on C++, gh) - this model likely isn't as universal as Qwen, however. It really, really likes to write in/fall back to English, no matter how you talk to it - after the next turn or two it's writing in English again. A bit strange..

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

RipperFox@reddit

It understands non-English fine but very strongly prefers to fall back to writing in English.

Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 203 comments

[-]

RipperFox@reddit

Word of warning:

- OP never even saw the [BF16 weights](https://www.reddit.com/r/LocalLLaMA/comments/1sk2s46/gemma_4_has_a_systemic_attention_failure_heres/ofxqx1a/), nor did he even know about "convert_hf_to_gguf.py".
- OP never "really" [tests](https://www.reddit.com/r/LocalLLaMA/comments/1sk2s46/gemma_4_has_a_systemic_attention_failure_heres/ofwzoaq/) the models (LiveCodeBench & SWE, HLE) - he only does some (flawed) statistical analysis and has the hypothesis that [his modifications](https://www.reddit.com/r/LocalLLaMA/comments/1sk2s46/gemma_4_has_a_systemic_attention_failure_heres/) would "improve" the model somehow - without ANY TESTING thereafter..

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

Don't listen to OP - let Claude/GPT or your local Qwen/Gemma explain why his method is flawed..

What Is Elephant-Alpha ???

Posted by One_Title_3656@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

RipperFox@reddit

Played an hour with it - it's walking, or rather flying, through the llama.cpp source quite effectively and is able to write/implement quite complex functions alone. Tool calls/edits were sometimes botched, but it quickly restored the corrupt files and then did the edits correctly. I like its style of coding/testing/benchmarking so far. Would be nice if this is an open model we all can run..

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

You seem to have no clue.. you also "optimized" [Qwen 3.5 35B](https://huggingface.co/LuffyTheFox/FernflowerAI-35B-A3B-KL-ReLU-GGUF/blob/main/extra_info.md) (a MoE model) because you found two "broken" tensors:

> "Found which tensors drifted (by comparing each tensor to its peer group - tensors with the same shape"

"shape" - you mean scale? "peer group" - what? Peers = tensors with similar use frequency and function - how would you identify them without routing data? MoE experts can differ drastically in weight - you likely "optimized" them away and made your model dumber - you didn't measure sh*t (like real MMLU, LiveBench, etc.) but just ran your stupid statistic tool over it.. Waste of time..

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

> Trees grow differently. Models don't.

Really?: https://www.reddit.com/r/LocalLLaMA/comments/1sivm24/heres_how_my_llms_decoder_block_changed_while/

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

> KL 2-10x above what I've seen in every healthy model I checked

HOW did you check is the question! You wrote you don't even have the resources to test BF16, so let me guess: You're only playing with already quantized models. Heck, you didn't know "convert_hf_to_gguf.py" or you wouldn't have asked how to get a BF16 GGUF.
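For reference, a minimal sketch of how a BF16 GGUF is usually produced with the conversion script from the llama.cpp repository (`convert_hf_to_gguf.py`, with its `--outtype` and `--outfile` options). The model directory and output file name below are hypothetical placeholders, and the command is only assembled and printed, not run:

```python
import shlex

def bf16_convert_command(model_dir: str, outfile: str) -> list[str]:
    """Build the argv for converting a Hugging Face checkpoint to a BF16 GGUF.

    Assumes the command is run from a llama.cpp checkout, where
    convert_hf_to_gguf.py lives in the repository root.
    """
    return [
        "python", "convert_hf_to_gguf.py", model_dir,
        "--outtype", "bf16",
        "--outfile", outfile,
    ]

# Hypothetical paths, for illustration only.
cmd = bf16_convert_command("./my-model-hf", "my-model-bf16.gguf")
print(shlex.join(cmd))
```

The conversion itself needs the original checkpoint on disk, but it does not need enough VRAM to run the model.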

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

> Healthy models keep this distribution within a certain shape. Think of it like a fingerprint.

I think that's at least questionable. It's like saying "healthy trees always have this exact shape" - the form is shaped in growth and through the environment. Minimal variations at the beginning can lead to drastically different shapes in the end, but the tree/model will still be fine - like a tree that grew through a fence.. Your approach of comparing tensors with their peers and generating "some KL divergence chart" is at least unorthodox - usually you would compare the divergence between the original FP values and the quantized version (there you want to minimize the divergence) - but between tensors? What is the point? What do you think you get from this - on what grounds/paper? I'm curious..
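To illustrate the "original vs. quantized" comparison, here is a toy sketch under loud assumptions: the weights are synthetic Gaussian values, the quantizer is plain symmetric round-to-nearest with a single scale (not any real GGUF quant type), and KL is taken over 15-bin histograms of the weights. In practice the comparison is often made on the model's output token distributions instead.

```python
import math
import random

def quantize(weights, bits):
    """Symmetric round-to-nearest quantization with one scale for all weights."""
    levels = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def histogram(values, lo, hi, bins=15):
    """Normalized histogram; a tiny floor keeps every bin non-zero."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    eps = 1e-9
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

if __name__ == "__main__":
    rng = random.Random(0)
    original = [rng.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic weights
    limit = max(abs(w) for w in original)
    p = histogram(original, -limit, limit)
    for bits in (8, 4):
        q = histogram(quantize(original, bits), -limit, limit)
        print(f"{bits}-bit: KL(original || quantized) = {kl_divergence(p, q):.6f}")
```

The point of the sketch is the reference: each quantized distribution is measured against its own unquantized original, not against other tensors.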

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

Do you also look at a tree in nature and count the branches on the left and right, just to conclude it's "unbalanced" because the branch counts differ?

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

If your LLM can't even figure out how to use "convert_hf_to_gguf.py", how did it come up with "Gemma 4 has a systemic attention failure. Here's the proof."?

Gemma 4 has a systemic attention failure. Here's the proof.

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 150 comments

[-]

RipperFox@reddit

Nevermind, just give me a recipe for Canadian Butter Tarts..

Shipped local LLM-powered SQL generation in a desktop app - Qwen2.5-Coder, fully on-device, with auto self-healing

Posted by Pitiful_Comedian_834@reddit | LocalLLaMA | View on Reddit | 4 comments

[-]

RipperFox@reddit

Who needs an extra "SQL Workbench" when you can use hermes, *code, etc. with a local server like llama.cpp (ollama, LM Studio, etc.)?

It looks like there are no plans for smaller GLM models

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 128 comments

[-]

RipperFox@reddit

OMG, you could read that like: "We already HAVE smaller models, but sadly no release plans"

Where is MiniMax M2.7?

Posted by lolwutdo@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

RipperFox@reddit

Did you try GLM 5.1? It uses tokens very efficiently, I'm very impressed..

GLM 5.1 tops the code arena rankings for open models

Posted by Auralore@reddit | LocalLLaMA | View on Reddit | 146 comments

[-]

RipperFox@reddit

I've been experimenting with GLM 5.1 for the last three days, using hermes agent to add features to a llama.cpp fork. I've found it to be incredibly token-efficient in comparison to e.g. Qwen 3.6.

Gemma 4, llama.cpp, tool calls, and tool results - ChatGPT fixed it for me

Posted by TheProgrammer-231@reddit | LocalLLaMA | View on Reddit | 48 comments

[-]

RipperFox@reddit

Did you already file an issue at llama.cpp's github? They include templates, so they'll have to update, too!

Choice for agentic LLM or help optimize Qwen3.5-35B-A3B for 24GB VRAM

Posted by marivesel@reddit | LocalLLaMA | View on Reddit | 24 comments

[-]

RipperFox@reddit

> -ctx-size 262144

There goes your VRAM.. btw: You can ASK your Qwen3.5 about these terms :)
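A back-of-the-envelope sketch of why a 262144-token context eats VRAM. It assumes a plain full-attention transformer with an unquantized f16 KV cache; the layer and head counts are hypothetical placeholders, not the actual Qwen3.5-35B-A3B configuration (hybrid or linear-attention layers and KV quantization would shrink the number).

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, f16 cache.
for ctx in (8_192, 262_144):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"ctx={ctx}: {gib:.1f} GiB KV cache")
# prints:
# ctx=8192: 1.0 GiB KV cache
# ctx=262144: 32.0 GiB KV cache
```

With those assumed dimensions the cache alone exceeds a 24 GB card before any weights are loaded.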

Research: how do you handle persistent context/memory with local models?

Posted by Mammoth_Resolve4418@reddit | LocalLLaMA | View on Reddit | 2 comments

[-]

RipperFox@reddit

Something like https://hindsight.vectorize.io/ ?

Get 30K more context using Q8 mmproj with Gemma 4

Posted by Sadman782@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

RipperFox@reddit

Try with a fixed seed and temp 0 to get "the same" results - I'm guessing degradation in vision may not be that easy to detect..
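A minimal sketch of what such a repeatable request could look like, assuming llama-server's `/completion` endpoint and its `temperature`, `seed` and `n_predict` fields. The prompt, seed value and token limit are placeholders, the image input is omitted, and the body is only built and printed, not sent:

```python
import json

def deterministic_payload(prompt: str, seed: int = 42, n_predict: int = 128) -> dict:
    """Request body with greedy sampling (temperature 0) and a fixed seed."""
    return {
        "prompt": prompt,
        "temperature": 0.0,   # greedy decoding, no sampling randomness
        "seed": seed,         # fixed seed, in case any sampler still uses it
        "n_predict": n_predict,
    }

# Placeholder prompt; a vision test would attach the same image each time.
body = json.dumps(deterministic_payload("Describe the image."))
print(body)
```

Sending the identical body once with the F16 mmproj and once with the Q8 mmproj gives two outputs that can be diffed directly.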

Harmonic-9B - Two-stage Qwen3.5-9B fine-tune (Stage 2 still training)

Posted by Crampappydime@reddit | LocalLLaMA | View on Reddit | 7 comments

[-]

RipperFox@reddit

Thanks - did you compare against CoPaw 9b (also a finetune)?

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months

Posted by Secure_Archer_1529@reddit | LocalLLaMA | View on Reddit | 196 comments

[-]

RipperFox@reddit

Also massive parallelism - not that much better than Strix Halo in single-user use, but you can do multiple jobs at once..

What should a new SysAdmin know first?

Posted by drake90001@reddit | sysadmin | View on Reddit | 65 comments

[-]

RipperFox@reddit

Backup & restore + CYA

Llama.cpp developers right now

Posted by ML-Future@reddit | LocalLLaMA | View on Reddit | 99 comments

[-]

RipperFox@reddit

No work on this in llama.cpp yet? *g* https://www.reddit.com/r/HowToAIAgent/comments/1rvmad3/how_to_get_125_more_performance_from_the_same_gpu/ (https://github.com/MoonshotAI/Attention-Residuals)

Copaw-9B (Qwen3.5 9b, alibaba official agentic finetune) is out

Posted by kironlau@reddit | LocalLLaMA | View on Reddit | 64 comments

[-]

RipperFox@reddit

Had a second run, less biased by my input :)

Copaw-9B (Qwen3.5 9b, alibaba official agentic finetune) is out

Posted by kironlau@reddit | LocalLLaMA | View on Reddit | 64 comments

[-]

RipperFox@reddit

> Qual papel destes dados na quantizacao?

Subject: Q4 vs. IQ4 Quantization for LLMs – Key Differences & How "I" Works

English Version:

At its core, the difference between Q4 and IQ4 quantization lies in how they handle information loss during compression.

Q4 (4-bit Quantization): This is a straightforward reduction of model weights to 4 bits per weight. It aggressively compresses the model but often leads to noticeable accuracy drops because fine-grained numerical details are discarded.

IQ4 (Information-Preserving Q4): This is a hybrid approach. It still uses 4-bit quantization but incorporates techniques to protect critical information. The "I" stands for "Information" and refers to an additional training phase where the quantized model is fine-tuned using synthetically generated training prompts (e.g., via curriculum learning or a “staircase” method). These prompts are designed to reinforce essential patterns in the quantized weight space, mitigating the quality loss from raw quantization.

In short: Q4 compresses blindly; IQ4 compresses intelligently by retraining with targeted examples to preserve performance.

This answer was generated using agentscope-ai_CoPaw-Flash-9B-Q5_K_M.

Versão em Português:

A diferença fundamental entre Q4 e IQ4 na quantização de LLMs está no manejo da perda de informação durante a compressão.

Q4 (Quantisização de 4 bits): É uma redução direta dos pesos do modelo para 4 bits por peso. Comprime agressivamente, mas frequentemente causa quedas significativas de precisão, pois detalhes numéricos finos são descartados.

IQ4 (Information-Preserving Q4): É uma abordagem híbrida. Mantém a quantização em 4 bits, mas incorpora técnicas para proteger informações críticas. O "I" significa "Informação" e refere-se a uma fase adicional de treinamento onde o modelo quantizado é ajustado usando prompts de treino sintéticos gerados (por exemplo, via aprendizado curricular ou método “escalão”). Esses prompts visam reforçar padrões essenciais no espaço de pesos quantizados, compensando a perda de qualidade da quantização bruta.

Em resumo: Q4 comprime cegamente; IQ4 comprime inteligentemente, retraindo com exemplos direcionados para preservar o desempenho.

Esta resposta foi gerada usando agentscope-ai_CoPaw-Flash-9B-Q5_K_M.

---

I think that's somewhat impressive, that you can ask the model to explain, translate and format for reddit :) Generation was ~66tps on a RTX3080 10GB

Copaw-9B (Qwen3.5 9b, alibaba official agentic finetune) is out

Posted by kironlau@reddit | LocalLLaMA | View on Reddit | 64 comments

[-]

RipperFox@reddit

> bartowski

He's fast - 2 minutes ago: :) https://huggingface.co/bartowski/agentscope-ai_CoPaw-Flash-9B-GGUF

I'm building a benchmark comparing models for an agentic task. Are there any small models I should be testing that I haven't?

Posted by nickl@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

RipperFox@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1s4l4x4/update_on_general_reasoning_for_local_16gb_m4/ Tried this model but tool calling was abysmal - I couldn't get it to work correctly :(

Customer doesn't understand the difference between a HDD and a SSD

Posted by FreaksLP3000@reddit | talesfromtechsupport | View on Reddit | 30 comments

[-]

RipperFox@reddit

OP, you didn't get a chance to remote into that server to let the customer actually *show* you their problem? Or visit them on site? How long have you been doing tech support?