FluoroquinolonesKill

google/gemma-4-12B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

[-]

Gemma 4 MTP released

Posted by rerri@reddit | LocalLLaMA | View on Reddit | 301 comments

[-]

FluoroquinolonesKill@reddit

Hoping someone can clear this up. I thought speculative decoding was only useful if you could load the entire main model into VRAM. Happy to be corrected.

Local AI is the best

Posted by fake_agent_smith@reddit | LocalLLaMA | View on Reddit | 60 comments

[-]

This. That is probably what is motivating bro to work like that to begin with. Bro needs to do serious self examination. Source: me, an intellectual, judging people for making mistakes I recently learned to stop making.

If it works - don’t touch it: COMPETITION

Posted by awfulalexey@reddit | LocalLLaMA | View on Reddit | 112 comments

[-]

FluoroquinolonesKill@reddit

Indeed, mounting fans is a pretentious waste of time.

Gemma 4 - lazy model or am I crazy? (bit of a rant)

Posted by Pyrenaeda@reddit | LocalLLaMA | View on Reddit | 151 comments

[-]

FluoroquinolonesKill@reddit

Same. It wasn’t having it when I tried to give it the current date for web searches.

Gemma 4 31B vs Qwen 3.5 27B: Which is best for long context worklows? My THOUGHTS...

Posted by GrungeWerX@reddit | LocalLLaMA | View on Reddit | 174 comments

[-]

FluoroquinolonesKill@reddit

I am really frustrated at my good writing - and anyone’s good writing - now being mistaken for AI because people are too dumb to tell the difference.

More Gemma4 fixes in the past 24 hours

Posted by andy2na@reddit | LocalLLaMA | View on Reddit | 120 comments

[-]

FluoroquinolonesKill@reddit

Do we need custom templates with the latest GGUFs, or are the template fixes now embedded in the GGUFs?

Gemma 4 on Llama.cpp should be stable now

Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 167 comments

[-]

FluoroquinolonesKill@reddit

> I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems

On 26B A4B too?

I think my Gemma4 is having a breakdown

Posted by MrSilencerbob@reddit | LocalLLaMA | View on Reddit | 20 comments

[-]

FluoroquinolonesKill@reddit

Yeah Gemma was not having it when I tried to tell it what today’s date is. That seems like it should be something that any model should be able to accept. Hopefully it gets ironed out.

It looks like we’ll need to download the new Gemma 4 GGUFs

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 147 comments

[-]

FluoroquinolonesKill@reddit

What about the non-imatrix ones? Do those need to be re-downloaded? E.g., UD-Q8_K_XL and UD-Q4_K_XL.

so…. Qwen3.5 or Gemma 4?

Posted by MLExpert000@reddit | LocalLLaMA | View on Reddit | 121 comments

[-]

FluoroquinolonesKill@reddit

Gemma’s prose is better, but Qwen seems more chatty, engaged, and friendly.

Gemma 4 has been abliterated

Posted by coder3101@reddit | LocalLLaMA | View on Reddit | 26 comments

[-]

FluoroquinolonesKill@reddit

Yep. She goons.

Gemma 4 has been released

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 702 comments

[-]

FluoroquinolonesKill@reddit

Yes, local LLMs seem particularly well suited for that task.

Gemma 4 has been released

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 702 comments

[-]

FluoroquinolonesKill@reddit

Um...holy shit this thing has no qualms about enterprise resource planning. ;)

#OpenSource4o Movement Trending on Twitter/X - Release Opensource of GPT-4o

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 183 comments

[-]

FluoroquinolonesKill@reddit

Let ‘um goon.

I fine-tuned Qwen3.5-27B with 35k examples into an AI companion - after 2,000 conversations here’s what actually matters for personality

Posted by Crypto_Stoozy@reddit | LocalLLaMA | View on Reddit | 59 comments

[-]

FluoroquinolonesKill@reddit

She big mad

Heretic has FINALLY defeated GPT-OSS with a new experimental decensoring method called ARA

Posted by pigeon57434@reddit | LocalLLaMA | View on Reddit | 152 comments

[-]

FluoroquinolonesKill@reddit

Makes sense! I tried the ARA GPT-OSS, and it seems like the goal has been achieved.

Heretic has FINALLY defeated GPT-OSS with a new experimental decensoring method called ARA

Posted by pigeon57434@reddit | LocalLLaMA | View on Reddit | 152 comments

[-]

FluoroquinolonesKill@reddit

Isn’t part of the issue that GPT-OSS was not trained on “sensitive data,” so even if it does not issue a refusal, the response might not be desirable?

webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 59 comments

[-]

FluoroquinolonesKill@reddit

No. In the chat turn, there’s the error message, and there’s a little arrow to expand it and then an option to enable the proxy. It took me 15 minutes this morning to find it, because I was not expecting to have to enable the option there. And, that was even after I passed the flag to enable the proxy.

webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 59 comments

[-]

FluoroquinolonesKill@reddit

Expand the thingy below the error. There it will allow you to enable the proxy. Clearly some UI cleanup is needed, but this is just a preview.

webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 59 comments

[-]

FluoroquinolonesKill@reddit

Thanks! Possible bug: When I enable an MCP server in the global settings, it does not remember. So, when I start a new chat, I have to re-enable the MCP server either in the chat or the global settings. I.e., starting a new chat and then inspecting the global settings shows the MCP server disabled, despite the fact that it was previously enabled.

Qwen3.5 35b UD Q4 K XL Prior to 3/5 worked great, now not so much...

Posted by thejacer@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

FluoroquinolonesKill@reddit

> thought jinja enabled was default

According to the help file, it is indeed. I do not pass --jinja on my end.

webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 59 comments

[-]

FluoroquinolonesKill@reddit

This is huge. Does anyone know of an MCP server that can send web searches to, say, DuckDuckGo? Is that a thing?

Qwen3.5 "Low Reasoning Effort" trick in llama-server

Posted by coder543@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

FluoroquinolonesKill@reddit

This works when I set it in the WebUI, but it does not work when I try to pass the parameters in the .ini file like this: `logit-bias = 248069+13.3` `grammar = "root ::= pre <[248069]> post\npre ::= !<[248069]>*\npost ::= !<[248069]>*"` Any ideas?
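For comparison, the same two settings can also be sent per request rather than through the .ini file. This is a minimal sketch, assuming llama-server's `/completion` endpoint takes `logit_bias` as a list of `[token_id, bias]` pairs and `grammar` as a GBNF string. The token id and the grammar are copied from the comment above; `build_payload` is a hypothetical helper, and nothing is actually sent to a server here.

```python
import json

# Token id and bias value copied from the comment above.
BIASED_TOKEN_ID = 248069
BIAS = 13.3

# Same grammar as in the comment, written with real newlines
# instead of the escaped "\n" used in the .ini line.
GRAMMAR = (
    "root ::= pre <[248069]> post\n"
    "pre ::= !<[248069]>*\n"
    "post ::= !<[248069]>*"
)


def build_payload(prompt):
    """Build a request body (hypothetical helper, not part of llama.cpp)."""
    return {
        "prompt": prompt,
        # Assumed format: list of [token_id, bias] pairs.
        "logit_bias": [[BIASED_TOKEN_ID, BIAS]],
        "grammar": GRAMMAR,
    }


payload = build_payload("Hello")
print(json.dumps(payload, indent=2))
```

If the per-request form works and the .ini form does not, that would point at how the .ini values are parsed (for example the escaped newlines or the quoting) rather than at the settings themselves.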

Qwen 3.5 27-35-122B - Jinja Template Modification (Based on Bartowski's Jinja) - No thinking by default - straight quick answers, need thinking? simple activation with "/think" command anywhere in the system prompt.

Posted by -Ellary-@reddit | LocalLLaMA | View on Reddit | 26 comments

[-]

FluoroquinolonesKill@reddit

In llama.cpp, can’t you just do --reasoning-budget 0? That’s what I did. Seems to work fine.
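As a rough sketch of what that looks like when launching the server from a script: the snippet below only builds the command line and prints it. The model path `model.gguf` and the helper name are placeholders, not anything from this thread.

```python
import shlex


def llama_server_command(model_path, reasoning_budget=0):
    """Return the argv list for llama-server (hypothetical helper)."""
    return [
        "llama-server",
        "-m", model_path,
        # 0 disables reasoning, as described in the comment above.
        "--reasoning-budget", str(reasoning_budget),
    ]


cmd = llama_server_command("model.gguf")  # placeholder model path
print(shlex.join(cmd))
```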

Qwen 3.5 Jinja Template – Restores Qwen /no_thinking behavior!

Posted by Substantial_Swan_144@reddit | LocalLLaMA | View on Reddit | 14 comments

[-]

FluoroquinolonesKill@reddit

In llama.cpp, can’t you just do --reasoning-budget 0? That’s what I did. Seems to work fine.

You can use Qwen3.5 without thinking

Posted by guiopen@reddit | LocalLLaMA | View on Reddit | 86 comments

[-]

FluoroquinolonesKill@reddit

That’s what I did. Seems to work fine. Is one method or the other better?

Nemo 30B is insane. 1M+ token CTX on one 3090

Posted by Dismal-Effect-1914@reddit | LocalLLaMA | View on Reddit | 112 comments

[-]

FluoroquinolonesKill@reddit

Do you really need --cpu-moe with --fit? I thought --fit would handle all that.

ACE-Step-1.5 has just been released. It’s an MIT-licensed open source audio generative model with performance close to commercial platforms like Suno

Posted by iGermanProd@reddit | LocalLLaMA | View on Reddit | 138 comments

[-]

FluoroquinolonesKill@reddit

A dataset of what?

GLM 4.7 Flash: Huge performance improvement with -kvu

Posted by TokenRingAI@reddit | LocalLLaMA | View on Reddit | 72 comments

[-]

FluoroquinolonesKill@reddit

Doesn't make a difference on my 5060 rig and the latest llama.cpp build.

KV cache fix for GLM 4.7 Flash

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 73 comments

[-]

FluoroquinolonesKill@reddit

This at least doubles the speed on my rig. Now I am getting about 30 t/s. Before, I was getting about 10-13 t/s.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700)

Posted by sloptimizer@reddit | LocalLLaMA | View on Reddit | 99 comments

[-]

FluoroquinolonesKill@reddit

Does it goon good?

GLM 4.7 Flash official support merged in llama.cpp

Posted by ayylmaonade@reddit | LocalLLaMA | View on Reddit | 64 comments

[-]

FluoroquinolonesKill@reddit

First impression: Running with the llama.cpp WebUI. reasoning-budget = 0 disables the reasoning. I am using temp = 1.0, top-k = 64, min-p = 0.00, top-p = 0.95, and dry-multiplier = 1.1. I am impressed with its ability to do role play and therapy. I have not seen any GPT slop, e.g. "it's not x, but y." I am getting about 8 t/s with flash attention off. Hopefully the speed improves. This might be a great candidate for fine tuning for role play.
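For anyone who wants to reproduce those sampler settings outside the WebUI, here is a minimal sketch of a request body. It assumes llama-server's `/completion` endpoint accepts the field names `temperature`, `top_k`, `min_p`, `top_p`, `dry_multiplier`, and `n_predict`. The values are the ones listed above; `completion_request` and the `n_predict` default are illustrative only, and no request is sent.

```python
# Sampler values copied from the comment above.
SAMPLER_SETTINGS = {
    "temperature": 1.0,
    "top_k": 64,
    "min_p": 0.0,
    "top_p": 0.95,
    "dry_multiplier": 1.1,
}


def completion_request(prompt, n_predict=256):
    """Build a request body with the sampler settings (hypothetical helper)."""
    body = {"prompt": prompt, "n_predict": n_predict}
    body.update(SAMPLER_SETTINGS)
    return body


request_body = completion_request("Hello")
print(request_body)
```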

GLM 4.7 Flash official support merged in llama.cpp

Posted by ayylmaonade@reddit | LocalLLaMA | View on Reddit | 64 comments

[-]

FluoroquinolonesKill@reddit

Setting reasoning-budget = 0 in llama-server.exe prevents it from reasoning on my machine.

If you dont think Ai is an emergency you are about to have issues...

Posted by CannyGardener@reddit | preppers | View on Reddit | 813 comments

[-]

FluoroquinolonesKill@reddit

> Thing is, AI can't remain this inexpensive forever.

Yes it can. Local AI is free aside from fixed setup costs. Small local models are extremely useful for the types of tasks OP said he automated. People don’t realize how powerful local AI can be, even on consumer hardware.

> IMO the AI bubble collapsing is a far greater concern than AI taking all our jobs. That bubble is far larger than the housing bubble was, and is the only part of the US's GDP that has seen growth in the last 6 months. Most of the money caught up in this is investors hoping to get a return, and it's increasingly clear that they won't be able to get that return.

There are significant differences between this and the housing bubble. A lot of the investment in AI is coming from actual cash that large companies have on hand, unlike the sub-prime derivatives market that popped the housing bubble.

Mistral Small Creative!?

Posted by LoveMind_AI@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

FluoroquinolonesKill@reddit

Thanks. Apparently, there is a bug in the llama.cpp Web UI I am using which was causing the wrong sampler settings to be applied, so all my tests are off. Sigh.

Mistral Small Creative!?

Posted by LoveMind_AI@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

FluoroquinolonesKill@reddit

What sampling settings are you using?

Mistral Small Creative!?

Posted by LoveMind_AI@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

FluoroquinolonesKill@reddit

I found 14B instruct follows prompts too rigidly. 14B reasoning without reasoning enabled seems much better.

My little decentralized Locallama setup, 216gb VRAM

Posted by Goldkoron@reddit | LocalLLaMA | View on Reddit | 154 comments

[-]

FluoroquinolonesKill@reddit

Macs are boutique products aimed at receptions of hair salons. They are used by bouffanted ponce gaylords who cannot handle anything more complex than one mouse button. They are hideously expensive and utterly restrictive. Real men use windows and get the fucking job done with raw power and unlimited options.

Mistral 3 14b against the competition ?

Posted by EffectiveGlove1651@reddit | LocalLLaMA | View on Reddit | 25 comments

[-]

FluoroquinolonesKill@reddit

I think it is just naturally wild. Perhaps the recommended sampling values are not tight enough. Try playing with the sampling parameters. I tried cranking Top-K down to 5, and the results are much more controlled and coherent. Here is a helpful link that lets you play with sampling parameters in isolation to see what they do: [https://artefact2.github.io/llm-sampling/index.xhtml](https://artefact2.github.io/llm-sampling/index.xhtml)
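To make the Top-K effect concrete, here is a toy sketch in plain Python. It is not how llama.cpp implements its sampler, and the logits are made up. It keeps the k highest-scoring tokens and renormalizes, so a smaller k concentrates probability on the top candidates.

```python
import math


def top_k_probs(logits, k):
    """Keep the k highest logits, then apply softmax over the survivors."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


# Made-up logits for a six-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0, -3.0]

wide = top_k_probs(logits, 6)   # every token stays eligible
tight = top_k_probs(logits, 2)  # only the two best tokens survive

print("k=6:", {i: round(p, 3) for i, p in wide.items()})
print("k=2:", {i: round(p, 3) for i, p in tight.items()})
```

With the tighter cutoff the low-probability tail can never be sampled, which is one plausible reason the output reads as more controlled.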

My experiences with the new Ministral 3 14B Reasoning 2512 Q8

Posted by egomarker@reddit | LocalLLaMA | View on Reddit | 106 comments

[-]

FluoroquinolonesKill@reddit

I added it as a system prompt. I placed it right before my main system prompt. Unsloth has one with a little different formatting. Here is Unsloth's:

`<s>[SYSTEM_PROMPT]# HOW YOU SHOULD THINK AND ANSWER`

`First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.`

`Your thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response to the user.[/THINK]Here, provide a self-contained response.[/SYSTEM_PROMPT][INST]What is 1+1?[/INST]2</s>[INST]What is 2+2?[/INST]`

I was able to get it working - for now - using this portion of Unsloth's, which again, is placed right before my main system prompt.

`[SYSTEM_PROMPT]# HOW YOU SHOULD THINK AND ANSWER`

`First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.`

`Your thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response to the user.[/THINK]Here, provide a self-contained response.[/SYSTEM_PROMPT]`
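If you build the prompt in a script instead of pasting it into the WebUI, the "reasoning block first, main system prompt second" ordering might look like the sketch below. The reasoning text is the working portion quoted above. The blank lines between its three parts and the sample main prompt are assumptions, and `build_system_prompt` is a hypothetical helper.

```python
# The three parts of the working prompt quoted above. Joining them with
# blank lines is an assumption about the original line breaks.
REASONING_PROMPT = "\n\n".join([
    "[SYSTEM_PROMPT]# HOW YOU SHOULD THINK AND ANSWER",
    "First draft your thinking process (inner monologue) until you arrive "
    "at a response. Format your response using Markdown, and use LaTeX for "
    "any mathematical equations. Write both your thoughts and the response "
    "in the same language as the input.",
    "Your thinking process must follow the template below:[THINK]Your "
    "thoughts or/and draft, like working through an exercise on scratch "
    "paper. Be as casual and as long as you want until you are confident "
    "to generate the response to the user.[/THINK]Here, provide a "
    "self-contained response.[/SYSTEM_PROMPT]",
])


def build_system_prompt(main_prompt):
    """Place the reasoning block right before the main system prompt."""
    return REASONING_PROMPT + "\n\n" + main_prompt


# Placeholder main prompt, for illustration only.
system_prompt = build_system_prompt("You are a helpful assistant.")
print(system_prompt)
```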

My experiences with the new Ministral 3 14B Reasoning 2512 Q8

Posted by egomarker@reddit | LocalLLaMA | View on Reddit | 106 comments

[-]

FluoroquinolonesKill@reddit

I added it as a system prompt. I placed it right before my main system prompt. Unsloth has one with a little different formatting. I will try it and report back.

My experiences with the new Ministral 3 14B Reasoning 2512 Q8

Posted by egomarker@reddit | LocalLLaMA | View on Reddit | 106 comments

[-]

FluoroquinolonesKill@reddit

Thank you. That seems to work. It's kind of a workflow faff, having to add that for only this model. Hopefully future iterations will make it unnecessary. Perhaps other front ends can make this easier, but I am using llama.cpp's new Web UI, which is pretty basic.

My experiences with the new Ministral 3 14B Reasoning 2512 Q8

Posted by egomarker@reddit | LocalLLaMA | View on Reddit | 106 comments

[-]

FluoroquinolonesKill@reddit

I have been testing out Ministral-3 8b and 14b reasoning. I am really liking 14b a lot. I compared it to Gemma-3 12b and 27b, and I am liking Ministral-3 14b more. Ministral-3 might actually replace Gemma-3 for my RP/creative writing daily driver. I tried the reasoning model, but the reasoning seems broken. Something might be wrong with the chat template. Idk. All my tests have been with the reasoning model where it is just working without reasoning. I am about to try the instruct model to compare.

Mistral 3 Blog post

Posted by rerri@reddit | LocalLLaMA | View on Reddit | 173 comments

[-]

FluoroquinolonesKill@reddit

Ministral-3 has been released

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 61 comments

[-]

FluoroquinolonesKill@reddit

The reasoning models (8b and 14b) are not reasoning. Is there something wrong with the embedded chat template? I tried the Unsloth and MistralAI GGUFs from a few hours ago. I am using the latest llama.cpp. It looks like Unsloth has updated the GGUFs as of 20 minutes ago. I am pulling them now and will report back to this comment.

I have a RTX5090 and an AMD AI MAX+ 95 128GB. Which benchmark do you want me to run?

Posted by foogitiff@reddit | LocalLLaMA | View on Reddit | 36 comments

[-]

FluoroquinolonesKill@reddit

Yeah, after some research, I came to the same conclusion. It is frustrating that I had to spend so much time researching to determine that this processor - which is marketed as an "AI processor" with tons of memory - is actually not great for dense models. I think I saw one person say they were getting 3 t/s with Gemma-3-27b on Strix Halo. I can get that on my 8GB Nvidia. I can also run Qwen3-30B-A3B just fine. The only thing Strix Halo offers me is the ability to run larger MoE models. Unless and until those models become the norm, Strix Halo is not a good buy for me. All that said, I am grateful for the people who have put Strix Halo through its paces and published their results. They have helped a lot of people make informed decisions.

I have a RTX5090 and an AMD AI MAX+ 95 128GB. Which benchmark do you want me to run?

Posted by foogitiff@reddit | LocalLLaMA | View on Reddit | 36 comments

[-]

FluoroquinolonesKill@reddit

Gemma3-27b: 32k context fully loaded; Q4_K_M. Gemma3-12b: 32k context fully loaded; Q4_K_M. Measure: pp, tps, VRAM usage.

Budget Hardware Recommendations (1.3k)

Posted by xxxmralbinoxxx@reddit | LocalLLaMA | View on Reddit | 5 comments

[-]

FluoroquinolonesKill@reddit

Thanks for your comment. I'm in a similar boat. I want to run Gemma3-27b and Mistral-Small-24b with full context. Would the 64GB (48GB VRAM) AI Max+395 handle that just fine? Would the 128GB (96GB VRAM) be overkill?

if open-webui is trash, whats the next best thing available to use?

Posted by Tricky_Reflection_75@reddit | LocalLLaMA | View on Reddit | 173 comments

[-]

FluoroquinolonesKill@reddit

Yes. It’s great. There are some things they need to enhance, but it just works.