nickless07

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint

Posted by fulgencio_batista@reddit | LocalLLaMA | View on Reddit | 116 comments

google/gemma-4-12B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 276 comments

google/gemma-4-12B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 276 comments

nickless07@reddit

It literally states 'release b9482' - Check out llama.cpp directly, as of now latest is b9493 - Not much behind, but no support for the new model yet.

google/gemma-4-12B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 276 comments

Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 569 comments

nickless07@reddit

I thought it was a normal thing these days after reading headlines like these: [https://www.sciencealert.com/almost-75-of-american-teens-have-used-ai-companions-study-finds](https://www.sciencealert.com/almost-75-of-american-teens-have-used-ai-companions-study-finds) [https://www.forbes.com/sites/johnkoetsier/2025/04/29/80-of-gen-zers-would-marry-an-ai-study/](https://www.forbes.com/sites/johnkoetsier/2025/04/29/80-of-gen-zers-would-marry-an-ai-study/) [https://www.foxcarolina.com/2025/12/11/mental-health-experts-warn-against-ai-companions-70-teens-seek-digital-friendships/](https://www.foxcarolina.com/2025/12/11/mental-health-experts-warn-against-ai-companions-70-teens-seek-digital-friendships/) [https://www.pymnts.com/artificial-intelligence-2/2026/ai-is-becoming-the-new-companion-for-aging-americans/](https://www.pymnts.com/artificial-intelligence-2/2026/ai-is-becoming-the-new-companion-for-aging-americans/)

Misunderstanding memory usage - 11.68gb quantized model takes up 22gb of RAM?

Posted by NotARedditUser3@reddit | LocalLLaMA | View on Reddit | 17 comments

nickless07@reddit

An integrated GPU (iGPU) uses shared system memory instead of having its own dedicated VRAM. The amount can be dynamically allocated by the system.

Misunderstanding memory usage - 11.68gb quantized model takes up 22gb of RAM?

Posted by NotARedditUser3@reddit | LocalLLaMA | View on Reddit | 17 comments

nickless07@reddit

Your operating system automatically maps the GGUF model file into virtual memory (mmap). The OS treats this file cache as expendable RAM. When you combine mmap with **'Keep Model in Memory'**, LM Studio tells the backend to lock those specific pages into the active runtime process so they cannot be paged out.
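To make the mmap part more concrete, here is a minimal Python sketch of reading from a memory-mapped file. It only illustrates the general mechanism, not what LM Studio or llama.cpp do internally; the temporary file is a made-up stand-in for a real model file, `read_mapped_slice` is a hypothetical helper, and page locking is not shown.

```python
import mmap
import os
import tempfile


def read_mapped_slice(path, offset, length):
    """Map a file read-only and return a slice of it.

    The OS only loads the pages that are actually touched, and those
    pages live in the file cache, which it may evict under pressure.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file (the file must not be empty).
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])


# Stand-in for a model file: GGUF files start with the magic bytes "GGUF".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"GGUF" + bytes(4096))
    demo_path = tmp.name

magic = read_mapped_slice(demo_path, 0, 4)
os.remove(demo_path)
```

Only the first page of the stand-in file has to be read to get `magic`, which is why a mapped model can show up as cache rather than as memory owned by the process until the pages are locked.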

Misunderstanding memory usage - 11.68gb quantized model takes up 22gb of RAM?

Posted by NotARedditUser3@reddit | LocalLLaMA | View on Reddit | 17 comments

nickless07@reddit

Turn off 'Keep model in memory'. It will still take a good amount of RAM, as you figured from the file size, but it won't hold a copy of the model for quick reload/swap.

How do I improve my T/S

Posted by KneelB4S8n@reddit | LocalLLaMA | View on Reddit | 9 comments

For those creating personal assistants locally - how has short/long term memory impacted your experience?

Posted by GrungeWerX@reddit | LocalLLaMA | View on Reddit | 50 comments

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

Posted by old-mike@reddit | LocalLLaMA | View on Reddit | 41 comments

nickless07@reddit

Idk. I didn't use mmproj and only 64k ctx. Main llama.cpp with F16:

51.08.459.904 I slot print\_timing: id 0 | task 10054 | prompt eval time = 21635.15 ms / 3756 tokens ( 5.76 ms per token, 173.61 tokens per second)
51.08.459.911 I slot print\_timing: id 0 | task 10054 | eval time = 106836.28 ms / 2450 tokens ( 43.61 ms per token, 22.93 tokens per second)
51.08.459.913 I slot print\_timing: id 0 | task 10054 | total time = 128471.43 ms / 6206 tokens
51.08.459.914 I slot print\_timing: id 0 | task 10054 | graphs reused = 12373

and well....

slot print\_timing: id 0 | task 0 |
prompt eval time = 15287.59 ms / 5185 tokens ( 2.95 ms per token, 339.16 tokens per second)
eval time = 3577.80 ms / 89 tokens ( 40.20 ms per token, 24.88 tokens per second)
total time = 18865.39 ms / 5274 tokens
slot release: id 0 | task 0 | stop processing: n\_tokens = 5273, truncated = 0

Same quant, same card, same params... ohh, I forgot, I set it to 100W:

\+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94 Driver Version: 560.94 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 WDDM | 00000000:01:00.0 On | N/A |
| 54% 46C P2 39W / 100W | 11579MiB / 12288MiB | 0% Default |
| | | N/A |
\+-----------------------------------------+------------------------+----------------------+

But yeah, I don't think that 10W will boost it to 30+ token/s.
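For anyone comparing their own runs, the tokens-per-second figures in those timing lines are just token count divided by elapsed seconds. A small sketch that recomputes them from pasted log text; the regex and the helper names are assumptions written for this log layout, not part of llama.cpp:

```python
import re

# Matches "prompt eval time = 15287.59 ms / 5185 tokens" and
# "eval time = 3577.80 ms / 89 tokens" as they appear in the log above.
TIMING_RE = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")


def tokens_per_second(elapsed_ms, n_tokens):
    """Throughput = tokens / seconds."""
    return n_tokens / (elapsed_ms / 1000.0)


def parse_timings(log_text):
    """Return {'prompt eval': t/s, 'eval': t/s} for the timings found."""
    result = {}
    for kind, ms, tokens in TIMING_RE.findall(log_text):
        result[kind] = tokens_per_second(float(ms), int(tokens))
    return result


sample = (
    "prompt eval time = 15287.59 ms / 5185 tokens ( 2.95 ms per token) "
    "eval time = 3577.80 ms / 89 tokens ( 40.20 ms per token)"
)
speeds = parse_timings(sample)
```

The `eval` value is the generation speed, which is the number to compare against the 37 t/s claimed in the thread title.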

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

Posted by old-mike@reddit | LocalLLaMA | View on Reddit | 41 comments

nickless07@reddit

No worries. I can't replicate that with a 3060 either. I tested it and never managed to get anywhere close. There is not much speed difference between main llama.cpp with F16 KV and [buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) with turbo4 quant.

Is there any case of a less quantised smaller model outperforming a more quantised larger model?

Posted by opoot_@reddit | LocalLLaMA | View on Reddit | 25 comments

nickless07@reddit

Yeah, they are sometimes hard to find. I stumbled upon them at [https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)

Is there any case of a less quantised smaller model outperforming a more quantised larger model?

Posted by opoot_@reddit | LocalLLaMA | View on Reddit | 25 comments

nickless07@reddit

You mean things like this [https://kaitchup.substack.com/p/summary-of-qwen36-gguf-evals-updating](https://kaitchup.substack.com/p/summary-of-qwen36-gguf-evals-updating)

Are GPU prices hitting peak and falling?

Posted by DistanceSolar1449@reddit | LocalLLaMA | View on Reddit | 39 comments

nickless07@reddit

No, they already have more than enough money to run the whole company for decades without selling a single product. Why should they 'beg' at all?

Are GPU prices hitting peak and falling?

Posted by DistanceSolar1449@reddit | LocalLLaMA | View on Reddit | 39 comments

nickless07@reddit

You mean after Nvidia just did [that](https://www.tomshardware.com/tech-industry/artificial-intelligence/nvidia-no-longer-reports-sales-of-graphics-solutions-as-a-separate-segment-posts-eye-watering-usd81-6-billion-q1-profit-thanks-to-ai-boom)? Sure, you will get a small discount if you buy a couple thousand H200s, but aside from that it will get worse.

qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

Posted by xxvegas@reddit | LocalLLaMA | View on Reddit | 11 comments

nickless07@reddit

Depends... With your initial settings you create a bottleneck on your PCIe: KV in VRAM and compute on CPU? Well, every token has to be sent to VRAM, get quantized to q8, be sent back, get processed and then stored again. Yes, the VRAM is much faster, but 6GB is a bit rough. 18-20 tps is already decent. Have you tested KV on CPU too? Or even not using the GTX at all? [Here](https://www.reddit.com/r/LocalLLaMA/comments/1t2zapy/pushing_a_5yearold_6gb_vram_laptop_to_its_limits/) is some way more [in-depth](https://abhinandb.com/#/post/running-qwen-3-6-on-6gb-vram) information. This year we got some decent models that can perform great on old hardware.

qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

Posted by xxvegas@reddit | LocalLLaMA | View on Reddit | 11 comments

nickless07@reddit

Basically just change --n-cpu-moe until you no longer get OOM, then check how much VRAM you have left. More than 2GB? Then go for 24. Less than 500MB? Keep as is, and so on. Check with nvidia-smi for Nvidia cards or rocm-smi for AMD. What I'm missing in your load params is --threads and maybe mmproj (if used); aside from that it looks pretty good for that model, quant and VRAM size.
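That tuning loop could be sketched roughly like this in Python. The thresholds are the ones from the comment; the middle case, the sample numbers and both helper names are assumptions for illustration. The input is one line as printed by `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`.

```python
def free_vram_mib(csv_line):
    """Parse one 'used, total' line (values in MiB) and return free MiB."""
    used, total = (int(part.strip()) for part in csv_line.split(","))
    return total - used


def suggest_n_cpu_moe_change(free_mib):
    """Rule of thumb from the comment; the in-between case is an assumption."""
    if free_mib > 2048:
        # Plenty of headroom: keep fewer expert layers on the CPU.
        return "lower --n-cpu-moe (move more expert layers to the GPU)"
    if free_mib < 500:
        # Nearly full: leave the setting alone.
        return "keep --n-cpu-moe as is"
    return "try a small step down and watch for OOM"


# Hypothetical reading from a 6GB card after loading the model.
free = free_vram_mib("5200, 6144")
advice = suggest_n_cpu_moe_change(free)
```

Re-run the check after each change, since context size and mmproj also eat into the same VRAM.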

qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

Posted by xxvegas@reddit | LocalLLaMA | View on Reddit | 11 comments

nickless07@reddit

Try without MTP and offload fewer layers to the CPU. Right now only the KV cache, vision tower, draft stack and some overhead sit on your 1060; everything else runs on your CPU.

Best solution to generate reports locally with graphs, charts? Beginner question.

Posted by NetZeroSun@reddit | LocalLLaMA | View on Reddit | 12 comments

nickless07@reddit

There are many things out there that can do this. MCP - [https://lmstudio.ai/docs/app/mcp](https://lmstudio.ai/docs/app/mcp), then something like this: [https://lmstudio.ai/bakit/ai-to-pdf/files/src/toolsProvider.ts](https://lmstudio.ai/bakit/ai-to-pdf/files/src/toolsProvider.ts). Or, if you want it available in your local network, use something like [Open WebUI](https://openwebui.com) and so on.

Seeing the activity pop up big time in this sub due to various open models. Most of them require at least 16gb vram. What can I do with 8?

Posted by baked_tea@reddit | LocalLLaMA | View on Reddit | 13 comments

nickless07@reddit

Only 8 [https://www.reddit.com/r/LocalLLaMA/comments/1t2zapy/pushing\_a\_5yearold\_6gb\_vram\_laptop\_to\_its\_limits/](https://www.reddit.com/r/LocalLLaMA/comments/1t2zapy/pushing_a_5yearold_6gb_vram_laptop_to_its_limits/) Lucky you we have 2026.

Qwen3.6 35b-a3b 🤯

Posted by EffectiveMedium2683@reddit | LocalLLaMA | View on Reddit | 118 comments

nickless07@reddit

Yeah, and whenever that one got stuck, I switched to the 122B to fix the problem. For example, I was working on a prompt with Gemma 4 for hours, with a lot of back and forth, and then it was running in circles. I switched to qwen3.5 and got it done in a single shot. It is just that my English isn't the best, and where Gemma beats around the bush, the 122B qwen jumps straight in with the right phrases.

Qwen3.6 35b-a3b 🤯

Posted by EffectiveMedium2683@reddit | LocalLLaMA | View on Reddit | 118 comments

nickless07@reddit

Idk man. I can only run Qwen3.5-122B-A10B-UD-IQ2\_XXS at \~4 token/s, and for the few runs I used it, the writing style was much better than Qwen3.6 in q8. I know MoE models suffer more than dense models from quants, but for me it was pretty decent even in that low quant. Then I read everywhere that it is bad with code (not my use case at all), and it is hard to find any tests that don't aim for coding. I really hope we get that as Qwen3.6 too, however it looks like we are out of luck there.

Qwen3.6 35b-a3b 🤯

Posted by EffectiveMedium2683@reddit | LocalLLaMA | View on Reddit | 118 comments

TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui).

Posted by oobabooga4@reddit | LocalLLaMA | View on Reddit | 221 comments

nickless07@reddit

"Select a file that matches your model. Must be placed in ...user\_data/mmproj/" Where are the settings to change the default path for models, mmproj and so on?

LM Studio - 3 GPUs, one model per GPU as different servers

Posted by MarcusAurelius68@reddit | LocalLLaMA | View on Reddit | 15 comments

nickless07@reddit

Oh, you can run them all on the same port. The API can handle requests addressed to a specific model, which is why you get the API model identifier. Loading model A on GPU 1 and model B on GPU 2 and then sending the requests to the desired model on the same port is not a big deal. However, if you need different ports you also need different processes.
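A rough sketch of what "same port, different model" looks like from the client side, using an OpenAI-style chat completions request. The URL (localhost on port 1234) and the identifiers `model-a` / `model-b` are placeholders I made up; use whatever identifiers your server shows for the loaded models.

```python
import json
import urllib.request

# Assumed local server address; adjust to your own setup.
BASE_URL = "http://localhost:1234/v1/chat/completions"


def build_chat_request(model_id, prompt, url=BASE_URL):
    """Build a POST request whose 'model' field selects the target model."""
    body = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Two requests to the same port, routed by the model identifier.
req_a = build_chat_request("model-a", "Hello from GPU 1")
req_b = build_chat_request("model-b", "Hello from GPU 2")
# urllib.request.urlopen(req_a) would send it; omitted so this runs without a server.
```

The only thing that differs between the two requests is the `model` value, so no extra ports or processes are involved on the client side.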

LM Studio - 3 GPUs, one model per GPU as different servers

Posted by MarcusAurelius68@reddit | LocalLLaMA | View on Reddit | 15 comments

nickless07@reddit

You need 3 servers for that, so either run LM Studio 3 times, ollama 3 times or llama.cpp 3 times and so on. It was not possible before, and it still isn't with only a single instance.

Qwen3.6 35b-a3b 🤯

Posted by EffectiveMedium2683@reddit | LocalLLaMA | View on Reddit | 118 comments

What's the current best small model?

Posted by Conscious_Nobody9571@reddit | LocalLLaMA | View on Reddit | 52 comments

nickless07@reddit

Yeah, the first time I thought, how can the vision be worse than Gemma 3? I had to dig a bit more into it to get it working properly, but it is really good even with details and physics, and sometimes handles them slightly better than Qwen.

Does 'preserve_thinking' work with openwebui?

Posted by sterby92@reddit | LocalLLaMA | View on Reddit | 33 comments

nickless07@reddit

Looks a bit different now:

{ "role": "user", "content": "\[08/05/2026, Friday, 05:09:28 PM\]\\nhmm do you know whow to deal with \\"8197a522-c63f-4681-8ab0-58c558af5ef9\\" ?" }, { "role": "assistant", "content": "<details type=\\"reasoning\\" done=\\"false\\">\\n<summary>T... <Truncated in logs> ... ID. Let\&#x27;s check.\\n\&gt; I will call\\n</details>" }, { "role": "user", "content": "\[11/05/2026, Monday, 06:35:26 PM\]\\nhmm" } \], "tools": \[

I stopped generation mid-thinking, so the "role": "assistant" message only contains CoT, no finished reply. However, the full content still gets sent. Perhaps a parsing error? https://preview.redd.it/lonfbtu5ej0h1.png?width=2391&format=png&auto=webp&s=597f2ebe859645418f75d1427a304797a649d6cd

Does 'preserve_thinking' work with openwebui?

Posted by sterby92@reddit | LocalLLaMA | View on Reddit | 33 comments

nickless07@reddit

0.92, and the full thing gets sent:

Received request: POST to /v1/chat/completions with body { "stream": true, "model": "qwen3.6-35b-a3b", "messages": \[ { "role": "user", "content": "hmm do you know whow to deal with \\"8197a522-c63f-4681-8ab0-58c558af5ef9\\" ?" }, { "role": "assistant", "content": "<think>The user is asking about a specific ID: \\"81... <Truncated in logs> ...not have this ID. Let's check.\\nI will call</think>" }, { "role": "user", "content": "hmm" } \], "tools": \[

Let me update and test again.

Does 'preserve_thinking' work with openwebui?

Posted by sterby92@reddit | LocalLLaMA | View on Reddit | 33 comments

nickless07@reddit

Your content looks more like a high temp than anything 'preserved'. Maybe it will tell you it is locking in a number after 10 more turns too. Can you log the incoming tokens? For me it works as expected, with all reasoning content sent back to the model each turn. I even wrote a script to strip the CoT, as it bloated the ctx too much. [https://docs.openwebui.com/features/chat-conversations/chat-features/reasoning-models](https://docs.openwebui.com/features/chat-conversations/chat-features/reasoning-models)
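Stripping the CoT before the history goes back to the model could look roughly like the sketch below. This is not the script mentioned above, just an illustration that assumes the reasoning shows up as `<think>...</think>` or `<details type="reasoning" ...>...</details>` inside assistant messages, which are the two shapes that appear in the logged requests in this thread. `strip_reasoning` is a hypothetical helper name.

```python
import re

# Reasoning blocks in either of the two shapes seen in the logged requests.
REASONING_RE = re.compile(
    r"<think>.*?</think>|<details\s+type=\"reasoning\"[^>]*>.*?</details>",
    re.DOTALL,
)


def strip_reasoning(messages):
    """Return a copy of the chat history with reasoning removed from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            stripped = REASONING_RE.sub("", msg["content"]).strip()
            msg = {**msg, "content": stripped}
        cleaned.append(msg)
    return cleaned


history = [
    {"role": "user", "content": "hmm"},
    {"role": "assistant", "content": "<think>long chain of thought</think>Short answer."},
]
trimmed = strip_reasoning(history)
```

User messages are left untouched, and the original list is not modified, so the full history can still be kept for display.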

What's the current best small model?

Posted by Conscious_Nobody9571@reddit | LocalLLaMA | View on Reddit | 52 comments

nickless07@reddit

It has dynamic vision [https://ai.google.dev/gemma/docs/capabilities/vision](https://ai.google.dev/gemma/docs/capabilities/vision) maybe up the limit?

Has anyone set a local LLM up as a language learning tool?

Posted by OrdoRidiculous@reddit | LocalLLaMA | View on Reddit | 24 comments

nickless07@reddit

Yeah, kinda. However, especially with idioms, even the cloud ones still have trouble. I would say test them with a couple of phrases and see which one does best:

"In der Not frisst der Teufel Fliegen" - "Beggars can’t be choosers."

"Darauf gebe ich dir Brief und Siegel" or "Brief und Siegel geben" - "Under hand and seal" or "Signed and sealed"

"Rostiges Dach, feuchter Keller" - "Red in the head, fire in the bed" or similar

"Viele Hunde sind des Hasen Tod" - this is a pretty tough one, as it resembles multiple English phrases: 'Strength in numbers', 'The odds are stacked against him', 'Overwhelmed by numbers'

Tools in Openwebui

Posted by Radiant-Giraffe5159@reddit | LocalLLaMA | View on Reddit | 14 comments

nickless07@reddit

Let it list the available tools. Maybe not all of them get exposed to the model? Log the incoming tokens in LM Studio and check the log. It should give some JSON at the start with all available tools, the system prompt and the first user prompt. Make sure everything there is right; a missing description or similar can affect the tool calls (e.g. it 'sees' the tool but doesn't know what it does). Instruct it to make that exact tool call and let it help you debug why some calls fail and others don't. Maybe the API (of the weather) is misconfigured, maybe the problem is within the tool itself. Have you tried those API calls yourself? What do they return, how do they need to be formatted and so on? Feed the code into the LLM so it can help figure out where it might break.
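Once you have that JSON from the log, a quick check for missing descriptions could look like this. It is a sketch that assumes the OpenAI-style tool layout (`tools` -> `function` -> `name` / `description`); the tool names in the example are invented and `tools_missing_description` is a hypothetical helper.

```python
def tools_missing_description(payload):
    """Return the names of tools whose description is absent or blank."""
    missing = []
    for tool in payload.get("tools", []):
        fn = tool.get("function", {})
        description = fn.get("description") or ""
        if not description.strip():
            missing.append(fn.get("name", "<unnamed>"))
    return missing


# Invented example payload in the shape the request log would show.
logged_request = {
    "tools": [
        {"type": "function", "function": {"name": "get_weather", "description": ""}},
        {
            "type": "function",
            "function": {"name": "web_search", "description": "Search the web for a query."},
        },
    ]
}
problem_tools = tools_missing_description(logged_request)
```

Any name that ends up in `problem_tools` is a tool the model can see but has no explanation for, which is a good first suspect when one tool fails and others work.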

Tools in Openwebui

Posted by Radiant-Giraffe5159@reddit | LocalLLaMA | View on Reddit | 14 comments

How do you estimate total memory usage?

Posted by HornyGooner4402@reddit | LocalLLaMA | View on Reddit | 16 comments

How to change settings on llmster server?

Posted by FalconX88@reddit | LocalLLaMA | View on Reddit | 5 comments

nickless07@reddit

Ahh, yeah, now I understand. Sorry, my English isn't the best. [https://github.com/lmstudio-ai/lms/issues/489](https://github.com/lmstudio-ai/lms/issues/489) LM Studio is more of a local thing, for private use by you and family/friends. I would recommend [llama-server](https://github.com/ggml-org/llama.cpp/tree/master/tools/server) for more control, or waiting until LM Studio has implemented more features.

How to change settings on llmster server?

Posted by FalconX88@reddit | LocalLLaMA | View on Reddit | 5 comments

Have Qwen said anything about further Qwen 3.6 models?

Posted by spaceman_@reddit | LocalLLaMA | View on Reddit | 61 comments

Anyone tried 2 different GPUs in one PC for local LLMs?

Posted by ShadowBannedAugustus@reddit | LocalLLaMA | View on Reddit | 21 comments

Open Models - April 2026 - One of the best months of all time for Local LLMs?

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 153 comments

Sorry if it's not the best place to ask this, of the models in the image, which is the best for (problem solving)/Coding and the best one for studying (ask LLM concepts) ? My PC build is RX 9060 XT 16GB + I3 12100F + 16 GB DDR4 + llama.cpp with Vulkan backend + Linux Mint.

Posted by Badhunter31415@reddit | LocalLLaMA | View on Reddit | 13 comments

nickless07@reddit

That should help a bit. [https://kaitchup.substack.com/p/summary-of-qwen36-gguf-evals-updating](https://kaitchup.substack.com/p/summary-of-qwen36-gguf-evals-updating)

If the AI bubble pops, will GPU prices increase or decrease?

Posted by Mashic@reddit | LocalLLaMA | View on Reddit | 39 comments

nickless07@reddit

Right after the Smartphone bubble pops... They already bought the whole production that the fabs will produce this year. Even if there is something that will pop there is no going back anymore.

Best Adventure Gaming Setup

Posted by thefool00@reddit | LocalLLaMA | View on Reddit | 11 comments

MLX's gone today in newest LM Studio

Posted by maciejb84@reddit | LocalLLaMA | View on Reddit | 3 comments

nickless07@reddit

Click the link - top right has a "Use this model" button, choose LM Studio from the dropdown and it should open in LM Studio ready to select your download.