yeah-ok

Me visiting this sub

Posted by Scutoidzz@reddit | LocalLLaMA | View on Reddit | 139 comments

yeah-ok@reddit

Yup, for me AI has gone from being undependable/unsustainable cloud bs to basically real technology courtesy of llama.cpp and qwen3.6 models. Didn't rewire my house, bought a 780m with 32GB (for about $370 a bit over half a year ago when things were just starting to heat up) and hacked the improvements into a private llama.cpp fork as needed. The Qwen3.6 models can -actually- code, of course it needs domain understanding, of course it's less powerful than Claude 4.x, but, but, it's as real as things get, it runs locally, I've got a backup. Things are good.

More Gemma 4 models incoming

Posted by Deep-Vermicelli-4591@reddit | LocalLLaMA | View on Reddit | 165 comments

yeah-ok@reddit

Since I'm on the 32GB ride I'm rooting for a new 26B Gemma that -actually- manages to one-up the Qwen 3.6-35B model (and not in the I'm-sort-of-better-at-creative-writing type way)

llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 78 comments

yeah-ok@reddit

Gotta say tho.. the latest builds are SIGNIFICANTLY slower for me than the one I built off am17an repo a while back.. on latest official I'm on (average: 17.2 tg/s) whereas prior custom build comes in at (average: 22.5 tg/s) - this is no-mtp run with basic ngram-map-k4v (4/18) prediction in place. 780m 32gb mini-pc.
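To put the two averages quoted above side by side, here is a minimal sketch of the arithmetic. The function and variable names are made up for illustration; only the two tg/s figures come from the comment.

```python
def relative_change(old_tps: float, new_tps: float) -> float:
    """Return the change from old_tps to new_tps as a percentage of old_tps."""
    return (new_tps - old_tps) / old_tps * 100.0


# Averages quoted in the comment above (token generation, tokens per second).
custom_build_tps = 22.5    # older custom build
official_build_tps = 17.2  # latest official build

change = relative_change(custom_build_tps, official_build_tps)
print(f"{change:.1f}% change in token generation speed")
```

On those numbers the official build comes out roughly a quarter slower than the custom one.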

I implemented Laguna (XS.2) as a model in Llama.cpp

Posted by linuxid10t@reddit | LocalLLaMA | View on Reddit | 10 comments

yeah-ok@reddit

Oh, please fill that model card out with some details on the model, even if you let Laguna fill in based on your prior blog post I'll forgive it. Just anything but blank, blank's the territory of drive-by HF posters that can't be trusted to be worth the bandwidth!

llama.cpp server have built-in native tools (exec_shell, edit_file, etc.)

Posted by srigi@reddit | LocalLLaMA | View on Reddit | 49 comments

yeah-ok@reddit

Another seldom talked about feature flag is (after the hf integration) the "--offline" param, worth it for us taking-local-seriously ppl!

magic incantation to get llama-bench to work with MTP ?

Posted by jdchmiel@reddit | LocalLLaMA | View on Reddit | 8 comments

yeah-ok@reddit

This is the right reply, I do a "cmake ... --target llama-server" build these days to only get llama-server, the rest is noise in terms of what I need.

Quick note on sudden performance loss when running GGUFs

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 5 comments

yeah-ok@reddit (OP)

Yeah, it would be great to have some quick sanity checks and early/quick exit on startup rather than compromised generation! I would say it's worth checking "Files and versions" on HF to grab the sha256/xet hashes to check before re-download; may well save some time & bandwidth!
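A minimal sketch of that hash check, assuming you have copied the expected SHA-256 from the model page by hand. The function names are hypothetical; only Python's standard `hashlib` is used, and the file is read in chunks so a multi-GB GGUF does not have to fit in memory.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare the local file's digest with a hash pasted from the model page."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If `matches_expected(Path("model.gguf"), "<hash from the model page>")` returns `True`, the local file is intact and a re-download would not fix anything.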

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

Posted by janvitos@reddit | LocalLLaMA | View on Reddit | 109 comments

Qwen is cooking hard

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 225 comments

Qwen 3.7 droped on Qwen Chat

Posted by Foxiya@reddit | LocalLLaMA | View on Reddit | 221 comments

RDNA3 Flash Attention fix just dropped by llama.cpp b9158

Posted by Bulky-Priority6824@reddit | LocalLLaMA | View on Reddit | 10 comments

Is there a big gap between Q4 and Q6 on Qwen3.6?

Posted by vick2djax@reddit | LocalLLaMA | View on Reddit | 78 comments

yeah-ok@reddit

same exp, once I killed the ctk/ctv flags I never went back, better quality and oddly enough also quantity in the sense that my tgs went up rather than down (I'm on 780m-vulkan-linux so who knows, maybe atypical compared to regular cuda setup)

we really all are going to make it, aren't we? 2x3090 setup.

Posted by RedShiftedTime@reddit | LocalLLaMA | View on Reddit | 164 comments

yeah-ok@reddit

I get the logic but isn't there a very real market here rather than niche within niche? Bet the lot of us would hoover up consumer-only cards sold for ai use via kickstarter or similar in next to no time. It would be guaranteed cash moneys for a company willing to get something within 3090 territory going with 32gb at a decent price point. Even the Chinese manufacturers could get in on this if they could get a clean supply...

we really all are going to make it, aren't we? 2x3090 setup.

Posted by RedShiftedTime@reddit | LocalLLaMA | View on Reddit | 164 comments

yeah-ok@reddit

Well, let them fry I say; then they'll flipping understand that serving the global market is where real stability and long term investment should go rather than into unicorn dust that can evaporate quicker up the nose of a VC recipient than you can possibly imagine. Until the market gets this we are going to have to get creative but.. seeing what this community is doing already that shouldn't be too hard of a nut to crack!

we really all are going to make it, aren't we? 2x3090 setup.

Posted by RedShiftedTime@reddit | LocalLLaMA | View on Reddit | 164 comments

yeah-ok@reddit

It is ridiculous though isn't it? It's the sheer speculative capacity of enterprise that makes this current situation a "win" for enterprise and a loss for the computer-owning population at large. Since the "people" in sheer numbers are so so so much more numerous and resilient a base as compared to the fickle structure of corporate enterprises (or even state-enterprises, doesn't matter really) there should 100% be a way to give the market evaluation to this market that it actually deserves. If anyone could crack this from a financing standpoint the consumer (and let's face it, the investors) could win massively big on it. And the end result would be far greater global resilience rather than having billions riding on one company or another... serving the multitude will always win in statistical terms over the unicorns when it comes to long term stability and payout (yes, the payout bit is where said finance wizardry needs to happen)

VS Code's new "Agents window" lets you use local AI models. Still requires an Internet connection and a Github Copilot plan (because we can't have nice things)

Posted by _wsgeorge@reddit | LocalLLaMA | View on Reddit | 73 comments

yeah-ok@reddit

One can pray and hope - the fork would really come into its own then (it's already my daily driver but bet it would attract an even larger audience!)

MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

Posted by ai-infos@reddit | LocalLLaMA | View on Reddit | 80 comments

yeah-ok@reddit

I noticed that too, a lot of the MTP code was done as fast inline code and now that it's been made safer/proper it's become slower by quite the margin.

Decoupled Attention from Weights - Gemma 4 26B

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 26 comments

yeah-ok@reddit (OP)

Perhaps you are right. Until I've personally run and experimented with larql/vindex to gain practical experience of it I will withdraw from chat re its perceived benefits or the lack of them!

Will there be any more Qwen3.6 series models?

Posted by cafedude@reddit | LocalLLaMA | View on Reddit | 102 comments

Qwen3.6 35b-a3b 🤯

Posted by EffectiveMedium2683@reddit | LocalLLaMA | View on Reddit | 118 comments

yeah-ok@reddit

I've been strict on --no-reasoning lately and having plenty of success with one-shot programming extensions for pi agent..etc..etc, think we have to remember that the top-k is in a sense a selection out of what is already a latent thought process in the model.
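For anyone unsure what the top-k step refers to, here is a minimal sketch of generic top-k sampling over a list of logits. This is the textbook technique, not llama.cpp's actual sampler code, and the function names are made up for illustration.

```python
import math
import random


def top_k_probs(logits: list[float], k: int) -> dict[int, float]:
    """Keep the k highest-logit token ids and renormalise them with a softmax."""
    if k <= 0 or not logits:
        raise ValueError("need at least one logit and k >= 1")
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = logits[ranked[0]]  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - peak) for i in ranked}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_top_k(logits: list[float], k: int, rng: random.Random) -> int:
    """Draw one token id from the renormalised top-k distribution."""
    probs = top_k_probs(logits, k)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]
```

The point of the sketch: the sampler only picks among candidates the model has already scored, so everything it can choose from is the output of the forward pass.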

Decoupled Attention from Weights - Gemma 4 26B

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 26 comments

yeah-ok@reddit (OP)

> MLPs

You might well be right but isn't it being partially mitigated by the vindex format? I do understand that all MLPs are FFNs, but not all FFNs are MLPs ... still though this must be part of the equation.

How long for llama.cpp official support of MTP?

Posted by Manaberryio@reddit | LocalLLaMA | View on Reddit | 50 comments

vLLM ROCm has been added to Lemonade as an experimental backend

Posted by jfowers_amd@reddit | LocalLLaMA | View on Reddit | 93 comments

Decoupled Attention from Weights - Gemma 4 26B

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 26 comments

yeah-ok@reddit (OP)

I thought so too when I first engaged with the topic but the negativity from a good part of the audience on this thread put me off from pursuing it any further. After more reading I still think the larql system is on to something novel and potentially awesome - one of the feedback points in this thread is that this is literally just RPC (see llamacpp docs if ignorant like me) but after more research this seems like a misunderstanding; RPC cannot split attention from weights the way the larql vindex format claims to do. Think there's something to be said for this whole effort and I'll stay tuned to what https://github.com/chrishayuk/larql gets up to.. who can't feel a tingle of excitement with commands such as those found under the "Run attention locally, FFN on another machine" headline on github...?

Decoupled Attention from Weights - Gemma 4 26B

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 26 comments

yeah-ok@reddit (OP)

OK, llama.cpp is a sprawling ecosystem indeed, never heard of it until today! So.. does it make sense performance wise to put weights somewhere else on the LAN and let my workstation handle the attention layer alone via RPC.. or is the performance penalty too high? Honestly never saw this discussed before, would love to see practical examples!

Decoupled Attention from Weights - Gemma 4 26B

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 26 comments

yeah-ok@reddit (OP)

One of the amazing outcomes of this is that low-ram high-compute consumer cards like the 12GB 5070 would essentially be way overpowered for most models since it suddenly "only" needs to run 2-4gb of attention layers. The rest could presumably sit under the table on a "cheap" external xeon with 128gb DDR4 to hold the weights!? Interconnect via highspeed regular tcp/ip over ethernet & bob could be your uncle.

Decoupled Attention from Weights - Gemma 4 26B

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 26 comments

yeah-ok@reddit (OP)

> RPC

As far as I can make out (via https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md) RPC seems focused on running distributed GPU compute on the attention layer, whereas this larql decoupling focuses on keeping latency low by having the GPU as primary and distributing the weights themselves onto x local devices (could also be internet-scale but latency seems to kill that off at the moment).

Heretic 1.3 released: Reproducible models, integrated benchmarking system, reduced peak VRAM usage, broader model support, and more

Posted by -p-e-w-@reddit | LocalLLaMA | View on Reddit | 80 comments

yeah-ok@reddit

I understand this stance on a purely philosophical level but are there good benchmarks or similar to corroborate this point at scale?! I've seen some stuff published but nothing I really can refer to as a smoking gun.

PS5’s can now be hacked to run Linux - perhaps some potential for local inference?

Posted by Thrumpwart@reddit | LocalLLaMA | View on Reddit | 76 comments

Introducing the IBM Granite 4.1 family of models (3B/8B/30B)

Posted by abkibaarnsit@reddit | LocalLLaMA | View on Reddit | 40 comments

I'm done with using local LLMs for coding

Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 810 comments

yeah-ok@reddit

>> couldn't you just create embeddings for those 6, cluster them, create an embedding for the new email ...

The key here is the phrasing, "just" might be a bit of a stretch for most people, can you point to practical steps needed to do this (i.e. not theory or overview but actual terminal commands)?
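As a rough idea of what the quoted suggestion boils down to, here is a minimal sketch under loud assumptions: the vectors are toy 3-dimensional stand-ins for real embeddings (which would come from whatever embedding model you run), the groups are labelled by hand instead of being found by a clustering step, and all names are hypothetical. It assigns a new item to the group whose centroid is closest by cosine similarity.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Average a group of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_label(new_vec: list[float], labelled: dict[str, list[list[float]]]) -> str:
    """Return the label whose centroid is most similar to new_vec."""
    centroids = {label: centroid(vecs) for label, vecs in labelled.items()}
    return max(centroids, key=lambda label: cosine(new_vec, centroids[label]))


# Toy data: made-up vectors standing in for embeddings of existing emails.
examples = {
    "invoices": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "newsletters": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
print(nearest_label([0.85, 0.15, 0.05], examples))
```

The hard part in practice is the bit this sketch skips: producing good embeddings and deciding what the groups should be.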

Forgive my ignorance but how is a 27B model better than 397B?

Posted by No_Conversation9561@reddit | LocalLLaMA | View on Reddit | 286 comments

yeah-ok@reddit

I think "So it’s not just about Dense vs. MoE. It’s also about progress." is strongly backed up by the fact that the new Qwen3.6-35b-moe model is almost on par with the Qwen3.5-27b-dense, which was claimed to be mega far ahead of the Qwen3.5-35b-moe due to its denseness; that denseness has now been caught up with via moe improvements soooo. Yeah.

The Karpathy Loop - can we please get this running on llama-server (pointed at it's own code base)?!

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 2 comments

Web OS result from Qwen3.6 35B is by far the best I tested in my laptop

Posted by Idontknow3728@reddit | LocalLLaMA | View on Reddit | 24 comments

Asking the pertinent local LLM questions: "Is he alive or dead, has he thoughts within his head?"

Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 1 comments

yeah-ok@reddit (OP)

Gotta say this makes for excellent background music as I'm tuning flags on llama-server for yet another vision-related experiment.

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Posted by PerceptionGrouchy187@reddit | LocalLLaMA | View on Reddit | 117 comments

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost

Posted by zylskysniper@reddit | LocalLLaMA | View on Reddit | 151 comments

yeah-ok@reddit

Yeeeah.. but, have people actually done basic comparisons with say Kimi K2.5? In my testing, via Kilo Code, Kimi gives much better (concise and understandable) output when it comes to Golang projects whereas GLM5.1 uses 3 times more tokens and delivers over-complicated solutions. I'm well aware that this could be the harness playing up due to new-new model (or them needing very different prompting, in my tests I used exact same prompts) but.. it's a data point regardless and I for one am putting GLM5.1 on the backburner for a while before I bother testing again.

Marco-Mini (17.3B, 0.86B active) and Marco-Nano (8B, 0.6B active) by Alibaba

Posted by AnticitizenPrime@reddit | LocalLLaMA | View on Reddit | 51 comments

yeah-ok@reddit

Is that clear cut truth? I know a lot of these things are more-so empirically verified these days due to the speed of how things are moving.. one would in a sense need very clear comparable training conditions to be able to call the shots on this.. but maybe it's been done and I'm just not in the know: would appreciate any and all refs to be had!

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

Marco-Mini (17.3B, 0.86B active) and Marco-Nano (8B, 0.6B active) by Alibaba

Posted by AnticitizenPrime@reddit | LocalLLaMA | View on Reddit | 51 comments

yeah-ok@reddit

I still don't see why the multi-language push is so hard with all the models currently on the market. Get it really right in one language (English or Chinese) and all the rest can follow gradually - no need to spread thin with a product that lacks depth capability from the beginning.

Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under Llama.cpp

Posted by AzerbaijanNyan@reddit | LocalLLaMA | View on Reddit | 26 comments

yeah-ok@reddit

Yeah, can tell you I would have liked a 64gb DDR5 7200MHz low latency kit setup with the 8700G. Think the ability to overclock and optimize the mem/motherboard base frequency would absolutely rock versus the 7840/780m I got now (limited on the RAM front to 5600MHz). Then RAM shortage happened 🤷

Tips: remember to use -np 1 with llama-server as a single user

Posted by ea_man@reddit | LocalLLaMA | View on Reddit | 44 comments

yeah-ok@reddit

Yup, I'm on Vulkan backend on both, ROCm has never worked for me for whatever perverse reason (days wasted trying, only to find that the best possible I could get out of it was both unstable and perf-wise less than what Vulkan delivers out of the box!)

Dual DGX Sparks vs Mac Studio M3 Ultra 512GB: Running Qwen3.5 397B locally on both. Here's what I found.

Posted by trevorbg@reddit | LocalLLaMA | View on Reddit | 243 comments

Tips: remember to use -np 1 with llama-server as a single user

Posted by ea_man@reddit | LocalLLaMA | View on Reddit | 44 comments

yeah-ok@reddit

Annoyingly this makes no difference on my AMD 780m, it's running Qwen3.5-35B-A3B-4KM (unsloth) at 21 TPS under Linux Mint with unified mem set via amdgpu gttsize ... -np 1 or none.. makes no difference. Oddly enough LM Studio manages to push 26 TPS with hardly any tweaking and no way can I get it replicated under llama-server ... argh.

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt

Posted by wadeAlexC@reddit | LocalLLaMA | View on Reddit | 76 comments

yeah-ok@reddit

Could it be hardware related?! I'm on a 32GB 780m integrated AMD system running with Vulkan under Linux and have -never- seen those particular issues!

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt

Posted by wadeAlexC@reddit | LocalLLaMA | View on Reddit | 76 comments

yeah-ok@reddit

Honestly this, just forget about inference frameworks that are too complex to be reasoned about. llama.cpp is excellent for that one reason alone! Only additional thing to cultivate is a habit of putting settings (and prompts, depending on need) into sh/bat files so that it becomes super easy to test new versions and compare tok/sec etc.etc.
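The same habit can be sketched in Python rather than sh/bat: keep one settings dict per experiment and turn it into a command line. The helper name, binary path and model path below are hypothetical; the flags shown are `-m` plus the `-np 1` and `--offline` options mentioned elsewhere in these threads.

```python
import shlex


def build_command(binary: str, settings: dict[str, object]) -> list[str]:
    """Turn a flag -> value mapping into an argv list.

    True means a bare switch, False/None means skip the flag,
    anything else is passed as the flag's value.
    """
    argv = [binary]
    for flag, value in settings.items():
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        else:
            argv.extend([flag, str(value)])
    return argv


# Hypothetical baseline; swap the binary path to compare two builds.
baseline = {"-m": "models/example.gguf", "-np": 1, "--offline": True}
print(shlex.join(build_command("./llama-server", baseline)))
```

Keeping the dict in version control next to the measured tok/sec makes it easy to tell whether a slowdown came from the build or from a changed flag.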