Enturbulated

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM?

Posted by Acceptable-State-271@reddit | LocalLLaMA | View on Reddit | 33 comments

Enturbulated@reddit

Swapping what's in GPU during a run creates more lag than just using split CPU/GPU inferencing. Figuring out which bits should be loaded to GPU and which left in main RAM is the better route. There have been some posts about selective offloading. Fairly sure it's all been "load the smaller tensor sets to GPU, rest to CPU", which under llama.cpp would mean adding `--override-tensor "([0-9]+).ffn_.*_exps.=CPU"` to the command line. Best of luck.
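To see roughly what that override pattern would catch, here is a minimal Python sketch. The tensor names are hypothetical examples written in the `blk.N.*` style I believe llama.cpp GGUF files use, and the unanchored search is my assumption about how the pattern gets applied, so treat the result as illustrative only.

```python
import re

# The pattern from the comment above, with any markdown escapes removed.
PATTERN = re.compile(r"([0-9]+).ffn_.*_exps.")

# Hypothetical tensor names (assumed naming style, not read from a real file).
TENSOR_NAMES = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_inp.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "output.weight",
]


def split_by_override(names, pattern=PATTERN):
    """Return (cpu, gpu): names the pattern hits are the ones left in main RAM."""
    cpu = [n for n in names if pattern.search(n)]
    gpu = [n for n in names if not pattern.search(n)]
    return cpu, gpu


cpu_tensors, gpu_tensors = split_by_override(TENSOR_NAMES)
print("CPU:", cpu_tensors)
print("GPU:", gpu_tensors)
```

Under those assumptions only the per-expert FFN tensors (the big ones in a MoE) land on the CPU side, while attention and router tensors stay eligible for GPU.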

Qwen3-30B-A3B is magic.

Posted by thebadslime@reddit | LocalLLaMA | View on Reddit | 109 comments

Enturbulated@reddit

Would suggest waiting for imatrix data to become available, then trying a custom quant with the larger layers at q3\_k (or so) and the others at higher precision. See how well that does for you, while leaving some RAM available.

Qwen 3 MoE making Llama 4 Maverick obsolete... 😱

Posted by Cool-Chemical-5629@reddit | LocalLLaMA | View on Reddit | 80 comments

Enturbulated@reddit

We can agree they should, but everyone and their hairdresser's dog cheats on benchmarks in some fashion. Even if it's just a bit of ambiguity in labeling.

Quants are getting confusing

Posted by blaz3d7@reddit | LocalLLaMA | View on Reddit | 14 comments

Enturbulated@reddit

Uploads still in progress on some of those? Borked metadata?

Replace or upgrade 7yr old laptops?

Posted by BeanSticky@reddit | sysadmin | View on Reddit | 127 comments

Enturbulated@reddit

Win11 plus whatever productivity software, plus whatever corporate antimalware and device management tools on top of that on 8GB RAM? Yeah, no, eff that. Bumping to 16GB is the barest possible minimum. Evaluate onboard storage as well, see if faster storage would be helpful. As always, test, test, test.

Why, Microsoft? Why oh why don't you have drivers for Surface laptops in the windows ISO image?

Posted by kaiserh808@reddit | sysadmin | View on Reddit | 90 comments

Enturbulated@reddit

Symptom of a common disease within large orgs - various folks building their own fiefdoms, intra-group rivalries, or even just a lack of communication between those who do know better but can't be fucked to act like it.

Have you tried a Ling-Lite-0415 MoE (16.8b total, 2.75b active) model?, it is fast even without GPU, about 15-20 tps with 32k context (128k max) on Ryzen 5 5500, fits in 16gb RAM at Q5. Smartness is about 7b-9b class models, not bad at deviant creative tasks.

Posted by -Ellary-@reddit | LocalLLaMA | View on Reddit | 68 comments

Enturbulated@reddit

Wouldn't mind if Ling-Plus saw an update as well.

Sand-AI releases Magi-1 - Autoregressive Video Generation Model with Unlimited Duration

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 28 comments

Enturbulated@reddit

Performance trends over the last decade very strongly suggest that ain't gonna happen without some major fundamental changes in the very near future. Moore's Observation ('twas never a law) is no longer holding. There's still room to scale, but how much? Component shrink is running out of headroom, the cost of newer manufacturing processes keeps ballooning, and economies of scale have their own limits.

Deepseek leak

Posted by OGScottingham@reddit | LocalLLaMA | View on Reddit | 30 comments

Enturbulated@reddit

TL;DR - Boilerplate "CyberSecurity is hard and this organization didn't do it right" article with a side order of "This model is particularly scary because it can be jailbroken to say bad things." Yeah, never assume any random org is competent, and run everything you can locally if you think you might need to.

Meta Perception Language Model: Enhancing Understanding of Visual Perception Tasks

Posted by ninjasaid13@reddit | LocalLLaMA | View on Reddit | 27 comments

Enturbulated@reddit

Great, add another variable for how to load the fridge. Optimizing for 'visibility of labels to camera' may well destroy efficient use of space!!1!

Why do we keep seeing new models trained from scratch?

Posted by live_love_laugh@reddit | LocalLLaMA | View on Reddit | 10 comments

Enturbulated@reddit

As I understand things (which is likely underinformed, to say the least) there's a good deal of exploration to be done for how to structure and tune the models. So expect orgs to continue throwing shit at the walls to see what sticks.

What OS are you ladies and gent running?

Posted by No-Report-1805@reddit | LocalLLaMA | View on Reddit | 74 comments

Enturbulated@reddit

Now if we could get release versions a bit more often instead of pretending it's a rolling release ;-) Still, I'll gladly take that over dealing with how much effort some other software vendors expend for the sake of getting in the user's (and admin's) way!

What OS are you ladies and gent running?

Posted by No-Report-1805@reddit | LocalLLaMA | View on Reddit | 74 comments

Enturbulated@reddit

At some point, years back, I just gave up on using RDP cross platform. VNC over SSH works well enough for basic usage ... most of the time anyway. Samba share for moving files as needed, mostly done.

What OS are you ladies and gent running?

Posted by No-Report-1805@reddit | LocalLLaMA | View on Reddit | 74 comments

Enturbulated@reddit

Slackware 4 Lyfe.

Amoral Gemma 3 - QAT

Posted by Reader3123@reddit | LocalLLaMA | View on Reddit | 31 comments

Enturbulated@reddit

Gemma-3 is a vision language model. It can ingest images, but not generate. Potentially useful for automatic captioning of images.

How to run Llama 4 fast, even though it's too big to fit in RAM

Posted by Klutzy-Snow8016@reddit | LocalLLaMA | View on Reddit | 67 comments

Enturbulated@reddit

Sorry, wasn't speaking to interaction batch size or other parameters, but in a more general sense. When a workload with non-compressible data reaches out and grabs most of your available memory, ZRAM swap can sometimes cause performance degradation. Just another variable to check for performance tuning.

How to run Llama 4 fast, even though it's too big to fit in RAM

Posted by Klutzy-Snow8016@reddit | LocalLLaMA | View on Reddit | 67 comments

Enturbulated@reddit

Apparently ZRAM swap is a bad match for LLM data and k/v cache data, as those aren't very compressible.
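A rough way to get a feel for why: compare how well high-entropy bytes compress against repetitive data. This is only a sketch under assumptions. It uses Python's `zlib` rather than whatever compressor a given ZRAM setup runs, and `os.urandom` output is a stand-in for quantized weights or k/v cache, which are not truly random but tend to be close to incompressible.

```python
import os
import zlib

SIZE = 1 << 20  # 1 MiB per sample


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data)) / len(data)


# Stand-in for model weights / k/v cache: high-entropy bytes (an assumption).
noisy = os.urandom(SIZE)

# Stand-in for ordinary, repetitive memory contents.
line = b"the quick brown fox jumps over the lazy dog\n"
repetitive = (line * (SIZE // len(line) + 1))[:SIZE]

noisy_ratio = compression_ratio(noisy)
repetitive_ratio = compression_ratio(repetitive)
print(f"high-entropy ratio: {noisy_ratio:.3f}")
print(f"repetitive ratio:   {repetitive_ratio:.3f}")
```

When the ratio sits near 1.0, compressed swap spends CPU time for essentially no memory saved, which is the situation the comment is warning about.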

Back to Local: What’s your experience with Llama 4

Posted by Balance-@reddit | LocalLLaMA | View on Reddit | 49 comments

Enturbulated@reddit

Fair. Had already redone the conversion at least once with some earlier changes (and currently doing so again for the latest dsv3 changes), will do so again at next look.

Back to Local: What’s your experience with Llama 4

Posted by Balance-@reddit | LocalLLaMA | View on Reddit | 49 comments

Enturbulated@reddit

The biggest issue at last look: attention mechanisms not yet implemented. It was possible to run up to an 8k token context window; past that I was getting crashes. Sometimes before that as well.

Trump administration reportedly considers a US DeepSeek ban

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 238 comments

Enturbulated@reddit

As I commented on another forum a week or so ago... Was just pondering on Gibson as a prophet (a notion which annoys him) and how today's world has become entirely too cyberpunk. Corporate domination, me playing with (possibly soon to be samizdat) 'AI' models, recent news of work towards genetic resurrection of lost animals (Dire Wolves not being \*it\* yet, but certainly a step in that direction), and so on and so on. We got a lot of the bad points, but where's my neural boost and cranial cyberdeck?

Back to Local: What’s your experience with Llama 4

Posted by Balance-@reddit | LocalLLaMA | View on Reddit | 49 comments

Enturbulated@reddit

I primarily use llama.cpp, where llama4 support is still WIP. You can run it, but it's atrocious there. Not yet made time to play with other options.

IBM Power8 CPU?

Posted by An_Original_ID@reddit | LocalLLaMA | View on Reddit | 13 comments

Enturbulated@reddit

Relevant bit from Wikipedia about POWER8 systems:

> "Each Memory Buffer chip has four interfaces allowing to use either [DDR3](https://en.wikipedia.org/wiki/DDR3) or [DDR4](https://en.wikipedia.org/wiki/DDR4) memory at 1600 MHz with no change to the processor link interface. The resulting 32 memory channels per processor allow peak access rate of 409.6 GB/s between the Memory Buffer chips and the DRAM banks. Initially support was limited to 16 GB, 32 GB and 64 GB DIMMs, allowing up to 1 TB to be addressed by the processor. Later support for 128 GB and 256 GB DIMMs was announced,[^(\[19\])](https://en.wikipedia.org/wiki/POWER8#cite_note-redp5137-19)[^(\[21\])](https://en.wikipedia.org/wiki/POWER8#cite_note-8A2232-21) allowing up to 4 TB per processor."

Not sure how much the results could vary based on model range and what's in the thing for memory, but it may have some potential.
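The 409.6 GB/s figure in that quote can be sanity-checked with quick arithmetic. The sketch below assumes each channel moves 8 bytes (64 bits) per transfer and reads the quoted "1600 MHz" as 1600 million transfers per second. Both are my assumptions, not something the quote states.

```python
def peak_bandwidth_gbs(channels: int,
                       mega_transfers_per_s: float,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak in GB/s: channels * transfers/s * bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# 32 channels per processor, as in the quote; 8 bytes per transfer is assumed.
power8_peak = peak_bandwidth_gbs(channels=32, mega_transfers_per_s=1600)
print(f"{power8_peak:.1f} GB/s")
```

That is a theoretical ceiling between the buffer chips and DRAM. What a given box sustains in practice will depend on how many DIMM slots are populated and how the memory is laid out.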

IBM Power8 CPU?

Posted by An_Original_ID@reddit | LocalLLaMA | View on Reddit | 13 comments

Enturbulated@reddit

Can't speak for other tools, but for llama.cpp the cmake setup mentions power10 and power9 as subtypes of powerpc64, and a few more generic catchalls that, as far as I can tell, should cover power8.

glm-4 0414 is out. 9b, 32b, with and without reasoning and rumination

Posted by matteogeniaccio@reddit | LocalLLaMA | View on Reddit | 90 comments

Enturbulated@reddit

First look, I'm getting 1952MiB total for 32k context with f16 k/v cache. That's rather small. Will take some time to evaluate performance.
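For anyone wanting to reproduce a number like that, the usual back-of-envelope k/v cache estimate is 2 (K and V) × layers × KV heads × head dim × context × bytes per element. In the sketch below the layer and head counts (61 layers, 2 KV heads, head dim 128) are my assumption about the 32B config, not something stated in the comment. They do happen to land on the same figure.

```python
def kv_cache_mib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Estimate k/v cache size in MiB; bytes_per_elem=2 corresponds to f16."""
    # One K and one V tensor per layer, hence the factor of 2.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / (1024 ** 2)


# Assumed architecture values; check them against the model's config.json.
cache_mib = kv_cache_mib(n_layers=61, n_kv_heads=2, head_dim=128, n_ctx=32 * 1024)
print(f"{cache_mib:.0f} MiB")
```

The small size comes almost entirely from the low KV head count: with grouped-query attention, the cache scales with KV heads rather than with the full number of attention heads.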

GLM-4-0414 - a THUDM Collection

Posted by Dark_Fire_12@reddit | LocalLLaMA | View on Reddit | 3 comments

Enturbulated@reddit

The model architecture (Glm4ForCausalLM) is not new, so many current tools should already support it. I'm converting to gguf right now to take a look.

The LLaMa 4 release version (not modified for human preference) has been added to LMArena and it's absolutely pathetic... 32nd place.

Posted by PauLBern_@reddit | LocalLLaMA | View on Reddit | 66 comments

Enturbulated@reddit

Wondering how many updates it's going to take before we see Scout and Maverick properly configured, with the various runtimes actually supporting them. Only so many times people will re-download (or re-convert) a model for bad results before moving on.

“Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI“

Posted by rrryougi@reddit | LocalLLaMA | View on Reddit | 236 comments

Enturbulated@reddit

We'll see. Preliminary support was merged in llama.cpp today, currently playing with that. Results with default settings are disappointing, drop temp to zero and it gets better. Should be no surprise that there's not really anything yet for documentation on suggested inference settings. Probably need to do a fair amount of a/b testing. /sadpanda

What are your thoughts about the Llama 4 models?

Posted by internal-pagal@reddit | LocalLLaMA | View on Reddit | 119 comments

Enturbulated@reddit

Model uses \*different layers\* per activation / token generation, so you need to have as much of it loaded as possible, vram > ram > disk.

“Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI“

Posted by rrryougi@reddit | LocalLLaMA | View on Reddit | 236 comments

Enturbulated@reddit

If true, that's sad. I had hopes for a decent MoE in the general size range of Scout. Guess Meta really may have ... *screwed the llama* on this one.

Llama 4 is out and I'm disappointed

Posted by kaizoku156@reddit | LocalLLaMA | View on Reddit | 56 comments

Enturbulated@reddit

... please, do let us know what else you have to say. I'm curious as to your reasoning.

Llama 4 is out and I'm disappointed

Posted by kaizoku156@reddit | LocalLLaMA | View on Reddit | 56 comments

Enturbulated@reddit

There's some layer re-use, the models have a smaller total parameter count than you're thinking.

Llama 4 is out and I'm disappointed

Posted by kaizoku156@reddit | LocalLLaMA | View on Reddit | 56 comments

Enturbulated@reddit

Pretty much only using llama.cpp right now.

Llama 4 was a giant disappointment, let's wait for Qwen 3.

Posted by CreepyMan121@reddit | LocalLLaMA | View on Reddit | 108 comments

Enturbulated@reddit

Possible that providers aren't running optimal settings yet - that's happened enough times with other model releases. Not seeing settings in the model card, which should be standard, and not gone looking for whitepapers yet. If there's no better answer in a few days or a week, that would be sad.

Llama 4 is out and I'm disappointed

Posted by kaizoku156@reddit | LocalLLaMA | View on Reddit | 56 comments

Enturbulated@reddit

"Not even locally runnable" will vary. Scout should fit in under 60GB RAM, though waiting to see how well it runs for me and how the benchmarks line up with end user experience. Hopefully it isn't bad ... give it time to see.

Llama 4 was a giant disappointment, let's wait for Qwen 3.

Posted by CreepyMan121@reddit | LocalLLaMA | View on Reddit | 108 comments

Enturbulated@reddit

If you're going to criticize, you really should \*know\* what you're talking about and insult the model based on actual merit, or lack thereof. You've acknowledged a few corrections to your assumptions in thread so far, and that's good, keep it up! Just remember when discussing the tradeoffs made for any particular model, your use case and constraints won't always match others.

Llama 4 was a giant disappointment, let's wait for Qwen 3.

Posted by CreepyMan121@reddit | LocalLLaMA | View on Reddit | 108 comments

Enturbulated@reddit

Memory quantity, memory bandwidth, and available FLOPS can vary a great deal between devices. And in general GPU/VRAM is more expensive than CPU/RAM right now, more so if one already has the latter on hand. You may think it's a corner case, but a MoE with 109B total and 17B active parameters is probably a better fit than a dense 32B model on more machines than you'd expect.

Llama 4 was a giant disappointment, let's wait for Qwen 3.

Posted by CreepyMan121@reddit | LocalLLaMA | View on Reddit | 108 comments

Enturbulated@reddit

Proposed scaling law for comparing MoE models to dense models: sqrt(MoE\_Total \* MoE\_Active) = Dense\_Total. So 109B with 17B active should be roughly about the same 'smarts' as a densely trained 43B parameter model while having somewhat faster performance. There's a lot of wiggle room in that estimate though, so any solid answers will have to wait until more people have shared their results.
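The rule of thumb above is easy to play with. A minimal sketch follows; the function name is mine, and the formula is just the geometric mean from the comment, a rough heuristic rather than an established law.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_b * active_b)


# 109B total with 17B active, as discussed above.
estimate = dense_equivalent_b(109, 17)
print(f"~{estimate:.0f}B dense-equivalent")
```

Plugging in other total/active pairs gives a quick feel for where a given MoE might sit, with the same caveat about wiggle room.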

3 bit llama 4 (109B) vs 4 bit llama 3.3 (70B)

Posted by Glittering-Bag-4662@reddit | LocalLLaMA | View on Reddit | 8 comments

Enturbulated@reddit

Quants aren't out yet. Check back in 10 or 20 milliseconds.

Llama 4 announced

Posted by nderstand2grow@reddit | LocalLLaMA | View on Reddit | 75 comments

Enturbulated@reddit

The scout model should be \~60GB at Q4. MoE means it'll be faster on CPU than some would expect. Will be a bit to see exact performance, and testing required to see how well it takes quantization. Yeah, yeah, RAM isn't free but it's a hell of a lot cheaper than VRAM right now.
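The ~60GB figure falls out of simple arithmetic: parameter count times bits per weight, divided by 8. The sketch below assumes about 4.5 bits per weight for a Q4-class quant, which is my assumption. Real files vary with the quant mix, and k/v cache and runtime overhead come on top.

```python
def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for params_b billion parameters."""
    return params_b * bits_per_weight / 8


# 109B parameters at an assumed ~4.5 bits per weight.
scout_q4_gb = quant_size_gb(109, 4.5)
print(f"~{scout_q4_gb:.0f} GB")
```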

Llama 4 announced

Posted by nderstand2grow@reddit | LocalLLaMA | View on Reddit | 75 comments

Enturbulated@reddit

The Scout model falls right into the general range I've been looking for, at 109B params and MoE. Show. Me. The. Benchmarks.

Any good options for running a local LLM that can analyze a directory of images and summarize them like this? (Gemini 2.5)

Posted by LegendOfAB@reddit | LocalLLaMA | View on Reddit | 18 comments

Enturbulated@reddit

Given today's release of the lighter-weight gemma3 quantization-aware trained 27b model, I'm messing about with that a bit under llama.cpp. So far it seems to do okay generating image descriptions. Will be trivial to script it to caption a folder of images. Building a workflow to classify images? May have just given myself a new project.

Bailing Moe is now supported in llama.cpp

Posted by MaruluVR@reddit | LocalLLaMA | View on Reddit | 18 comments

Enturbulated@reddit

There have been a few fixes posted for various issues since last posting...

rope scaling (not yet tested): [https://github.com/ggml-org/llama.cpp/pull/12678](https://github.com/ggml-org/llama.cpp/pull/12678)

tokenizer behavior (seems to fix imatrix calcs): [https://github.com/ggml-org/llama.cpp/pull/12677](https://github.com/ggml-org/llama.cpp/pull/12677)

compute\_imatrix: tokenizing the input ..

compute\_imatrix: tokenization took 310.475 ms

compute\_imatrix: computing over 246 chunks with batch\_size 512

compute\_imatrix: 14.71 seconds per pass - ETA 1 hours 0.28 minutes

\[1\]6.7043,

woohoo!

Bailing Moe is now supported in llama.cpp

Posted by MaruluVR@reddit | LocalLLaMA | View on Reddit | 18 comments

Enturbulated@reddit

Ling Plus is holding up okay with what I can test so far... Playing with a custom 120GB quant ranging between q3\_k and q6\_k depending on layer type and it's not getting too incoherent. Bumping the larger layers up to q4\_k (or just using a standard q4\_k quant) takes it from 'slow' to 'glacial' on my hardware. Taking the larger layers down to q2\_k does make it noticeably dumber. If or when we can get imatrix data, that'll give some wiggle room for further optimization.

Bailing Moe is now supported in llama.cpp

Posted by MaruluVR@reddit | LocalLLaMA | View on Reddit | 18 comments

Enturbulated@reddit

For Ling-Lite, after some testing playing with rope scaling options, all I can get it to do is drop tokens and be less coherent than normal. welp.

Bailing Moe is now supported in llama.cpp

Posted by MaruluVR@reddit | LocalLLaMA | View on Reddit | 18 comments

Enturbulated@reddit

Getting the same problem with lite and plus models. Never done imat calcs before, no idea what's going on. Grar.

Bailing Moe is now supported in llama.cpp

Posted by MaruluVR@reddit | LocalLLaMA | View on Reddit | 18 comments

Enturbulated@reddit

Now to see if team mradermacher gets that sweet, sweet imatrix.dat posted before I can finish the calcs on it. (Spoiler: They probably will. This is not going quickly for me.)

Bailing Moe is now supported in llama.cpp

Posted by MaruluVR@reddit | LocalLLaMA | View on Reddit | 18 comments

Enturbulated@reddit

You can try extending context with RoPE. Note that YaRN is disabled for this model. Best of luck.

LLMs over torrent

Posted by aospan@reddit | LocalLLaMA | View on Reddit | 46 comments

Enturbulated@reddit

Binary patch on model files seems like it might not save much transfer? Unless people get into the habit of distributing finetunes as LoRA, but I'm told that has its own issues.

Best LLM to run locally on integrated graphics?

Posted by Typical-Length-1405@reddit | LocalLLaMA | View on Reddit | 17 comments

Enturbulated@reddit

Hrm, interesting. Might have to take a look again at some point.

Best LLM to run locally on integrated graphics?

Posted by Typical-Length-1405@reddit | LocalLLaMA | View on Reddit | 17 comments

Enturbulated@reddit

There are options to use Intel integrated graphics for LLM inferencing, but because you're still dealing with the same memory bandwidth as CPU, don't expect any performance boost. It may even be slower, depending on tool selection and configuration. Still, if you really need to offload the work from CPU for some reason, it might help for that. Been a while since I looked at this ... If you're interested maybe start here ( [https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) ) and look over the options. If not using llama.cpp, best of luck.
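The bandwidth point can be put in rough numbers. A common rule of thumb (my addition, not something from the linked docs) is that token generation has to read the active weights once per token, so memory bandwidth divided by bytes read per token gives a ceiling on tokens per second. The figures below are hypothetical placeholders, not measurements.

```python
def max_tokens_per_s(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Bandwidth-bound upper limit on generation speed (ignores compute and cache)."""
    return bandwidth_gbs / gb_read_per_token


# Hypothetical numbers: ~50 GB/s shared system memory, ~4 GB of weights per token.
SHARED_BANDWIDTH_GBS = 50.0
MODEL_GB = 4.0

cpu_ceiling = max_tokens_per_s(SHARED_BANDWIDTH_GBS, MODEL_GB)
igpu_ceiling = max_tokens_per_s(SHARED_BANDWIDTH_GBS, MODEL_GB)  # same memory bus
print(f"CPU ceiling:  {cpu_ceiling:.1f} tok/s")
print(f"iGPU ceiling: {igpu_ceiling:.1f} tok/s")
```

Since the integrated GPU sits on the same memory bus, the ceiling is identical under this model. Any difference in practice would come from compute-bound stages like prompt processing, or from software overhead.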