anthonyg45157

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 79 comments

[-]

anthonyg45157@reddit

This is what I'm currently doing with my 3090. I was able to get Q5 at 115K context with the image model loaded on the GPU as well. New llama.cpp builds really help maximize VRAM.
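For anyone wondering why context length eats so much VRAM: with plain attention, the KV cache grows linearly with context. Here's a rough sketch of the arithmetic. The layer count, KV head count, and head dimension below are placeholders I made up for illustration, not Qwen 3.6 27B's actual architecture, and models that mix in other attention types will need less than this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / (1024 ** 3)


# Hypothetical architecture values, for illustration only.
example = dict(n_layers=64, n_kv_heads=8, head_dim=128)

# f16 cache: 2 bytes per element.
f16_gib = to_gib(kv_cache_bytes(n_ctx=115_000, **example))

# q8_0 cache: ggml stores 32 int8 values plus one f16 scale, i.e. 34 bytes per 32 elements.
q8_gib = to_gib(kv_cache_bytes(n_ctx=115_000, bytes_per_elem=34 / 32, **example))

print(f"f16 KV cache:  {f16_gib:.1f} GiB")
print(f"q8_0 KV cache: {q8_gib:.1f} GiB")
```

Swap in the real values from the model's metadata to get a number worth trusting. The point is only that halving the bytes per element roughly halves the cache.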

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks

Posted by Interesting-Sock3940@reddit | LocalLLaMA | View on Reddit | 147 comments

[-]

anthonyg45157@reddit

Any specific changes you're referring to, or just in general? I know there have been a lot of llama.cpp updates recently.

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks

Posted by Interesting-Sock3940@reddit | LocalLLaMA | View on Reddit | 147 comments

[-]

anthonyg45157@reddit

It's a cool experiment, and it's good to see a real-world use-case comparison... However, the comparison is a little off-kilter because of the context window differences between the models.

Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 549 comments

[-]

Dammit 💀 I actually knew about that but didn't know the exact date... a few years before my time, but I definitely learned about it in school. I thought you were hinting at some top-secret knowledge about a release coming up 😄

Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 549 comments

[-]

anthonyg45157@reddit

What happened?

llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 78 comments

[-]

anthonyg45157@reddit

Damn, this might just let me pull the image model back to GPU. I offloaded it to CPU to maximize context with MTP.

For everyone that uses OpenCode / Pi - Heres your promptprocessing fix!

Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

anthonyg45157@reddit

How does this issue manifest in Pi? I don't think I've had any issues with prompt processing, but I haven't fed it any super large files or anything recently.

Move to backend sampling for MTP draft path by gaugarg-nv · Pull Request #23287 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 37 comments

[-]

anthonyg45157@reddit

I noticed this as well. I think it has something to do with draft p min defaulting to 0 and not being used in early builds, but now it is, so if you have that set it could be the issue... On top of that, I'm still noticing some slowdown compared to the original merge, it seems.

Thoughts on "production" model setups

Posted by fuse1921@reddit | LocalLLaMA | View on Reddit | 14 comments

[-]

anthonyg45157@reddit

Word! Appreciate the input

Thoughts on "production" model setups

Posted by fuse1921@reddit | LocalLLaMA | View on Reddit | 14 comments

[-]

anthonyg45157@reddit

Curious about your hands-on experience with these models at Q4 vs. full precision... What KV cache quant do you run, if any? Any thoughts?

What llamacpp's webui has and what it lacks

Posted by gigachad_deluxe@reddit | LocalLLaMA | View on Reddit | 25 comments

[-]

anthonyg45157@reddit

Any specific ones you'd recommend? I see a few.

What llamacpp's webui has and what it lacks

Posted by gigachad_deluxe@reddit | LocalLLaMA | View on Reddit | 25 comments

[-]

anthonyg45157@reddit

Oh dang, thank you!!!

What llamacpp's webui has and what it lacks

Posted by gigachad_deluxe@reddit | LocalLLaMA | View on Reddit | 25 comments

[-]

anthonyg45157@reddit

Agreed. I wish Open WebUI had a context count like llama.cpp. The llama.cpp UI is so quick: minimal but fast.

Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

Posted by indrasmirror@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

anthonyg45157@reddit

Very good point, actually. I need to compare more using the same prompts and a good mix of real work.

Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

Posted by indrasmirror@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

anthonyg45157@reddit

Just checked, it's 3. Tried 5 and it seems worse.

I'm doing a simple prompt that pushes a decent amount of output. Prompt: "Make a rhyme about each state"

3 starts at 60, then drops quickly to 47, hangs around there, and will dip a little.

With 5 it starts around 45, then dips to 38.

So 3 definitely seems better in my case, but it's not much better than just running stock: 38-40 tok/s vs. 45-47.
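To put a number on "not much better", here's a quick sketch that turns the two tok/s ranges above (stock 38-40, MTP 45-47) into a worst-case and best-case speedup. The function name is just something I made up for this.

```python
def speedup_range(baseline, candidate):
    """Return (worst, best) speedup ratios for two (low, high) tok/s ranges."""
    b_lo, b_hi = baseline
    c_lo, c_hi = candidate
    # Worst case: slowest candidate run vs. fastest baseline run, and vice versa.
    return c_lo / b_hi, c_hi / b_lo


stock = (38.0, 40.0)  # tok/s without MTP, from my runs above
mtp = (45.0, 47.0)    # tok/s with MTP, from my runs above

worst, best = speedup_range(stock, mtp)
print(f"MTP speedup: {worst:.2f}x to {best:.2f}x")
```

That works out to somewhere between roughly 1.1x and 1.2x, which is a long way from the numbers in the post title.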

Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

Posted by indrasmirror@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

anthonyg45157@reddit

Tried 3 and 4 but I'm learning and idk if 4 is even a valid option 😂

Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

Posted by indrasmirror@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

anthonyg45157@reddit

Been trying these on my 3090 and keep getting similar results... Decode starts around 60, then quickly dips to 50, then quickly to 40-45, which doesn't seem much better than regular non-MTP... idk what I'm doing wrong lol

AMA with Nous Research -- Ask Us Anything!

Posted by emozilla@reddit | LocalLLaMA | View on Reddit | 399 comments

[-]

anthonyg45157@reddit

I understand that. Just sharing my thoughts on how they feel with Hermes, in case others don't realize this.

AMA with Nous Research -- Ask Us Anything!

Posted by emozilla@reddit | LocalLLaMA | View on Reddit | 399 comments

[-]

anthonyg45157@reddit

This is the way!

AMA with Nous Research -- Ask Us Anything!

Posted by emozilla@reddit | LocalLLaMA | View on Reddit | 399 comments

[-]

anthonyg45157@reddit

Damn gonna need Hermes to summarize all these comments 🤣

AMA with Nous Research -- Ask Us Anything!

Posted by emozilla@reddit | LocalLLaMA | View on Reddit | 399 comments

[-]

anthonyg45157@reddit

Qwen 3.6, 27B or 35B. 35B feels so much more responsive.

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

Posted by sandropuppo@reddit | LocalLLaMA | View on Reddit | 182 comments

[-]

anthonyg45157@reddit

Yeah, I noticed that too. Click comments and you'll see it all.

An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 | by Wasif Basharat | Apr, 2026

Posted by AmazingDrivers4u@reddit | LocalLLaMA | View on Reddit | 177 comments

[-]

anthonyg45157@reddit

RemindMe 3 days "check this out again"

Best config for Qwen3.6 27b / llama.cpp / opencode

Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 110 comments

[-]

anthonyg45157@reddit

Yup, same, which means context is being shared with system RAM, I guess? Seems the 35B MoE is best for people who have a decent GPU but a ton of RAM. These dense models can be run but are so slow (depending on your needs).

Best config for Qwen3.6 27b / llama.cpp / opencode

Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 110 comments

[-]

anthonyg45157@reddit

Gonna check into it more. Working and tinkering at the same time is rough 😂

Best config for Qwen3.6 27b / llama.cpp / opencode

Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 110 comments

[-]

anthonyg45157@reddit

Hmmm, I'm only getting 11 tok/s as well with my 3090... It seems VRAM and system RAM are both being used. 11 tok/s is pretty damn slow, I should get around 30-40 on GPU RAM only... idk what I'm missing lol
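A back-of-the-envelope sketch of why spilling into system RAM hurts this much on a dense model: decode is mostly memory-bandwidth bound, so each token costs roughly one read of every weight from wherever it lives. The model size and system RAM bandwidth below are assumptions for illustration. Only the 3090's roughly 936 GB/s figure is a real spec, and this ignores KV cache reads and compute, so treat the results as upper bounds.

```python
def decode_tok_per_s(gpu_gb, cpu_gb, gpu_bw_gbps, cpu_bw_gbps):
    """Upper-bound decode speed when weights are split between VRAM and system RAM."""
    # Time per token = time to stream the GPU part + time to stream the CPU part.
    seconds_per_token = gpu_gb / gpu_bw_gbps + cpu_gb / cpu_bw_gbps
    return 1.0 / seconds_per_token


GPU_BW = 936.0   # GB/s, RTX 3090 spec
CPU_BW = 50.0    # GB/s, assumed dual-channel system RAM
MODEL_GB = 17.0  # assumed size of a ~Q4 27B dense model

all_gpu = decode_tok_per_s(MODEL_GB, 0.0, GPU_BW, CPU_BW)
split = decode_tok_per_s(MODEL_GB - 3.0, 3.0, GPU_BW, CPU_BW)

print(f"all on GPU:        {all_gpu:.0f} tok/s upper bound")
print(f"3 GB in system RAM: {split:.0f} tok/s upper bound")
```

Even a few GB on the slow side dominates the per-token time, which would line up with the kind of drop I'm seeing.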

Best config for Qwen3.6 27b / llama.cpp / opencode

Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 110 comments

[-]

anthonyg45157@reddit

What system? This runs so slowly on my 3090, but it seems it's set up to split with system RAM.

Waiting Qwen3.6-27B I have no nails left...

Posted by DOAMOD@reddit | LocalLLaMA | View on Reddit | 95 comments

[-]

anthonyg45157@reddit

I expect 3.6 27b to be slow as it will be a dense model but I wonder if and how they can improve speed.

Qwen3.6 is incredible with OpenCode!

Posted by CountlessFlies@reddit | LocalLLaMA | View on Reddit | 166 comments

[-]

anthonyg45157@reddit

Damn, I'm running the UD-Q4_K_XL and fighting context 😂 Might need to switch.

GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) — aiming for 64K ctx, 30+ t/s

Posted by Fit-Courage5400@reddit | LocalLLaMA | View on Reddit | 96 comments

[-]

anthonyg45157@reddit

How do you have this set up to share context with RAM? Can you share your settings, if you're using llama.cpp or something similar?

Car-wash question and Qwen3.5-27b-Q6

Posted by KringleKrispi@reddit | LocalLLaMA | View on Reddit | 20 comments

[-]

anthonyg45157@reddit

Do no thinking....

If you haven't yet given Gemma 4 a go...do it today

Posted by No-Anchovies@reddit | LocalLLaMA | View on Reddit | 206 comments

[-]

anthonyg45157@reddit

You might find this graphic interesting. I had Claude come up with 6 tests based on my personal memories and real-world things I do. I tested Gemma and Qwen, thinking and non-thinking, just to see how they would answer each.

Here are the token results: Qwen 3.5 27B thinks sooooo much more than Gemma 4. Interestingly enough, turning off thinking doesn't make it that much worse overall.

https://preview.redd.it/kxg5ritwnoug1.jpeg?width=763&format=pjpg&auto=webp&s=bf195863b665996e8163bce81d4543ffa95098e7

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex !

Posted by cviperr33@reddit | LocalLLaMA | View on Reddit | 108 comments

[-]

anthonyg45157@reddit

Having very good results with these tips

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex !

Posted by cviperr33@reddit | LocalLLaMA | View on Reddit | 108 comments

[-]

anthonyg45157@reddit

Trying these now, been having looping with standard settings even with the jinja template

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

[-]

anthonyg45157@reddit

Thank you for the information! This generally aligns with what I've noticed as well.... Are there any ways to speed up 27b with agentic type coding workflows? Maybe I just need to turn thinking off so it feels more responsive...

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

[-]

anthonyg45157@reddit

Was about to download, but your cpp comment saved me 😆 Can't wait for this to progress more. I'm torn between 27B and 35B with a 3090.

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

[-]

anthonyg45157@reddit

Wen wen

Local Claude Code with Qwen3.5 27B

Posted by FeiX7@reddit | LocalLLaMA | View on Reddit | 122 comments

[-]

anthonyg45157@reddit

Any more info on how this orchestration layer is set up?

I don’t get it. Why would Facebook acquire Moltbook? Are their engineers too busy recording a day in the life of a meta engineer and cannot build it in a week or so?!

Posted by SilverRegion9394@reddit | LocalLLaMA | View on Reddit | 84 comments

[-]

anthonyg45157@reddit

Data.... simple

We collected 135 phrases Whisper hallucinates during silence — here's what it says when nobody's talking and how we stopped it

Posted by Aggravating-Gap7783@reddit | LocalLLaMA | View on Reddit | 95 comments

[-]

anthonyg45157@reddit

Thank you for the tip, I will check it out.

We collected 135 phrases Whisper hallucinates during silence — here's what it says when nobody's talking and how we stopped it

Posted by Aggravating-Gap7783@reddit | LocalLLaMA | View on Reddit | 95 comments

[-]

anthonyg45157@reddit

Very, very neat info. I recently noticed this when making a local transcription app. I really only ever noticed "thank you", probably because of the length of the silence.

PSA: Humans are scary stupid

Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 204 comments

[-]

anthonyg45157@reddit

Upvote must be true

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

[-]

anthonyg45157@reddit

I'm not here to argue about what OpenClaw's target audience is, or really to argue at all lol... I simply said it helps bridge a gap, gets people more comfortable with AI and automation, and might even coax them into playing around in the terminal...

You kinda just said it: Claude Code and Cursor are not for regular people, but OpenClaw can be, and it can do things outside of a chat interface... it bridges the gap.

Normies - regular chat interface
OpenClaw - the bridge
Devs/hobbyists - Claude Code / Cursor CLI/IDE

To be fair, I don't think OpenClaw is some groundbreaking technology. It's basically just an LLM with a bunch of accounts linked together. But the delivery and the way it's all packaged together is why it's less scary than Claude Code or Cursor... The wires are hidden, which drives adoption and bridges a gap.

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

[-]

anthonyg45157@reddit

I was being a little dramatic. I basically meant that it blows my mind how many people I come into contact with are familiar with tech, use it daily for work or pleasure, and don't know how beneficial AI can be in some aspects...

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

[-]

anthonyg45157@reddit

Seeing AI actually do stuff without needing to copy and paste from a chat window is the bridge, and it's what OpenClaw offers... It piques people's interest beyond a chat window.

Yes, we could argue similar things can be done with Claude Code or Cursor, but with OpenClaw it's more of an "all in one package", and many regular people like that.

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

[-]

anthonyg45157@reddit

The gap between not using AI and using AI... the gap between people embracing AI and the people not... was that not obvious from my comment? Working in the web design and customer support realm, it blows my mind how many people don't use or play around with AI. My scope may be narrow, but I've even seen it on Reddit. Just last night someone asked me to help them with a prompt... there is a gap.

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

[-]

anthonyg45157@reddit

It's bridging the gap for many people, and some are just having fun. Yes, there's some hype, but bridging the gap is the biggest benefit I see.

LM Link

Posted by Blindax@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

anthonyg45157@reddit

So dope! Now they need a phone app