anthonyg45157

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 79 comments

anthonyg45157@reddit

This is what I'm doing currently with my 3090. I was able to get Q5 at 115k context with the image model loaded on the GPU as well. New llama.cpp builds really help maximize VRAM.

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks

Posted by Interesting-Sock3940@reddit | LocalLLaMA | View on Reddit | 147 comments

anthonyg45157@reddit

It's a cool experiment and good to see a real-world use case comparison... However, the comparison is a little off-kilter because of the context window differences between the models.

Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 549 comments

anthonyg45157@reddit

Damn it 💀 I actually knew about that but didn't know the exact date... a few years before my time, but I definitely learned about it in school. I thought you were hinting at some top secret knowledge about a release coming up 😄

llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 78 comments

For everyone that uses OpenCode / Pi - Heres your promptprocessing fix!

Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 40 comments

anthonyg45157@reddit

How does this issue manifest or show itself in pi? I don't think I've had any issues with prompt processing but I haven't fed any super large files or anything recently

Move to backend sampling for MTP draft path by gaugarg-nv · Pull Request #23287 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 37 comments

anthonyg45157@reddit

I noticed this as well. I think it has something to do with draft p min defaulting to 0 and not being used in early builds. Now it is used, so if you have it set, that could be the issue... On top of that, I'm still noticing some slowdown compared to the original merge, it seems.

Thoughts on "production" model setups

Posted by fuse1921@reddit | LocalLLaMA | View on Reddit | 14 comments

What llamacpp's webui has and what it lacks

Posted by gigachad_deluxe@reddit | LocalLLaMA | View on Reddit | 25 comments

Got MTP + TurboQuant running - Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

Posted by indrasmirror@reddit | LocalLLaMA | View on Reddit | 76 comments

anthonyg45157@reddit

Just checked, it's 3. Tried 5 and it seems worse. Doing a simple prompt that pushes a decent amount of output. Prompt: "Make a rhyme about each state". 3 starts at 60, then drops quickly to 47 and hangs around there, dipping a little. With 5 it starts around 45, then dips to 38. So 3 definitely seems better in my case, but it's not much better than just running stock: 38-40 tok/s vs 45-47.
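
To put the numbers in that comment side by side, here is a minimal sketch of the arithmetic. It only uses the tok/s figures quoted above; the helper names are made up for illustration, and it assumes the "3" and "5" refer to the draft setting discussed in the thread, which the comment itself does not name.

```python
def midpoint(low: float, high: float) -> float:
    """Middle of a reported tok/s range."""
    return (low + high) / 2


def relative_change_pct(baseline_tps: float, candidate_tps: float) -> float:
    """Percent change in throughput relative to the baseline."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100


# Sustained decode speeds reported in the comment (tok/s).
stock = midpoint(38, 40)    # stock run, 38-40 tok/s
draft3 = midpoint(45, 47)   # setting of 3, after the initial drop from 60
draft5 = 38.0               # setting of 5, after the dip from 45

gain_draft3 = relative_change_pct(stock, draft3)
gain_draft5 = relative_change_pct(stock, draft5)

print(f"setting 3 vs stock: {gain_draft3:+.1f}%")
print(f"setting 5 vs stock: {gain_draft5:+.1f}%")
```

On these figures the setting of 3 works out to a gain of roughly 18% over stock, while 5 lands slightly below stock, which matches the "not much better" impression.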

Got MTP + TurboQuant running - Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

Posted by indrasmirror@reddit | LocalLLaMA | View on Reddit | 76 comments

anthonyg45157@reddit

Been trying these on my 3090 and keep getting similar results... Decode starts around 60, then quickly dips to 50, then quickly to 40-45, which doesn't seem much better than regular non-MTP... idk what I'm doing wrong lol

AMA with Nous Research -- Ask Us Anything!

Posted by emozilla@reddit | LocalLLaMA | View on Reddit | 399 comments

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

Posted by sandropuppo@reddit | LocalLLaMA | View on Reddit | 182 comments

An Overnight Stack for Qwen3.6-27B: 85 TPS, 125K Context, Vision - on One RTX 3090 | by Wasif Basharat | Apr, 2026

Posted by AmazingDrivers4u@reddit | LocalLLaMA | View on Reddit | 177 comments

Best config for Qwen3.6 27b / llama.cpp / opencode

Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 110 comments

anthonyg45157@reddit

Yup, same, which means context is being shared with system RAM, I guess? Seems the 35B MoE is best for people who have a decent GPU but a ton of RAM. These dense models can be run but are so slow (depending on your needs).

Best config for Qwen3.6 27b / llama.cpp / opencode

Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 110 comments

anthonyg45157@reddit

Hmmm, I'm only getting 11 tok/s as well with my 3090... It seems both VRAM and system RAM are being used. 11 tok/s is pretty damn slow; I should get around 30-40 on GPU memory only... idk what I'm missing lol
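
The "VRAM and system RAM are both being used" symptom can be sanity-checked with a rough budget. This is a sketch only: the 24 GB figure is the 3090's VRAM, but the weight, KV cache, and overhead sizes below are hypothetical placeholders, not measurements for Qwen3.6 27B or any specific quant, and `vram_budget` is an illustrative helper rather than anything from llama.cpp.

```python
def vram_budget(vram_gb: float, weights_gb: float, kv_cache_gb: float,
                overhead_gb: float = 1.0) -> tuple[float, float]:
    """Return (total GB needed, GB that would not fit in VRAM)."""
    needed = weights_gb + kv_cache_gb + overhead_gb
    spill = max(0.0, needed - vram_gb)
    return needed, spill


# Hypothetical sizes for illustration only; read the real ones from your
# model file size and the memory breakdown your runtime logs at load time.
needed, spill = vram_budget(vram_gb=24.0, weights_gb=19.0, kv_cache_gb=8.0)

if spill > 0:
    print(f"Needs {needed:.1f} GB, about {spill:.1f} GB over the card.")
else:
    print(f"Needs {needed:.1f} GB, fits in VRAM.")
```

If the total comes out over the card's VRAM, whatever doesn't fit ends up in system RAM, which would be consistent with a drop from the expected 30-40 tok/s to around 11. Shrinking the context or using a smaller quant are the usual levers to try.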

Waiting Qwen3.6-27B I have no nails left...

Posted by DOAMOD@reddit | LocalLLaMA | View on Reddit | 95 comments

Qwen3.6 is incredible with OpenCode!

Posted by CountlessFlies@reddit | LocalLLaMA | View on Reddit | 166 comments

GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) - aiming for 64K ctx, 30+ t/s

Posted by Fit-Courage5400@reddit | LocalLLaMA | View on Reddit | 96 comments

Car-wash question and Qwen3.5-27b-Q6

Posted by KringleKrispi@reddit | LocalLLaMA | View on Reddit | 20 comments

If you haven't yet given Gemma 4 a go...do it today

Posted by No-Anchovies@reddit | LocalLLaMA | View on Reddit | 206 comments

anthonyg45157@reddit

You might find this graphic interesting. I had Claude come up with 6 tests based on my personal memories and real-world things I do. I tested Gemma and Qwen, thinking and non-thinking, just to see how they would answer each. Here are the token results: Qwen 3.5 27B thinks sooooo much more than Gemma 4. Interestingly enough, turning off thinking doesn't make it that much worse overall. https://preview.redd.it/kxg5ritwnoug1.jpeg?width=763&format=pjpg&auto=webp&s=bf195863b665996e8163bce81d4543ffa95098e7

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex !

Posted by cviperr33@reddit | LocalLLaMA | View on Reddit | 108 comments

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

anthonyg45157@reddit

Thank you for the information! This generally aligns with what I've noticed as well... Are there any ways to speed up 27B in agentic coding workflows? Maybe I just need to turn thinking off so it feels more responsive...

Local Claude Code with Qwen3.5 27B

Posted by FeiX7@reddit | LocalLLaMA | View on Reddit | 122 comments

I don't get it. Why would Facebook acquire Moltbook? Are their engineers too busy recording a day in the life of a meta engineer and cannot build it in a week or so?!

Posted by SilverRegion9394@reddit | LocalLLaMA | View on Reddit | 84 comments

We collected 135 phrases Whisper hallucinates during silence - here's what it says when nobody's talking and how we stopped it

Posted by Aggravating-Gap7783@reddit | LocalLLaMA | View on Reddit | 95 comments

anthonyg45157@reddit

Very, very neat info. I recently noticed this when making a local transcription app. I really only ever noticed "thank you", probably because of the length of the silence.

PSA: Humans are scary stupid

Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 204 comments

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

anthonyg45157@reddit

I'm not here to really argue what openclaw's target audience is, or really argue at all lol... I simply said it helps bridge a gap, gets people more comfortable with AI and automation, and might even coax them to play around in the terminal...

You kinda just said it: Claude Code and Cursor are not for regular people, but openclaw can be, and it can do things outside of a chat interface. It bridges the gap:

Normies - regular chat interface
Openclaw - the bridge
Devs / hobbyists - Claude Code / Cursor CLI/IDE

To be fair, I don't think openclaw is some groundbreaking technology. It's basically just an LLM with a bunch of accounts linked together. But the delivery and the way they are packaged together is why it's less scary than Claude Code or Cursor... The wires are hidden, which drives adoption and bridges a gap.

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

anthonyg45157@reddit

I was being a little dramatic. I basically meant that it blows my mind how many people I come into contact with are familiar with tech, use it daily for work or pleasure, and don't know how beneficial AI can be in some aspects...

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

anthonyg45157@reddit

Seeing AI actually do stuff without needing to copy and paste from a chat window is the bridge, and that's what openclaw offers... It piques people's interest beyond a chat window. Yes, we could argue similar things can be done with Claude Code or Cursor, but with openclaw it's more of an "all in one package", and many regular people like that.

why is openclaw even this popular?

Posted by Crazyscientist1024@reddit | LocalLLaMA | View on Reddit | 320 comments

anthonyg45157@reddit

The gap between not using AI and using AI... the gap between people embracing AI and the people who aren't... was that not obvious from my comment? Working in the web design and customer support realm, it blows my mind how many people don't use or play around with AI. My scope may be narrow, but I've even seen it on Reddit. Just last night someone asked me to help them with a prompt... there is a gap.

LM Link

Posted by Blindax@reddit | LocalLLaMA | View on Reddit | 40 comments