viperx7

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 77 comments

viperx7@reddit

I have llama.cpp running Qwen 27B Q8 with image support and MTP 4 at 220k ctx (KV cache isn't quantised), all on 48GB VRAM. If you drop image support you can squeeze in 250k ctx. If you drop MTP you can get 262k ctx with image support on the same hardware. But in my opinion, sacrificing 42k ctx for twice the speed is worth it.
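
A rough way to see why context length competes with everything else for VRAM: for a plain full-attention transformer, an unquantised KV cache grows linearly with context. The sketch below uses that textbook formula. The layer count, KV head count and head dimension are placeholder values picked for illustration, not the real architecture of this model (hybrid-attention models cache fewer layers, so check the model's own config before trusting any number).

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in GiB for a plain full-attention transformer.

    The leading factor of 2 covers one K and one V tensor per cached layer.
    bytes_per_elem=2 corresponds to an unquantised f16 cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 1024**3


# Hypothetical architecture numbers, chosen only to show the scaling.
HYPOTHETICAL = dict(n_layers=16, n_kv_heads=4, head_dim=256)

for ctx in (220_000, 250_000, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx, **HYPOTHETICAL):.2f} GiB")
```

Whatever the real numbers are, the shape is the same: every extra token of context costs a fixed number of bytes, so dropping the vision projector or the MTP head frees room for a predictable amount of extra context.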

High VRAM local coding model — still Qwen 3.6 27B?

Posted by Generic_Name_Here@reddit | LocalLLaMA | View on Reddit | 138 comments

viperx7@reddit

35B is a good model, but it occasionally misses some things, especially on long-context tasks and on tasks that require understanding an existing large project and adding a new feature. 35B will do the job, but it takes more time and more back-and-forth despite being faster. So for long-running agentic coding tasks I will take 27B @ 30t/s over 35B @ 100t/s. I would consider 35B if I have to parse through a huge amount of data, like giving it a huge document and just asking it questions.
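
To put a number on the speed side of that trade-off, here is a small sketch that compares pure generation time. The two speeds are the ones quoted above; the token counts are hypothetical, and the sketch deliberately ignores prompt processing and the human time spent on extra back-and-forth, which is usually the bigger cost.

```python
def wall_clock_seconds(total_tokens, tokens_per_second):
    """Time spent generating total_tokens at a steady speed."""
    return total_tokens / tokens_per_second


def break_even_ratio(slow_tps, fast_tps):
    """How many times more tokens the faster model can emit before it
    loses to the slower model on generation time alone."""
    return fast_tps / slow_tps


SLOW_TPS, FAST_TPS = 30, 100      # 27B and 35B speeds from the comment
tokens_27b = 20_000               # hypothetical tokens for 27B to finish a task

print(f"break-even: {break_even_ratio(SLOW_TPS, FAST_TPS):.2f}x more tokens")
for extra in (1.0, 2.0, 4.0):     # hypothetical overhead from retries
    t_fast = wall_clock_seconds(tokens_27b * extra, FAST_TPS)
    t_slow = wall_clock_seconds(tokens_27b, SLOW_TPS)
    print(f"35B needs {extra:.0f}x tokens: {t_fast:.0f}s vs 27B {t_slow:.0f}s")
```

On raw generation the faster model has a lot of headroom, so the case for 27B rests on the retries and review time that this sketch leaves out.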

High VRAM local coding model — still Qwen 3.6 27B?

Posted by Generic_Name_Here@reddit | LocalLLaMA | View on Reddit | 138 comments

viperx7@reddit

So true, they really cooked with the 27B. If intelligence per parameter were a metric, this one would be at the top. The only limitation was that it was slow, but now even that seems to be going away: this beast of a model can run at 100t/s.

Switched from OpenCode to Pi - What Settings/Plugins would you recommend?

Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 105 comments

viperx7@reddit

I don't like that most of those temporary commits will never be used and the git repo has to carry them, and they persist from one session to the next. I do realize that I can just tell the agent to clean those up, but this is such a default and idiomatic feature that I think they should ship a good version that works. Plus, what if you are not working in a git repo? Maybe you had to make some change to your zshrc file and decided you would like to revert it.

Switched from OpenCode to Pi - What Settings/Plugins would you recommend?

Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 105 comments

viperx7@reddit

Well, I will give it a try. I think that whatever agent you use, this should be the default. Pi agent seems a little too modular, and it's true that I can just build the feature I like. But that should be the case for new things I want that are specific to me, not for very common and established features like rewind. What's the point if everyone has to reinvent their own wheel?

Switched from OpenCode to Pi - What Settings/Plugins would you recommend?

Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 105 comments

viperx7@reddit

Aren't we using Pi because it's efficient? This is just the most inefficient way to handle this. It does nothing and consumes context. What a waste.

Switched from OpenCode to Pi - What Settings/Plugins would you recommend?

Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 105 comments

viperx7@reddit

As I said, it is trash and adds unnecessary commits to the git repo, and it outright doesn't work with non-git projects. Tried them all.

Switched from OpenCode to Pi - What Settings/Plugins would you recommend?

Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 105 comments

viperx7@reddit

I just hate not being able to revert file changes, and all the plugins that implement this mess with your git repo. I just dropped Pi after that.

How long for llama.cpp official support of MTP?

Posted by Manaberryio@reddit | LocalLLaMA | View on Reddit | 50 comments

viperx7@reddit

I just merged the MTP branch into master and resolved the conflicts using Codex. And it works very well: speed ranges between 80 and 110 t/s for 27B Q8 with 220k context.

Google is making local AI available to mainstream users ;)

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 161 comments

viperx7@reddit

Maybe they want to offload the processing to user devices, because processing every page every user ever visits in datacenters would be wildly expensive. Forcing users to do all the processing will give them the data they want, and the electricity cost can be offloaded to the user. So now you won't just be paying with your data but with your electricity bill as well, all the while they get to collect more precise data to sell to advertisers.

Qwen3.6-27B vs Coder-Next

Posted by Signal_Ad657@reddit | LocalLLaMA | View on Reddit | 158 comments

viperx7@reddit

Theoretically they should be the same. It's just that FP8 models:

- support all the new features with vLLM and SGLang, like MTP and DFlash
- don't make you wait for llama.cpp to implement the model; you can start using it right away
- until recently were almost guaranteed to give faster speeds (this might need some clarification)

Disadvantages:

- you can't offload layers to the CPU, so no mixed inference (you'd better have lots of VRAM)
- vLLM takes ages to load the model into memory

Qwen3.6-27B vs Coder-Next

Posted by Signal_Ad657@reddit | LocalLLaMA | View on Reddit | 158 comments

viperx7@reddit

I think the most important and meaningful metric is: given X amount of VRAM, what are the best results one can get? For 24GB VRAM the answer can be something like:

- most intelligence at reasonable speed: Qwen3.6 27B
- fastest tokens per second: Qwen 3.6 35B MoE
- all the intelligence I can get at all costs: maybe MiniMax with half of the model offloaded to CPU

Qwen3.6-27B vs Coder-Next

Posted by Signal_Ad657@reddit | LocalLLaMA | View on Reddit | 158 comments

viperx7@reddit

I believe Qwen3.6 27B is so small that it makes sense to run it at FP8; maybe it will do better that way. Given that 4-bit qwen-coder-next takes 45-49 GB, I think you should compare it with 27B Q8, which is 28GB. Maybe it will be a little fairer then.
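
The sizes quoted here follow from simple arithmetic: parameters times bits per weight. The sketch below is a back-of-the-envelope check. GGUF Q8_0 stores blocks of 32 int8 weights plus one f16 scale, which works out to 8.5 bits per weight. Treating the 27B model as exactly 27 billion parameters is an assumption for illustration; real files add metadata and keep some tensors at higher precision.

```python
def weights_gb(n_params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes), ignoring file metadata."""
    return n_params_billion * bits_per_weight / 8


# Q8_0 block: 32 int8 weights (32 bytes) + one f16 scale (2 bytes) = 34 bytes.
Q8_0_BPW = 34 * 8 / 32   # 8.5 bits per weight
FP8_BPW = 8.0

print(f"27B @ Q8_0: ~{weights_gb(27, Q8_0_BPW):.1f} GB")
print(f"27B @ FP8 : ~{weights_gb(27, FP8_BPW):.1f} GB")
```

The Q8_0 figure lands close to the 28GB quoted above, which is why Q8 versus a 45-49 GB 4-bit model is the fairer memory-for-memory comparison.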

Qwen3.6-27B vs Coder-Next

Posted by Signal_Ad657@reddit | LocalLLaMA | View on Reddit | 158 comments

viperx7@reddit

For someone like you who is drowning in VRAM it might seem so, but for most people that's not how things work. For example, even if someone has 48GB VRAM, the choice they face is:

- Qwen 3.6 27B @ Q8 with 264k unquantized context
- Qwen 3 Coder Next @ Q4, still offloading to CPU, and maybe they can do 264k context

When choosing Coder Next:

- prompt processing will suffer
- it won't be as smart as the version you tested at Q8

And this is the best-case scenario. A lot of people don't have 48GB VRAM and try running these models on a 24GB VRAM machine, and then we will talk about how far your shipping things goes. And if you are not mentioning which quant you are using for the model and for the context, you can take your 20 hours on your RTX 6000 PRO and get lost, because it doesn't mean shit. Yes, that 20 hours of testing is pointless (being a little mean just because you are mean with your meme).

Actual comparison between locally ran Qwen-3.6-27B and proprietary models

Posted by netikas@reddit | LocalLLaMA | View on Reddit | 72 comments

viperx7@reddit

I run Qwen3.6 27B on 48GB VRAM and it is very capable; every now and then it surprises me with the things it can do. For a 48GB VRAM setup this is just way too good: you get the Q8 model with full context, no context quantisation, plus vision.

Devs using Qwen 27B seriously, what's your take?

Posted by Admirable_Reality281@reddit | LocalLLaMA | View on Reddit | 240 comments

viperx7@reddit

I’ve been using this setup for a variety of tasks over the past couple of weeks, and honestly, it just works. That said, I’m still a little hesitant, mainly because it’s a local model and 27B is definitely smaller than what Opus is running on. But the more I use it, the more I realize I might be discriminating against it just because it’s running on my own hardware. I catch myself over-monitoring prompts and outputs because I’m subconsciously worried about it making mistakes, which, ironically, it hardly ever does.

**My setup:**

* **Model:** Qwen 3.6 27B (Q8 quant)
* **Context:** 262k tokens (`ctk` format, no context compression)

I’ll be the first to admit that cloud models have the edge on wildly complex or highly specialized problems, or on things that require a lot of knowledge. But I’m not solving quantum puzzles every day, and for my actual workflow, this local setup has been more than enough. I mainly use the model for agentic workflows and coding.

Duality of r/LocalLLaMA

Posted by HornyGooner4402@reddit | LocalLLaMA | View on Reddit | 122 comments

Qwen 3.6 27B in Claude Code says it will do something then stops and prompts for user reply (not failing a tool call)

Posted by jettoblack@reddit | LocalLLaMA | View on Reddit | 29 comments

viperx7@reddit

Bro, this FP8 model is so annoying. It was very painful to get it to work, and it still had a lot of stupid issues like this one. The issue you are facing is due to some chat template shenanigans. There was just so much pain that I went back to running the Q8 version with ik_llama.cpp with `-sm graph`. It gives 42t/s or so, but at least it works.

If you are still looking for more pain, [https://github.com/allanchan339/vLLM-Qwen3.5-27B](https://github.com/allanchan339/vLLM-Qwen3.5-27B) is a guide somebody wrote. Be warned: it will solve almost all the issues, but you will still occasionally see slowdowns.

vLLM options that worked for me:

```
"Qwen3.6 27B FP8":
  description: "vllm FP8 ⭐"
  env:
    - "CUDA_VISIBLE_DEVICES=0,1,2"
    - "CUDA_VISIBLE_DEVICES=0,1"
    - "VLLM_WORKER_MULTIPROC_METHOD=spawn"
    - "NCCL_P2P_DISABLE=1"
    - "VLLM_TEST_FORCE_FP8_MARLIN=1"
    - "VLLM_USE_FLASHINFER_SAMPLER=1"
    - "VLLM_ALLOW_LONG_MAX_MODEL_LEN=1"
    - "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512"
  cmd: |
    vllm serve ${models_path}/Qwen3.6/27B/Qwen3.6-27B-FP8/
    --enable-prefix-caching
    --tensor-parallel-size 2
    --gpu-memory-utilization 0.95
    --max-num-seqs 2
    --max-num-batched-tokens 8192
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-chunked-prefill
    --enable-force-include-usage
    --no-scheduler-reserve-full-isl
    --host 0.0.0.0
    --port ${PORT}
    --served-model-name "Qwen3.6 27B FP8"
    --max-model-len 125000
    --dtype bfloat16
    --reasoning-parser qwen3
    --tool-call-parser qwen3_coder
    --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens" : 8}'
    --chat-template qwen3.5-enhanced.jinja
    --default-chat-template-kwargs '{"preserve_thinking": false}'
    --override-generation-config '{"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty":0.0, "repetition_penalty":1.0}'
    # --max-model-len 219520
    # --language-model-only
  checkEndpoint: /health
  ttl: 6000
  aliases:
    - "gpt-5"
```

The config given above sometimes hits peak speeds of 140t/s on a 4090+3090, but context is only 125k.

Post Your Qwen3.6 27B speed plz

Posted by Ok-Internal9317@reddit | LocalLLaMA | View on Reddit | 234 comments

viperx7@reddit

Qwen3.6 27B FP8 on vLLM

- Hardware: 4090 + 3090 Ti
- MTP: 16
- Ctx: 125k
- Speed: averages 85t/s, ranging from 50t/s at worst to a 141t/s peak

I am still wondering whether increasing MTP to this extent is even a good idea (I don't see any disadvantage).
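
One way to reason about how far to push the MTP count is the standard speculative-decoding estimate: if each drafted token is accepted independently with probability `a`, then drafting `k` tokens yields `(1 - a^(k+1)) / (1 - a)` tokens per verification step on average. The sketch below just evaluates that formula. The acceptance rate of 0.8 is a made-up value, and the independence assumption is optimistic because later draft positions are usually accepted less often; the cost of producing the drafts is ignored too.

```python
def expected_tokens_per_step(accept_rate, num_draft_tokens):
    """Expected tokens emitted per target-model step, assuming every
    drafted token is accepted independently with the same probability."""
    if accept_rate >= 1.0:
        return float(num_draft_tokens + 1)
    return (1 - accept_rate ** (num_draft_tokens + 1)) / (1 - accept_rate)


ACCEPT_RATE = 0.8  # hypothetical; measure it from your own server stats

for k in (1, 2, 4, 8, 16):
    gain = expected_tokens_per_step(ACCEPT_RATE, k)
    print(f"draft tokens = {k:>2}: ~{gain:.2f} tokens per step")
```

Under this model the gain flattens out towards `1 / (1 - a)`, so going from 8 to 16 draft tokens buys very little unless acceptance is very high, while each extra draft token still costs some compute.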

Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6

Posted by dionysio211@reddit | LocalLLaMA | View on Reddit | 177 comments

viperx7@reddit

Just a heads-up: I have a similar setup with 2x 24GB cards. I am running 27B Q8 with image support + 125k context (no KV cache quantisation), and speeds are similar to yours. So you are potentially leaving a lot of performance on the table.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G

Posted by HyPyke@reddit | LocalLLaMA | View on Reddit | 212 comments

viperx7@reddit

That's not exactly the case for large codebases, where an agent needs to open multiple large files. Every time it reads a big chunk you will see a dip in performance. And continuing a session with 100k+ ctx is just not an option when prompt processing is slow.

With 48gb vram, on vllm, Qwen3.6-27b-awq-int4 has only 120k ctx (fp8), is that normal?

Posted by Historical-Crazy1831@reddit | LocalLLaMA | View on Reddit | 16 comments

viperx7@reddit

I have a setup similar to yours. Given that you can run 27B Q8 with 256k context using llama.cpp, the speed gain from vLLM isn't worth it, especially for AWQ 4-bit. I just hope `-sm tensor` for 27B gets fixed in llama.cpp; then we can have the best of both worlds.

What speed is everyone getting on Qwen3.6 27b?

Posted by Ambitious_Fold_2874@reddit | LocalLLaMA | View on Reddit | 255 comments

viperx7@reddit

With llama.cpp on a 4090+3090 setup I get TG 29t/s and PP 2500t/s. I am struggling with setting up vLLM and can't seem to figure out the optimal flags and the exact model to use. If anyone has a similar setup and would like to share their config, I will be thankful.

Qwen3.6-27B released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 142 comments

viperx7@reddit

Okay, okay, even if it doesn't beat Opus 4.5 outside these benchmarks, I will be happy if it's an improvement over 3.5 27B. And if its improvement follows the same trajectory as the 35B's did, we are golden. I won't be able to run models bigger than this one anyway.

Qwen 3.6 27B is out

Posted by NoConcert8847@reddit | LocalLLaMA | View on Reddit | 609 comments

Waiting Qwen3.6-27B I have no nails left...

Posted by DOAMOD@reddit | LocalLLaMA | View on Reddit | 95 comments

Qwen3.6 GGUF is so good for debugging.

Posted by _BigBackClock@reddit | LocalLLaMA | View on Reddit | 19 comments

Qwen3.6-35B-A3B released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 721 comments

Best Local LLMs - Apr 2026

Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 364 comments

Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding?

Posted by hedsht@reddit | LocalLLaMA | View on Reddit | 99 comments

Why bother with local LLMs?

Posted by West-Currency-4423@reddit | LocalLLaMA | View on Reddit | 39 comments

Weekend project with Intel B70s

Posted by dev_is_active@reddit | LocalLLaMA | View on Reddit | 41 comments

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

viperx7@reddit

u/PaMRxR u/jikilan_ u/eribob

#### Qwen3.5 27B

##### Tensor config (speed 42t/s)

`llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -sm tensor -m Qwen3.5-27B-Q8_0.gguf`

This leaves 4.9 GB of VRAM free.

*Note:* up until yesterday I was able to load the mmproj with 27B in this config, but since another llama.cpp update I no longer can (I hope it will be fixed soon, as enough VRAM is available).

##### Alternative config with mmproj (speed 29t/s)

`llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -m Qwen3.5-27B-Q8_0.gguf --mmproj mmproj-F16.gguf`

#### Qwen3.5 35B

`llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -m Qwen_Qwen3.5-35B-A3B-Q8_0.gguf --mmproj mmproj-Qwen_Qwen3.5-35B-A3B-f16.gguf -ts 22,24`

*Correction:* speeds with Qwen3.5 35B are 128t/s, not 120t/s. When starting with a context of 100k it goes down to 98t/s.

*PCIE config*: 4090 at 5GB/s, 3090ti @ 25GB/s

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

viperx7@reddit

I tried to get this thing to work and it was a mess. Can you tell me how to use it? I have a 3090+4090 in my system and couldn't get it to work with vLLM.

backend-agnostic tensor parallelism has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 60 comments

viperx7@reddit

Yesterday I tried to figure out vLLM and, man, I have no idea what exactly I need to do. Turns out I can't run the FP8 models because I have a 3090 in my system, which won't work. I was able to load an AWQ model, but with half the context llama.cpp allows, and the speed wasn't that much better.

People with low VRAM, I have something for you that won't help.

Posted by Uncle___Marty@reddit | LocalLLaMA | View on Reddit | 32 comments

Slower Means Faster: Why I Switched from Qwen3 Coder Next to Qwen3.5 122B

Posted by Fast_Thing_7949@reddit | LocalLLaMA | View on Reddit | 84 comments

viperx7@reddit

Can you tell me what quants you are using for both of these models? Currently I am running 27B at Q8, and I'm wondering whether 122B at Q4 would be better on my machine. Generation speed for both of these models is kind of close, but there is a huge difference in prompt processing speed: 27B at 2500t/s vs 122B at 190t/s. From what I gather from a couple of comments here and there, 122B is better than 27B. Now my question is: at what quant does 122B start beating 27B Q8?
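
To show what that prompt processing gap means in practice, here is a small sketch that converts the two PP speeds quoted above into waiting time. The prompt sizes are hypothetical, and the sketch assumes the whole prompt is processed from scratch with no prefix caching.

```python
def prefill_seconds(prompt_tokens, pp_tokens_per_second):
    """Time to process a prompt from scratch at a steady PP speed."""
    return prompt_tokens / pp_tokens_per_second


# Prompt processing speeds quoted in the comment, in tokens per second.
PP_SPEEDS = {"27B Q8": 2500, "122B": 190}

for prompt_tokens in (10_000, 100_000):   # hypothetical prompt sizes
    for name, speed in PP_SPEEDS.items():
        secs = prefill_seconds(prompt_tokens, speed)
        print(f"{prompt_tokens:>7} tokens on {name:<6}: {secs:7.1f} s")
```

At agentic-coding context sizes the slower model spends minutes on prefill where the faster one spends seconds, even though generation speed is similar.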

My Experience with Qwen 3.5 35B

Posted by viperx7@reddit | LocalLLaMA | View on Reddit | 80 comments

viperx7@reddit (OP)

Well, it's all subjective and depends on the task at hand. It's difficult to give any recommendations without knowing the actual use case. Personally I am more interested in coding agents, and for my use case I moved from 24 gigs to 36 gigs and now 48 gigs. I have to say the bigger your model is, the better your experience will be in pretty much all cases. Generally adding more VRAM is good, and you will be fine getting more than one card. As for the Llama models, almost all of them are severely outdated today.

My Experience with Qwen 3.5 35B

Posted by viperx7@reddit | LocalLLaMA | View on Reddit | 80 comments

viperx7@reddit (OP)

Bro, for a 70B dense model the Q2 quant takes 26GB just for the model, and then you will need more space for the context. Now combine that slow speed with a highly lobotomized model and a small context size, and that's a recipe for disaster. If running a 70B dense model is your sole goal, then I have to tell you anything less than 48 gigs won't be all that good. But if you are okay with going to 30B sizes, I think you can load a very good quality quantised Qwen 3.5 model in 25GB VRAM. Remember, it's not just about the size of the model but also the size of the context, which determines how long you can talk to the model.

Moonshot says Cursor Composer was authorized

Posted by davernow@reddit | LocalLLaMA | View on Reddit | 54 comments

viperx7@reddit

That's my point as well: if they can do 75% of the training themselves, then why use the 25% from the base? The incentive for Cursor is to appear not beholden to any other company for models, and to have their own model, which moves them away from being just a wrapper. A lot of tweets initially were also meming about a code editor team of 40 people beating the big labs by making a superior model (I suspect that's the image Cursor wanted to put out).

Moonshot says Cursor Composer was authorized

Posted by davernow@reddit | LocalLLaMA | View on Reddit | 54 comments

viperx7@reddit

If someone has an account on together compute, they can just go ahead and see whether there is an option to fine-tune Kimi K2.5 or not, right?

Moonshot says Cursor Composer was authorized

Posted by davernow@reddit | LocalLLaMA | View on Reddit | 54 comments

viperx7@reddit

But in another tweet, the Cursor team claimed that they have fine-tuned the model, that only one-fourth of the total training compute is from the original model, and that they did the remaining three-fourths of the training themselves. I find it a little hard to believe!

Moonshot says Cursor Composer was authorized

Posted by davernow@reddit | LocalLLaMA | View on Reddit | 54 comments

viperx7@reddit

Kimi made this post way after the whole controversy broke out. I think if it had been official from the start, they would have congratulated them at least within the first six hours of launch, if not earlier. I think what happened was that after everything broke out, the Cursor team went ahead and talked to Kimi, and then they agreed on this ex post facto. Also, it is very hard to believe that the people who trained the model had no information about a big fine-tune of their model being integrated into one of the largest platforms.

Apparently Minimax 2.7 will be closed weights

Posted by tarruda@reddit | LocalLLaMA | View on Reddit | 54 comments

viperx7@reddit

I just hope that they didn't change their mind after witnessing the Kimi K2.5 rebranding by Cursor. It would make sense: why spend money and resources when someone is going to steal your work and call it their own?

My Experience with Qwen 3.5 35B

Posted by viperx7@reddit | LocalLLaMA | View on Reddit | 80 comments

viperx7@reddit (OP)

I had a 3060 and I would advise you not to get it. It gives you VRAM for sure, but its processing power leaves much to be desired. I used to have a 4090+3060 setup and the 3060 was way, way slower. I would say if you spend a little bit more and get a 3090, it will be worth it. And yes, with 24GB VRAM you can't run a quantized 70B (the speeds you will get will be so abysmal that it won't be worth it).