Qwen 3.6 27B BF16 on RTX6000 Blackwell - One Shot Test
Posted by Demonicated@reddit | LocalLLaMA | View on Reddit | 26 comments
Since it's hard to translate benchmarks into "is this model good at work?", I decided to run a very simple test with the new Qwen 3.6 dense model release. It's super chatty in LM Studio (where I have it running), but it works. My prompt:
"Create an html file that i can open that has the complete game of pacman with the first level."
It took 41 seconds @ 25 tok/sec and gave me a snippet that almost worked right off the bat. There was a runtime issue:
pacman.html:679 Uncaught TypeError: Cannot read properties of undefined (reading '0')
at drawMap (pacman.html:679:27)
at draw (pacman.html:838:5)
at gameLoop (pacman.html:866:5)
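For what it's worth, this class of error in a tile-based game is almost always an out-of-bounds index into the map: `map[row]` comes back `undefined`, so `map[row][col]` throws exactly the "Cannot read properties of undefined (reading '0')" seen above. I can't see the generated file, so the names below are hypothetical, but a bounds guard like this is the usual fix:

```javascript
// Hypothetical sketch of the failure class (not the model's actual code):
// drawMap indexes map[row][col], and one index runs past the array, so
// map[row] is undefined and map[row][col] throws. Guarding both
// dimensions before indexing avoids the crash.
const map = [
  [1, 1, 1],
  [1, 0, 1],
  [1, 1, 1],
];

function tileAt(map, row, col) {
  // Treat anything outside the map as a wall tile (1).
  if (row < 0 || row >= map.length) return 1;
  const r = map[row];
  if (col < 0 || col >= r.length) return 1;
  return r[col];
}
```

Calling `tileAt` everywhere instead of raw indexing also makes Pac-Man's wraparound tunnels easier to handle later.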
Another 51 seconds later it had finished spitting out the complete html file again with the fix. It definitely likes to re-write the whole file instead of just the updated sections. After the next run there was a movement glitch. Another 50 seconds later and I had a really good pacman clone running with the first level completed.
Thoughts:
I think this could absolutely be a daily driver. Had I used my normal flow, creating a design document first and iterating on it before implementation, I have little doubt this model could handle the implementation.
Realistically, I work in huge codebases where context is king, so I think my experiment for this next week will be to use Sonnet/Opus in Plan mode to produce detailed design docs and then use this local model to do all the implementation. Seems like the natural way to survive the ever-shrinking subscription limits these days.
My guess is we are about 2 local models away from having something like Sonnet 4.6 running locally in which case, we'd only need SOTA models for planning phases, difficult debug sessions, and pen-testing.
Nepherpitu@reddit
Please, use your RTX 6000 with sglang or vllm. You will get around 120tps on coding tasks with 27B model.
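For anyone wanting to try this, a serve invocation might look something like the sketch below. The model path is a placeholder (substitute the actual HF repo id), and the flags are standard vLLM options; tune `--max-model-len` and `--gpu-memory-utilization` to your card and workload:

```shell
# Untested sketch: serving a 27B BF16 model with vLLM on a 96 GB card.
# "Qwen/Qwen3.6-27B" is a placeholder model id, not a confirmed repo name.
vllm serve Qwen/Qwen3.6-27B \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90
```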
Sea-Ad-5390@reddit
Any chance you can share your vllm parameters? That is not what I’m seeing on my end and I keep getting tool call crashes randomly. Trying to update to the new cu130 nightly to see if that will resolve it, so far so good.
gtrak@reddit
I have been running what you are suggesting for 2 months now: cloud model as planner, Qwen at Q4 or whatever on a single 4090 as implementer. It's a great way to stretch the expensive/scarce tokens further, even on larger codebases. I think a review/implement loop is critical though; definitely something you should try.
solidsnakeblue@reddit
I've been trying different configurations myself. Personally, I've found it better for my use case to use cloud models as orchestrators and builders, and local models as sub-agents for research, heavy context intake tasks, and Devil's advocate style review at key points during planning.
grumd@reddit
Can you tell me more about review/implement loop? Do you use a framework or something to set that up? Do you use a more expensive model as reviewer?
gtrak@reddit
Yes, I am hacking on my own agent right now that is a pile of code to do the same thing, but I first built it on top of GSD. It's pretty simple in concept:
relevant snippets:
execute-phase additions:
https://github.com/gtrak/get-shit-done/blob/0c863c8590dfed968163f30c6e77e81d9ba9f7c4/get-shit-done/workflows/execute-phase.md?plain=1#L178-L200
https://github.com/gtrak/get-shit-done/blob/0c863c8590dfed968163f30c6e77e81d9ba9f7c4/get-shit-done/workflows/execute-phase.md?plain=1#L513-L669
reviewer agent prompt:
https://github.com/gtrak/get-shit-done/blob/0c863c8590dfed968163f30c6e77e81d9ba9f7c4/agents/gsd-code-reviewer.md?plain=1
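In concept, the loop those links describe can be sketched roughly like this (hypothetical function names, my reading of the idea, not GSD's actual implementation): the implementer model produces work, a reviewer model critiques it, and the two iterate until the reviewer approves or a round limit is hit.

```javascript
// Minimal review/implement loop sketch. `implementer` and `reviewer`
// stand in for model calls; the reviewer could be a cheaper local model
// or a stronger cloud model depending on the setup.
function reviewImplementLoop(task, implementer, reviewer, maxRounds = 3) {
  let work = implementer(task, /* feedback */ null);
  for (let round = 0; round < maxRounds; round++) {
    const review = reviewer(task, work);
    if (review.approved) return work;
    // Feed the reviewer's critique back into the next implementation pass.
    work = implementer(task, review.feedback);
  }
  return work; // give up after maxRounds and keep the latest attempt
}
```

The key design choice is bounding the rounds, since two models can otherwise ping-pong indefinitely on style nits.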
grumd@reddit
Big thanks my friend! This will definitely be helpful!
gtrak@reddit
I had to actually run the numbers
grumd@reddit
Do you have something like consistent test coverage? A general plan, architecture documents? How do models begin to understand a project with 100k LOC being added per month? Do you have documentation that they keep maintained?
gtrak@reddit
I also intend to have a separate integration test suite that is more manually maintained, but still prototyping.
gtrak@reddit
I don't have a great answer for that yet. GSD, in my opinion, just does too much at once, so I've gone simpler: sticking to plan mode and emitting a markdown file, then using GSD to split it into serialized work rather than to do the planning.
For keeping the codebase consistent with docs, I think I really want the same approach as https://www.lat.md/ , so I'm considering extracting specs and rewriting the whole thing again.
This codebase is at 45k lines right now. That big bang rewrite day was extracting to individual crates, and it helped to have an ARCHITECTURE.md to lay out the layers and what can depend on what.
Today, I'm reworking all the SQL/persistence and making the types more restrictive, which broke it all again.
NNN_Throwaway2@reddit
The question is, who will be producing models in this size class two generations from now. Alibaba seems to be drastically scaling back on open source. Gemma is kind of the only alternative at this point and we don’t know their future plans either.
Demonicated@reddit (OP)
I think China has every intention of keeping it going. It's a form of economic warfare. These American SOTA companies are bleeding capital trying to win the AI game and already gave us Opus 4.6, which I consider to be the "good enough" model that predictably increases productivity.
As long as China (and any other open-source lab) keeps at it, it basically makes those companies' investment fruitless. All those investors lose out.
The next thing coming is gating hardware as a last-ditch tactic for controlling consumers. You know, free market and such /s
NNN_Throwaway2@reddit
I'm not really seeing the evidence for that. By the looks of it, Alibaba has open-sourced all they intended to for Qwen 3.6. GLM has abandoned the Air size class. While I don't doubt we will continue to get large open models, at least for a little while, it's increasingly looking like models small enough to run on consumer hardware are no longer being developed. Or, when they are, they aren't anywhere close to the frontier.
Demonicated@reddit (OP)
They just dropped the dense model today and have expressed commitment to continuing to do so.
Also, I would expect that, as with the new Intel accelerators, we will see more companies coming into the hardware scene. These companies will need OSS models to exist to sustain consumer sales. There are a lot of people out there willing to drop $10-20k on local inference. Someone will want to claim that market segment, and prosumer hardware and local models have a symbiotic relationship.
You are correct that they won't be frontier, though. But as a heavy user of inference I can say that for the most part SOTA already surpassed 99% of my needs at Opus 4.6. We just need good enough to do real work, and I can tell you that Gemma 4 is an absolute workhorse for all my SaaS agents. I have solar and batteries and have entire products that run for "free". I don't need the best, just good enough. And I think we're there, so any yearly update is good enough for me in the business sense.
NNN_Throwaway2@reddit
I haven't seen any evidence that consumer demand is actually driving hardware sales in any meaningful capacity. It looks more like companies are trying to shove AI in every product hoping something will stick.
If you actually read the blog post for Qwen3.6 27b, there is no indication they intend to release any more 3.6 models, and they appear to imply the family is complete.
Demonicated@reddit (OP)
NNN_Throwaway2@reddit
"For now" right after they let go one of the main champions of their open source work.
All indications are that they are moving in that direction. We're not getting an open-weight Qwen 3.6 Plus. We're also apparently not getting any other sizes beyond the 27B and 35B. Maybe the blog post was just extremely misleadingly worded; I guess we'll see.
Porespellar@reddit
China is also probably going to dominate in inexpensive humanoid robots, they will win on both price and features most likely.
Puzzleheaded_Base302@reddit
Nvidia can release a new Nemotron. Maybe AMD and Intel will also join the race to release open-weight models.
Johnwascn@reddit
Have you compared it to the Q4 quantized version of Qwen3.5-122B? I use that model frequently, and it's currently the one I'm most satisfied with.
NebulaBetter@reddit
Can this card handle the full context window? I have the Pro too, but I’m using FP8 since I’m not sure the full context would fit in FP16.
Demonicated@reddit (OP)
I was using about 70GB with full offload and full context. In LM Studio I turn off "try mmap".
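Back-of-envelope arithmetic shows why a 27B model in BF16 plus a long context lands in that neighborhood. Every architectural number below (layers, KV heads, head dim) is an assumed placeholder for illustration, not the real Qwen config; the point is the shape of the calculation:

```javascript
// Rough VRAM arithmetic for a 27B BF16 model. All config values are
// assumptions for illustration only.
const GiB = 1024 ** 3;

const params = 27e9;                 // 27B parameters
const weightBytes = params * 2;      // BF16 = 2 bytes/param (~50 GiB)

// Assumed GQA config (placeholders): 48 layers, 8 KV heads, head dim 128.
const layers = 48, kvHeads = 8, headDim = 128, bytesPerElem = 2;
// K and V per token across all layers:
const kvPerToken = 2 * layers * kvHeads * headDim * bytesPerElem; // ~192 KiB

// If ~70 GiB total is observed, roughly this much remains for KV cache:
const kvBudget = 70 * GiB - weightBytes;
const maxTokens = Math.floor(kvBudget / kvPerToken); // order of ~100k tokens
```

So on a 96 GB card, BF16 weights plus a six-figure token context is plausible, which matches the ~70GB observation above; FP8 KV cache would roughly double the token budget.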
awitod@reddit
I think it means that we are at a place where we can make a very wide range of good applications driven by local AI.
We really only need a step up beyond what we can do locally for large unbounded problems at this point.
If you can bring some engineering to a repeatable thing, you can probably put it in a box in a server room or under a desk.
qubridInc@reddit
Feels like the sweet spot now, use strong hosted models for planning and let local models handle most of the implementation work.
Demonicated@reddit (OP)
Yeah, kinda what I'm feeling. But I need to play with it more to feel confident we've actually hit that sweet spot. It's clear the venture capital is running out and these closed-source SOTA companies are looking to start the squeeze.
I'm hoping we can get some 70B sized models again soon. I feel like a 70B q8 will be really nice for exactly this workflow.