I have installed llama.cpp and qwen3.6 27b for coding but too scared to try it...
Posted by bonesoftheancients@reddit | LocalLLaMA | View on Reddit | 25 comments
First - I am a vibe coder with no real knowledge of coding languages other then Basic and JS. However, using LLM coding I managed to create python and cpp software that works exactly like i want it to.
basically i've been using antigravity with claude and gemini and tbh claude proved to be the most reliable for coding so far BUT expensive. I have installed llama.cpp 3.6 27b IQ3 XXS (I have a 5060ti) but keep using claude because im scared it will screw up my code or the very least just waste my time... is it good enough for production? Do you feel you need to have more coding knowledge and experience to use it compare to using claude?
Also, what coding UI do you use it with - I want something that "remembers" context and automate execution (like antigravity or gemini-cli do)
danihend@reddit
Nothing to do except try it out. There's really no other way to get familiar other than comparing results. You can also have a frontier model evaluate the difference between two models' outputs so you can get an idea of where it might be weak. I would advise you to stick heavily to GPT5.5, Claude 4.8 for the detailed planning. Deepseek 4 us interesting for implementation due to cost. Also Composer 2.5 is great in that regard and generous usage in their plan, along with some GPT/Claude usage too (less)
fabyao@reddit
AI is a reflection of your knowledge. The less you have the worst your instructions/prompts are. Even the most expensive LLM will produce sub-optimal work. I am a software engineer with a CS degree. I recently asked Claude to produce an infrastructure as code project with Terraform which I never used before. At some point during the project they were certain aspect of Terraform which didn't make sense. I then paused and learned the basics. With my new knowledge it was clear that my project didn't meet the standards, difficult to maintain and rigid. I started from scratch again. Also I am using Qwen3.6 27B Q4 which works great.
supracode@reddit
Vibe coding is an ok way to get started and learn, but if you want to build something real, you need to create a plan. To "pick up where you left off" you will need to tell your agent to follow your plan, broken up into smaller md plan files. After each step tell the agent to mark step x completed. That will create the memory so you can pick off where you left off. In your plan, tell the agent to create unit tests, and if you are building a ui, look into Playwright for UI testing. Tell the agent to create a documentation folder and keep it updated as the project grows. If you keep that in context, it won't need to scan your whole project to understand what needs to be worked on.
TimmyIT@reddit
As mentioned just copy it in to a new project or branch and test it out. Run both side by side and see the result. Depending on your goal they can work together in a hybrid scenario. Example on that could be that for some more complex tasks, let Claude develop it and let Qwen verify it.
Another hybrid approach is to have Claude develop it but you try to save as many tokens as possible. For example don't make the code be human readable and remove any unnecessary spaces in the code. Then you take the code and let Qwen make it human readable and add comments.
goldcakes@reddit
The second part is terrible advice I'm sorry.
First, all LLMs will write worse code when you tell it to make it not human readable.
Second, if you're still doing the main coding with Claude, you'll be having cache misses (from the Qwen edits) and wasting your usage/tokens.
TimmyIT@reddit
Thats a fair point and some good advice. My comment was more aimed towards helping OP think out-side of the box and come up with other scenarios on how to use and play around with LLMs.
fasti-au@reddit
+## Hardware
+- **Host:** ASUS X299, Intel i9-10980XE, 64GB RAM
+- **GPUs:** 3x RTX 3090 24GB + 1x RTX 3090 Ti 24GB
+- **PCIe:** 2x PIX pairs (0-1 CPU-direct, 2-3 PCH-routed)
+- **Power:** GPU0/3 at 350W, GPU1 at 300W, GPU2 at 250W (X299 ASPM instability)
+- **Docker:** vllm-bee:nv (4.84GB, TurboQuant + DFlash built-in)
+
+## Config A: 35B-A3B MoE (WINNER)
+
+| Metric | Value |
+|--------|-------|
+| Model | Qwen3.6-35B-A3B-MTP-UD-IQ4_XS.gguf (17GB) |
+| Drafter | dflash-draft-35b-a3b-q4_k_m.gguf (279MB) |
+| KV Cache | turbo4 K (2x compress) + turbo2 V (4x compress) |
+| Spec Decode | DFlash n-max=5, reasoning OFF |
+| VRAM Used | 18.1 GiB |
+| VRAM Free | 5.5 GiB |
+
+### Per-Card Throughput
+
+| GPU | Power | TPS (code gen) | Stable? |
+|-----|-------|----------------|---------|
+| GPU0 3090 | 350W | 128-139 t/s | ✓ |
+| GPU1 3090 | 300W | 140 t/s | ✓ |
+| GPU2 3090 | 250W | 140 t/s | ✓ (at 250W) |
+| GPU3 3090 Ti | 350W | 154-155 t/s | ✓ |
+| **3-card total** | | **434 t/s** | |
+| **4-card projected** | | **\~574 t/s** | |
+
+### Context Recall (GPU0)
+| Context | Recall | Time |
+|---------|--------|------|
+| 4K | 100% | 1.9s |
+| 8K | 100% | 2.6s |
+| 16K | 100% | 4.4s |
+| 24K | 100% | 6.2s |
+| 32K | 100% | 2.5s |
+
+**No recall wall within 32K context window.**
+
+### Worker Budget
+- 5 workers at 32K ctx per card (\~1.3 GiB KV each)
+- 2 workers at 64K ctx per card
+- 3 cards = 15 concurrent workers at 32K
+
+## Config B: 27B Dense
+
+| Metric | Value |
+|--------|-------|
+| Model | Qwen3.6-27B-MTP-Q5_K_M.gguf (18.6GB) |
+| Drafter | dflash-drafter-3.6-q4_k_m.gguf (\~1GB) |
+| KV Cache | turbo2 K + turbo2 V (4x compress both) |
+| Spec Decode | DFlash n-max=5 |
+| VRAM Used | 18.9 GiB |
+| Speed | 53-80 t/s per card |
+
+**35B MoE is 2x faster AND uses less VRAM than 27B dense.**
+
+## Key Discoveries
+
+1. **`--gpus device=N` works with vllm-bee:nv** — previous skill docs said it crashes
+2. **DFlash on 3090 Ti confirmed working** — 155 t/s on Ti
+3. **35B MoE > 27B dense** — only 3B active params out of 35B
+4. **turbo4 K + turbo2 V** — protect attention keys, compress values hard
+5. **reasoning OFF** — prevents Qwen3.6 token waste on structured prompts
+6. **X299 ASPM** — disable in BIOS, keep ≤300W per GPU for stability
im too busy to reboot but thats your perfect coder as far i can find for 64K context per worker
FullstackSensei@reddit
You'll have a bad experience with Q3.
Instead of being afraid, how about you try to learn some stuff? YouTube is full of courses, including full university courses about the fundamentals of CS. You can probably learn a good deal of the fundamentals watching 40-60 hours of videos, with an equal amount of time practicing. That's 3-4 months at one hour a day. Even if you keep using Claude, you'd save a ton of money because you'd actually have an idea of how to do things rather than thrashing the LLM around to see what sticks
goldcakes@reddit
If you don't learn well from videos, there's still plenty of great books. The key is fundamentals of programming and software engineering. Computer science is nice, but not essential.
I found a lot of CS materials to be too math-y and that's just not how my brain works, but just following a few books really helped.
FullstackSensei@reddit
In my experience, there are more high quality DS&A and programming courses published by CS departments.
You absolutely don't need to dive into the math heavy stuff. They're not freshmen level anyway
BigYoSpeck@reddit
This is where you want to learn about version control. If a coding agent gets a task wrong it shouldn't "screw up" your code. You should be able to roll it back or dispose of the changes
27B is at least in Q6-Q8 form quite a capable model. I don't know how much damage going down to Q3 does but I would wonder if 35B in full Q8 while ultimately less capable of complexity would at least be more reliable in not making errors
For making a clean sheet UI, Qwen is really capable. Not Claude level for something complex, but for simple bits and pieces it probably can churn out components which are nearly if not just as aesthetic.
I use Claude and Codex in VS Code for the big things, but I have llama.cpp in model router mode added to Copilot via customendpoint
Even at Q8 it isn't going to come close to Sonnet/Opus or GPT 5.4/5.5. But for the little things you might use Haiku or GPT Nano for it's not as fast, but probably more capable
awpenheimer7274@reddit
Aye dumdum don't be afraid.
Just copy a codebase and test it's capabilities first. You have agents you trust right? So for starting out - give it one file from a codebase - ask it to improve it - copy paste it's output to the ai you trust - cross verify actions until you are satisfied/you have a feel for the limitations.
The more you experiment with different models the easier it gets. And also don't forget to experiment with the llamacpp knobs. I made my own and I'm experimenting with 3070+3060 = 20GB vram on some of the worst bottlenecked hardware, the llama controls are the difference between 5t/s and 55t/s. I use pi agent. It's simply the best fit.
And as always - trust but verify the settings that agents give you to run llama.cpp - I've noticed that they ask me to run 10k context windows when my incremental testing proved that I can run up to 50k. They always lowball imo.
Also - local llm for production use? Not if you don't have rtx pro blackwell 6000 or dgx spark. Pay cloud subscription. Local is simply not worth it right now.
alexwh68@reddit
Vibe house building, get some bricks stick them on the ground and start building. Can you build a house, yes, can you build a house that lasts….
Q3 should be used by people that know the difference between good code and crap, because at Q3 there is a good probability that there will be crap code. Q6 and above is where things get sensible.
If you don’t have a machine with a sensible amount of ram and good GPU performance, stick with the api’s.
Personally local starts to work reasonably well at 64gb of ram (on a mac) and the more you have above that the better things get.
Use git and backups to protect your code just in case AI trashes your work.
Healthy-Nebula-3603@reddit
Bro NEVER goes below q4km /l / xl
Even q4 is slightly retarded for coding.
For instance books translations you can't use anything below Q8 ( fp16 recommend) because you easily notice in translation degradation ( missing a lot nuances)
Endurance_Beast@reddit
Yeah, dont use anything below Q6 for coding
Tema_Art_7777@reddit
for coding context size is important. 27b will only give you 20-32k at best on that card (i would not use anything under 4 quant) and it will not be fast. For basic coding, I would try 9b that actually fits on the card with good amount of context size.
autisticit@reddit
Nothing is good enough for production if you can't understand the code.
jopereira@reddit
I run this on my 5070ti:
turboquant-plus-tqp-v0.1.1-windows-x64-cuda12.4\llama-server.exe -m J:\LM_Studio_Models\bartowski\Qwen_Qwen3.6-27B-GGUF\NO MTP - Qwen_Qwen3.6-27B-IQ3_XXS.gguf --fit-target 0 --jinja --min-p 0 --presence-penalty 1.5 --repeat-penalty 1 --temp 0.6 --top-k 20 --top-p 0.95 -c 160000 -ctk turbo3 -ctv turbo3 --host 0.0.0.0 --port 4321 -np 1 --reasoning off
It works VERY WELL (not expected it to be that good).
I recently encountered a problem it couldn't solve so I tried DeepSeek V4 PRO, Nemotron Super 3, Grok Build 0.1 and Claude Opus 4.6 - none of them did the trick either.
It sounds impressive because it is!
note: the current version of bartowski is MTP-enable. You have to download the previous version (non-MTP).
jd52wtf@reddit
I stall and learn some GIT. For a case like this you can create a separate branch/fork that can contain alternate trial builds using the local models. If things go bad you just move back to the main branch and everything is back the way it was.
If things go well you can pull over/merge the changes back into the main branch.
GIT is one of those programmer superpowers that everyone should know about.
Also keep in mind that I'm not talking about GitHub. That's also useful to look into but local GIT should do the trick.
Good luck!
MachineZer0@reddit
My daily drivers are now Claude Code w/Sonnet 4.6 and Pi Code with Qwen3.6-27b. I’d say the Pi setup is 95% good of Claude setup. Pi/Qwen is definitely way more verbose; maybe 20x more, unless pi is showing cache tokens. I host Qwen on vLLM/RTX 6000 pro Blackwell 96gb (Runpod shared with a dozen people, otherwise RTX 5090 is adequate for 1-3 concurrent users). The Pi setup seems way faster than Sonnet. However it tenaciously brute forces a lot. Sonnet completes tasks faster, even though time to first token and tok/s appears slower. Maybe I need to mess with reasoning budget. But concerned it may impact quality of output.
Use version control and review/test changes between prompts. If you use Pi/Qwen, be very succinct with prompts. It tends to carte Blanche often and go beyond the task if you don’t set goal and definition of done explicitly. I have to stop it mid-tool call several times a day since it is overzealous.
goldcakes@reddit
I strongly, strongly recommend you to learn a little bit about how to code. I promise you there is even joy in it, even if it's frustrating to learn (I was trying to learn on and off for a couple years, before one day where things finally started 'clicking').
Having this knowledge is going to help you work with AI coding agents a lot more; not just debugging but especially in terms of prompting better. There's concepts common to all programming languages -- I'd say I'm only 'fluent' in Python and JS, but I can read and more or less understand code in any language.
HornyGooner4402@reddit
Any LLM is gonna fuck up your code if you have no knowledge of what it writes, no matter what the price tag or benchmark says
NigaTroubles@reddit
Have you watched The Matrix and Terminator ? Yes thats what will happens to you
HelloSummer99@reddit
Well, why change something that works for you? You don't have to use local LLMs. Claude is not even that expensive for the moment. It's very likely that a local 27b model will either not even run or will underperform. 5060ti is just 16GB VRAM
secunder73@reddit
just try it on different branch or a new project, its capable but not enough for someone