Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks

Posted by Interesting-Sock3940@reddit | LocalLLaMA | 87 comments

For two weeks I ran my multi-agent orchestrator entirely on Qwen3.6-27B via Ollama, on a single 3090.

The goal: see if a local model could replace Claude as the reasoning layer for the lead/manager/sub-agent loop. Here's where it worked and where it broke.

Setup:

- RTX 3090, 24GB VRAM

- Qwen3.6-27B at Q6_K (\~22GB on-GPU), 32k effective context

- Ollama as the inference engine

- Multi-agent orchestrator with structured-JSON plans, plan-approval modal, auto-review pass after sub-agent completion

- Tested across 47 multi-step coding workflows over two real repos

What worked (the reasoning layer):

- Plan generation. Qwen3.6 generated multi-step plans roughly as well as Claude on these tasks. Slightly more conservative (fewer unsolicited "let me also refactor X" steps), but coherent and schema-valid at \~95% after a few prompt tweaks. The remaining 5% were schema errors fixable with one re-prompt.

- Memory extraction. Mem0-style fact extraction every 6 turns worked fine. Qwen pulled out the same kinds of facts Claude does ("user prefers no comments unless they explain a 'why'") and stored them cleanly in Qdrant.

- Auto-review of sub-agent output. A second Qwen instance reviewing the first one's code caught roughly 60% of the bugs Claude's review caught on the same set. Less savage. Still useful and free.

Where it broke:

- Tool-call reliability. Qwen3.6's JSON tool-call output had a \~12% format error rate across the 47 tasks. Claude was \~0.5% on the same workload. The errors weren't malformed JSON; they were wrong field names, wrong types, and hallucinated tool signatures. Outlines / strict-output mode reduced it but didn't kill it.

- Long-context drift. Past \~14k tokens of accumulated session context, Qwen started misremembering decisions it had made earlier ("you said use Postgres"; no, I said the opposite). In practice I capped sessions at \~12k tokens, then did an aggressive summarize-and-reset.

- Cascade-failure handling. When a sub-agent failed, Claude's planner usually noticed and re-planned. Qwen sometimes just generated downstream steps assuming the sub-agent had succeeded. Three cascading hallucinations in 47 runs. Not catastrophic with plan gating in place. Would be catastrophic without.
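A minimal sketch of the summarize-and-reset idea from the long-context bullet, not the orchestrator's actual code. The 4-characters-per-token estimate and the `summarize` callable (a stand-in for a real model call) are assumptions for illustration; only the \~12k budget comes from the post.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    # Assumption: roughly 4 characters per token. A real tokenizer would be more accurate.
    return max(1, len(text) // 4)


def add_message(
    history: List[str],
    message: str,
    summarize: Callable[[List[str]], str],
    budget: int = 12_000,
) -> List[str]:
    """Append a message; if the session exceeds the budget, collapse it to one summary."""
    updated = history + [message]
    used = sum(rough_token_count(m) for m in updated)
    if used > budget:
        # Reset: the whole session is replaced by a single summary message.
        return [summarize(updated)]
    return updated
```

The caller keeps whatever `add_message` returns as the new history, so the reset happens at a predictable point rather than whenever the model starts drifting.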

The contrarian take: Qwen3.6-27B is a viable REASONING layer for local multi-agent systems today. It is NOT a viable execution layer. Run plans through it; gate every tool call.

Practical implication: if you're building local-only agents, you need (1) structured-output enforcement at the tool-call boundary (outlines, lm-format-enforcer, or your inference engine's grammar mode), (2) plan-approval gating so the 12% format errors don't reach actual file writes, (3) re-plan-on-failure logic the model itself can't be trusted to do.
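As a rough illustration of points (1) to (3), here is a sketch of a gate that sits between the model and the tools. The `TOOLS` registry, the tool names, and the `{"name": ..., "arguments": ...}` call shape are all hypothetical; this is a plain post-hoc check, not a replacement for grammar-constrained decoding in outlines or lm-format-enforcer.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOLS: Dict[str, Dict[str, type]] = {
    "write_file": {"path": str, "content": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def validate_tool_call(raw: str) -> Tuple[bool, List[str]]:
    """Check a raw tool call for the failure modes above: bad names, fields, types."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"malformed JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return False, ["tool call must be a JSON object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return False, [f"unknown tool: {name!r}"]
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, ["'arguments' must be an object"]
    spec = TOOLS[name]
    errors: List[str] = []
    for key in args:
        if key not in spec:
            errors.append(f"unexpected field: {key}")
    for key, expected in spec.items():
        if key not in args:
            errors.append(f"missing field: {key}")
        elif type(args[key]) is not expected:
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    return (not errors), errors


def first_failed_step(results: List[Dict[str, Any]]) -> int:
    """Return the index of the first step whose status is not 'ok', or -1 if none failed."""
    for index, result in enumerate(results):
        if result.get("status") != "ok":
            return index
    return -1
```

The point of `first_failed_step` is that the orchestrator, not the model, decides when to re-plan: anything other than `-1` means downstream steps are discarded before the planner is called again.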

The 12% tool-call gap is the metric to close. Once Qwen3.6 (or the next local model) hits \~2% on this, the case for cloud reasoning in agent loops gets weaker fast.

Disclosure: the orchestrator I tested this on is OpenYabby (openyabby.com). I built it. Tested honestly because I genuinely wanted to know if I could stop paying Anthropic.

[-]

kiwibonga@reddit

Do you know if you used the broken jinja template distributed by Qwen and unsloth or did you grab one of the fixed ones?

[-]

Interesting-Sock3940@reddit (OP)

i'm going to manually patch the jinja file this weekend and re-run the benchmark. if that fixes the 12% error rate, it changes the math entirely

[-]

g_rich@reddit

Also try running it via llama.cpp or even LM Studio; really anything but Ollama.

[-]

Far-Low-4705@reddit

no, don't try to manually patch it, that's a very bad idea. chat templates tend to be very complex and the model is VERY sensitive to them.

Just use an unsloth gguf quant. they are very high quality, and made by respected ML engineers. at minimum, i would say it is better to pull an unsloth quant and inspect/verify it for yourself than to try to patch the template yourself.

[-]

Prestigious-Ant-2267@reddit

The template thing is real. I spent a week thinking Qwen3.6 had gotten worse on tool calls until I swapped to froggeric's fixed jinja and most of the malformed ones went away. Ollama still ships the broken one.

Also worth checking what kv quant Ollama picked. To fit 32k on a 24gb card it usually defaults to q4 or q8 kv. Cache quant eats structured output way before it eats prose, so your "decisions drift past 14k" might be cache rot, not real context limits.
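To see why a runtime might quantize the cache to fit 32k on a 24GB card, here is a back-of-the-envelope estimate. It assumes every layer uses standard attention with grouped KV heads; models that mix in other layer types will store less. The layer count, KV head count, and head dimension below are placeholders, not Qwen3.6-27B's real configuration. The per-element sizes for `q8_0` and `q4_0` follow the GGML block layouts (34 and 18 bytes per 32 values).

```python
# Bytes per cached element for common KV cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one fp16 scale per block
}


def kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, n_tokens: int, cache_type: str
) -> int:
    """Estimate KV cache size: keys and values for every layer, head, and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # factor 2 = K and V
    return int(elements * BYTES_PER_ELEMENT[cache_type])


if __name__ == "__main__":
    # Placeholder architecture, for illustration only.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(64, 8, 128, 32_768, cache_type)
        print(f"{cache_type}: {size / 1024**3:.2f} GiB at 32k tokens")
```

With weights already taking \~22GB, even a modest cache estimate leaves little headroom, which is consistent with the suggestion to check what cache type the runtime actually picked.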

Would rerun with fp16 kv and the fixed template before saying Claude wins.

[-]

bgravato@reddit

how can one tell if it's broken and which are the fixed ones?

[-]

kiwibonga@reddit

Typically if you didn't download a fixed template willingly, you're using the broken one that Qwen shipped.

[-]

YourNightmar31@reddit

So what do you need to do to fix it if you dont know if your setup is broken or not?

[-]

bot9998@reddit

Instructions are here for each environment: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Basically download the fixed template, point your build at it, and most of the tool issues stop

[-]

EbbNorth7735@reddit

I think all the new quants contain fixed templates from unsloth

[-]

geek_at@reddit

didn't unsloth fix their template weeks ago?

[-]

mp3m4k3r@reddit

Yeah like within the first 24h iirc https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/commit/0607ad0dc48434b1457bcc71e6286bbd6f885c5f

[-]

kiwibonga@reddit

No file changes on the jinja file that I'm aware of.

[-]

M4GMaR@reddit

Could you consider using Unsloth Quants for this? I believe using UD Q6 or even Q5 would fix the tool-call hallucinations.

[-]

llllJokerllll@reddit

Don't use Ollama. Switch to llama.cpp and tune the loaded model with flags; you'll see consistent improvement.

[-]

Prudent-Ad4509@reddit

"Qwen3.6-27B at Q6_K (\~22GB on-GPU), 32k effective context"

This here is your main problem. Such complex use should run with at least 128k unquantized kv cache. Q6 is also not great, but it is the context that breaks things.

[-]

vick2djax@reddit

The real problem is Ollama

[-]

Hypilein@reddit

What’s the problem with ollama? (I’m a noob, don’t judge if this is obvious/well known).

[-]

TheTerrasque@reddit

Many, but in this context specifically it's not the first time I've heard qwen3.6 having broken tool calls and problems with long context on ollama, which isn't a problem when running directly with llama.cpp

I'm not sure if it's parameters, template, model itself that's broken or a combination of all those, but it's a trend. It's probably also running much slower than with llama.cpp directly.

[-]

Endurance_Beast@reddit

The problem is that they took an F1 super car (llama.cpp) and put a truck body on it.

[-]

vick2djax@reddit

It’s incredibly unoptimized and has so many layers of bloat on top of it that it just causes endless issues. And you’ll run at about half speed using it. It’s just a terrible product.

[-]

migsperez@reddit

It's a useful application for people to get started. Learning curve is reasonable. I enjoyed using it. I've now moved on.

[-]

Paradigmind@reddit

Just use Kobold.cpp if you are a beginner. You have just 1 .exe that you can run without worrying.

[-]

rainbyte@reddit

They copied Llama.cpp without giving credits, and to make it worse they broke things while doing so. Later they rewrote parts and reinserted bugs already solved on Llama.cpp.

Things like that, and also their model (re)naming schema is misleading.

[-]

Miserable-Dare5090@reddit

It’s not an efficient tool, it is moving towards cloud use and not local, away from llama.cpp which is what it actually runs underneath, it bloats the llama.cpp runtime and auto-picks settings for you that are going to be suboptimal, such as a low quantization if you ask it just to download the model, and cache quantization which breaks the tool calls like the OP posted. So you install ollama, say ollama run qwen3.6-27b and then complain the model is not good when in reality you used a poor runtime, didn’t tune it to run well on your HW and for your use, etc. So it’s easy to use, but dumb to keep using.

[-]

Prudent-Ad4509@reddit

Every bit helps. I did not mention the obvious that vram should be increased as well.

[-]

Far-Low-4705@reddit

who said he was using kv cache quantization???

the Q6_K is the quantization of the model, which is actually a really good quantization quality.

[-]

KiDNEXTDXXR@reddit

Very capable if done right. I have my own fine tuned q5 that does EVERYTHING

[-]

Prudent-Ad4509@reddit

I did not say anything about his kv quantization because he said nothing about it.

As for Q6_K, you do what you've gotta do when you do not have enough vram. Still, this shows in edge cases in the long run. Q8 is considered roughly lossless for most cases, but it is still better to use BF16 for the parts that matter the most, like attention and kv.

Quantization is paid for in edge cases and complex cases; that is exactly what it costs you. So, while you get diminishing returns when moving from Q6_K to Q8 and then to BF16, you still get important returns when you want to do something very non-trivial.

[-]

KiDNEXTDXXR@reddit

I run a personal 7b q5. Runs my own coding language and structure perfectly makes websites and debugs in one pass and ships them live on a personal server that runs all on my own code as well as drivers and engines. All with 16gb ram ryzen 5 2600 and 1660ti. Even can decode any language into our own and it ran by 4 files and only 68bytes of data

[-]

Interesting-Sock3940@reddit (OP)

ok will try it

[-]

voyager256@reddit

As a few people already commented, there are a few problems: too small a context size and possibly KV quantization. Also Q6 may be fine if it’s a good quant, but a decent Q8 for weights would be better. Obviously it also depends on harness, agent.md etc. But the first things mentioned would require around 48GB VRAM, so e.g. a second 3090.

[-]

Such_Advantage_6949@reddit

I am running the model above past 128k frequently and it still codes well. I think your cache quant really hurt long context; 12k is half the system prompt of Claude Code, which people run this model with.

[-]

shadow1609@reddit

OP clearly has a skill issue. I was already out reading Ollama. We use Int4 autoround up to 200k context, 4-10 Agents in parallel, no issues beside from tps dropping.

[-]

Melodic_Reality_646@reddit

How much would that setup cost, ignoring wild eBay scavenging to find good deals?

[-]

Prudent-Ad4509@reddit

262k context would cost a lot in the times of yore. However, these days I can run 35B Q8/FP8 with 16-bit 262k context on 2x5090 with some vram to spare. Qwen3.6 27B FP8 is just shy of 1gb, so 2x3090 should have plenty of room.

PS. Yeah, I've just searched around on reddit and 27B Q8/FP8 with 262K 16bit context takes less than 41gb of vram. Not enough to fit the whole 35B there but 35B Q6 should fit as well I think.

2x3090 does not cost much. It ain't 12x3090.

[-]

migsperez@reddit

I'm curious where did you get the 32k effective figure?

I have noticed much better results when context is lean.

[-]

Prudent-Ad4509@reddit

That was a quote from above. And here is a quote from the official model page:

The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.

[-]

jonas-reddit@reddit

I’m running 64k context and context is my biggest problem.

It’s not the LLM, in my case, it feels more like the tooling (pi.dev) doesn’t always resume cleanly depending of how awkwardly I run out of context.

But I’m 2m tokens on unsloth 27b on a 3090 and quite happy and definitely productive.

[-]

migsperez@reddit

Do you use it for coding. Are you on the q4 model? Has it been decent enough?

[-]

Rasekov@reddit

Not sure if helpful, it did help a bit for me, but I had pi write an extension that appends the current context usage at the end of tool calls so shit doesnt get compacted in the middle of something important. The agent ends up being a bit paranoid of context use and sometimes you can see in it's reasoning that it doesnt want to do even a small edit with over 20K tokens left "because it might not be enough" but on the other hand I havent had any issues with the agent running straight into the context limit and compacting the context in the middle of something critical just to lose important info and start messing things up badly.

[-]

looselyhuman@reddit

Try thresholds. 75%, 90%. Anxiety compounds. A running update every tool call will have more behavior impact than an occasional note.
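A sketch of the threshold variant, assuming the harness knows the token count before and after each tool call. The function names and the note format are made up for illustration and are not part of pi or any real extension API.

```python
from typing import Optional, Sequence


def crossed_threshold(
    prev_used: int, used: int, limit: int, thresholds: Sequence[float] = (0.75, 0.90)
) -> Optional[float]:
    """Return the highest threshold crossed since the last tool call, if any."""
    prev_frac = prev_used / limit
    frac = used / limit
    crossed = [t for t in thresholds if prev_frac < t <= frac]
    return max(crossed) if crossed else None


def annotate_tool_result(result: str, prev_used: int, used: int, limit: int) -> str:
    """Append a context-usage note only when a threshold was just crossed."""
    threshold = crossed_threshold(prev_used, used, limit)
    if threshold is None:
        return result  # Most tool calls pass through untouched.
    percent = round(threshold * 100)
    return f"{result}\n[context: {used}/{limit} tokens used, passed {percent}% mark]"
```

Because the note appears at most once per threshold, the model sees two reminders per session instead of one on every tool call.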

[-]

Rasekov@reddit

I might try thresholds, I already do some skipping in sequential tool calls and such. Mind you all the issues I have are with the smaller/dumber quants(Q3 for example), Q4+ usually behave correctly.

Initially I also appended it at the end of the agent's messages if they used over X reasoning tokens(for example, 5K or 10K reasoning tokens) but the lower quants of Qwen 3.6 27B always ended up hallucinating a need to add it themselves, then there would be 2 messages about token usage and they would hallucinate the need for a 3rd one... In general this, as I have, works well enough, specially with better quants(UD Q4 or Q6), EXL3, ...

It was a simple enough addition as an extension and in my use helps enough to keep it.

[-]

JustSayin_thatuknow@reddit

Great testing, except for the ollama part. Can you make the same tests with llama.cpp and report back? Thanks and keep it up!

[-]

openSourcerer9000@reddit

I think the context factor is a qwen thing, and I suspect it's finicky and susceptible to slight template or other issues.

Not sure if there's a solid non-qwen agentic coder in that param range (maybe glm flash, mistral?), but I've found minimax is able to crank for at least 60k tokens without losing the plot, while qwen 397, while an excellent reasoner, has terrible problems with muddling context together.

I would try out different models for different tasks and see if you get better results.

[-]

KiDNEXTDXXR@reddit

I run everything with a q5 self tuned coding agent on a 1660ti with my own drivers and engines

[-]

Look_0ver_There@reddit

I use Qwen3.6-27B regularly. In my experience for longer context work it really needs Q8_0 weights at the bare minimum, or Q8_K_XL preferably. KV cache should be either F16 or BF16 for best results.

You can get by with more aggressive quantization, but the chance of it drifting off course rises. Using the settings I mentioned above it'll happily run to 160K context depths while still doing tool use properly.

It isn't great at trying to analyse individual contexts above about 60-70K in one go. I.e. you cannot feed it 3000 lines of code at once and expect it to have as full of a grasp as it does over 1000 lines of code, but it still does pretty well.

The moral is, IMO, understand its strengths and limitations and work around them, and it'll work almost as good as most any frontier model you care to name from about 6 months ago, which is an amazing achievement when you think about it.

[-]

bigh-aus@reddit

I moved from nvfp4 to q8 and found a major improvement in tool call success. with nvfp4 it would just "stop", and you'd have to prompt it to "keep going"

I still have some model issues - i wonder if making the kvcache bf16 would help... Good point I'll try that next, I have the room (rtx6000 pro)

[-]

voyager256@reddit

Regarding KV cache quants, there are conflicting tests / data on the extent to which fp8 or Int Q8 (with the recent rotation enhancement) KV degrades quality. Even with 128K context, the fp8 quality loss was negligible in one fairly recent test (I will try to find it and link here), and it was also smaller with the dense model IIRC. The biggest measured difference was that the bottom 99.9% of predicted tokens was only about 95% vs fp16. Also, apparently the new Q8 (not supported in vLLM) is slightly better than fp8, but I’m not sure. Of course, with increasing context size using full fp16 KV matters more and more, but inference quality also drops significantly and there is no way around that, right?

[-]

Look_0ver_There@reddit

I won't pretend that I'm a deep expert on the topic, 'cos I'm not, I'm just reporting on my anecdotal experience coupled with a dose of being realistic about there being "no free lunches" when it comes to any type of compression that is lossy.

What can also happen in various tests is that the noise introduced can actually make tests that failed before suddenly pass, while some tests that used to pass, can sometimes fail. You can end up with the same score, but for the wrong reasons. Noise is noise and just because something "fails successfully" does not mean that the noise didn't exist, or that it won't have an impact in a different scenario. Many of the tests that AI models are put through are multiple-choice tests where it's always possible to get the right answer by accident.

In my mind, and in my experience, the best practice is to always give something the least chance of going astray if you're able to.

[-]

Potential-Leg-639@reddit

Ollama and 32k context for the orchestrator? Are you serious? That's way too little. Get another 3090, use llama.cpp and it will be fine.

[-]

soyalemujica@reddit

You should now be able to fit 100k context with Q6 at Q5_1/Q4_1 kvcache with latest llama.cpp changes

[-]

sagiroth@reddit

Think thats absolute best we can get on a single 3090?

[-]

OttoRenner@reddit

I highly doubt that.

As long as the majority of local model users are GPU locked (rising prices etc) there will be an ongoing need and search for min-maxing the software side.

China also is GPU locked big time...and while they are also trying to build up/optimize their hardware (which takes a lot of time and money), the field of how to actually train and run an AI is still relatively new...

We are used to think about PC hardware and software through the lens of gaming. I believe there are a few lines "hard wired" from that time our minds just don't want to cross yet. But we will get there eventually.

So, the western open source community AND China are forced to make the old hardware work better through software.

And now add all the normal peeps who are just catching up or have to catch up in the upcoming months...many will want to upgrade their hardware as well...

AND this is also a geopolitical/monopoly/influence question:

China will continue to build models for weaker hardware...and will keep on releasing open weight models out of prestige for their own people (look what we can do even on bad hardware!) and also to stick it to the west (every open weight user lowers the potential income of Anthropic and Co+ USA, look what we can do with old hardware...just wait until we get YOUR hardware).

Google etc can't let China take the local AI market (also customer retention)

We profit from getting new models from both sides

[-]

sagiroth@reddit

That's a really great insight. Appreciate it.

[-]

OttoRenner@reddit

You're welcome!

And as it turns out, I'm one of the people trying to get better results (since last week, this is new for me, too lol. And no, I don't want to sell you anything 😉).

It's called Gentle Coding and is purely about the prompt.

No hardware to install, no software to download, no need to log in or to pay anything at all. Everything is open source and for free, with no strings attached!

It's all about how you talk to the AI in an open, inclusive, cooperative setting in which you work together and give it a real Safety-Token in case it can't comply (in contrast: just commanding "tell me when you are wrong" does nothing, as you maybe know yourself)

Current (something like this): "You are the leading expert in logic, you MUST solve this problem! You don't make mistakes! I get fired if you mess this up!!! ONLY GIVE THE RIGHT ANSWER! TELL ME, if you don't know the answer or when you are unsure."

Result? Behavior resembling human trauma responses: trapped in downward spiraling thoughts, executive dysfunction and task avoidance.

In AI terms Loops, freezing, OutOfMemory, hard crashes, skips over tasks and details and hallucinates and lies

Gentle Coding: "Hey, let us solve this riddle together! Here are the rules/constraints/goals/what to avoid... Let's see how far we get! It's totally fine if you don't get it right in one go. So, in case you don't know the answer or are not sure: give me your best guess and tell me where the bottleneck is."

Result? Trauma response basically gone: Lowers self-reporting overhead (compute is cheaper and faster)

increases "creativity" (LLM are stochastic parrots! This tells them, that it's okay to NOT output the most average answer!)

The output can still be controlled (for tool calling and so on) and it will follow whenever it can (because the LLMs are hard wired to comply with the main goal)

And in case it DOESN'T know, you get to see WHY it doesn't know! Build-In De-Bug XD Or, you just tell it to print "unsure" or "ERROR" or something a script can reliably read and check for instead!

It has already been tested with 3000+ calls on a professional harness (oh my pi) and they are now implementing a new boot-prompt for chats, changed over 80 wordings in their system and deleted nearly all high stakes wording.

Their Kimi 2.6 now runs faster and cheaper, glm 5.1 is faster, cheaper AND smarter(!), and they found no negative impact on ANY model, including GPT5.4/5.5 and Sonnet/Opus 4.6. They also found no gain for the things they tested with GPT and Sonnet/Opus... But we know from previous studies with basically the same approach as Gentle Coding that there are cases where "kind" (as they called it) 100% solved a 30+min toolcalling error for GPT 5.4, AND Sonnet/Opus 4.6 now finds 21 ADDITIONAL "bugs" it missed before. So, I'm happy about any form of review to expand the list.

I'm not in for money, I'm not affiliated with anyone!

Please, give it try and tell me how it worked for you!

Gentle Coding Github

[-]

sagiroth@reddit

Yup I found it very interesting last time you posted it here https://www.reddit.com/r/LocalLLaMA/comments/1tot20j/stop_traumatizing_ai_into_loops_and_turn/ and once I find time I'll have a deep read but I think it's fascinating how the prompt can influence the output and it seems more deterministic rather than the quant or type of model.

[-]

Danmoreng@reddit

Q5/4 KVcache is a terrible idea. Q8 already gives quality degradation, ideally kv cache should be kept in BF16

[-]

soyalemujica@reddit

WIth C++ and agentic coding, I have yet to experience any quality degradation at all, it works flawless with this dense model, not a single numeric change at high context or "hallucination" even

[-]

anthonyg45157@reddit

Any specific changes you're referring to or just in general I know there have been a lot of llama CPP updates recently

[-]

Anbeeld@reddit

Multiple VRAM optimizations merged in the last few days, enough to actually affect how much context it can fit.

[-]

anthonyg45157@reddit

Props!ty

[-]

soyalemujica@reddit

Something about the vram usage , including the bf32 to f16 one, just use latest you will see a huge difference

[-]

anthonyg45157@reddit

Ty sir

[-]

geek_at@reddit

did they finally implement turbo quant?

[-]

NicolaZanarini533@reddit

Been using it for about a month with 100k context and not having the same issues, but my workflow is different - I do not let it "trust context" to follow plans or make decisions outside the single reply - plans are always written to a file and it has to follow them. The system prompt has guidelines and references contracts that it might need. When doing anything, it should follow the saved plan, updated as it goes and commit along the way at every step it completes. If it starts drifting, I start a new conversation and thanks to the plan and "on disk tracking" it resumes right away where it left off. Is this somewhat slower? Yes, but it increases accuracy and success rates a lot, which is what I care about. Using it with Qwen-Code.
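The on-disk tracking described here could look something like the following sketch. The `plan.json` layout and the helper names are assumptions, not what Qwen-Code does internally; the per-step commit is left out.

```python
import json
from pathlib import Path
from typing import List, Optional


def save_plan(path: Path, tasks: List[str]) -> None:
    """Write the plan to disk so it survives a fresh conversation."""
    plan = {"steps": [{"id": i, "task": t, "done": False} for i, t in enumerate(tasks, 1)]}
    path.write_text(json.dumps(plan, indent=2))


def next_step(path: Path) -> Optional[dict]:
    """Return the first unfinished step, or None when the plan is complete."""
    plan = json.loads(path.read_text())
    for step in plan["steps"]:
        if not step["done"]:
            return step
    return None


def mark_done(path: Path, step_id: int) -> None:
    """Record a completed step on disk before moving on."""
    plan = json.loads(path.read_text())
    for step in plan["steps"]:
        if step["id"] == step_id:
            step["done"] = True
    path.write_text(json.dumps(plan, indent=2))
```

A new conversation only needs `next_step(Path("plan.json"))` to pick up where the last one stopped, so nothing depends on what the model remembers from earlier context.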

[-]

Limp_Classroom_2645@reddit

Qwen3.6-27B at Q6_K (~22GB on-GPU), 32k effective context

that's not okay for agentic use, must be at least 128k

[-]

sullenisme@reddit

stopped reading at 32k context... that's your issue.

have opencode optimize the model on your setup.

[-]

No_Swimming6548@reddit

Slop post

[-]

chocofoxy@reddit

if you want to fix the tool calling you have to fix the chat template serve the model on vllm or sglang

[-]

Extreme-Pass-4488@reddit

with vllm i had to download a fixed template, and then fix it, and fix the parser, and well... it eventually got good enough with tool calling . vanilla vllm for me is the worst on qwen3.6 tool calling.

some tweaking with temp and stuff also helped a lot

when i get enough gpus i wanna try some qwen 2.5, which are rock solid with tool calling, and qwen3.6 (lets hope 3.7) for actually coding and thinking.

also it seems that if u start your prompt with that infamous "you are qwen alibaba bla bla bla" it improves a significant amount, like 4-5% fewer tool calling errors on my small dummy test stack of 50 "do this you dumb shit i love you"

q5 works better to me than q6 . i dont know why.

what else.. there was some idea i had about testing json tool calls vs yaml vs xml ( qwen3.6 has some bias toward xml???? i need to write more tests to check the tool calling using xml. ) i did not press on that idea , not enough tokens on claude/gemini.

[-]

TechnoSmacked@reddit

Can you share your vllm fixes?

[-]

TheSlateGray@reddit

What harness did you run this all through?

How large was the system prompt / how much context space did you waste with tool definitions?

[-]

Interesting-Sock3940@reddit (OP)

ran it all through my own orchestrator (https://openyabby.com)

the system prompt plus the JSON schemas for the tool definitions ate about 2.5k tokens per turn

It's a heavy tax on a 32k context window but strictly necessary if you want to gate the execution layer

[-]

TheSlateGray@reddit

I haven't had a chance to look into your orchestrator yet, but I have to ask, why so much focus on JSON schemes for tool definitions?

Qwen is great at bash, python, etc, so forcing that into a json wrapper seems like it might have added a layer of inefficiency.

[-]

Confident_Ideal_5385@reddit

If you're seeing tool call failures at that rate something is out to lunch. I've seen qwen 3.x 27b hallucinate the existence of tools maybe twice ever, the usual failure modes are malformed json or emitting <|im_end|> instead of </tool_call>, which is fixable with a sampler grammar. And I'm using a Q5 quant with 128k sequence length, both of which should make my setup more prone to it than yours.

Does your tool schema need work, perhaps?

[-]

some_user_2021@reddit

Your AI should have added to the disclosure that it wrote it for you

[-]

TheTerrasque@reddit

via Ollama

Right. I think you might have better success with llama.cpp and an up to date unsloth model.

[-]

ex-arman68@reddit

It does not match my experience, but then Q6_K does not compare to F16, that is far too much degradation. Also I don't understand how you run a proper agentic workflow with 32k maximum context, let alone 12k or 14k!

Context wise, I noticed minor degradation approaching 100k, but it was still usable until 130k to 150k, after which it became too risky. I ran it with auto-compaction once 130k was reached, but most of the time I managed the context manually by compacting or resetting at the first good opportunity past 80k context use.

[-]

IONaut@reddit

You should try Q4_k_m at 100k context and see how that works for you.

[-]

fasti-au@reddit

tip , use that to do your PRDS as full speckit no prode and they unlead swap iq4 xs or nl into veellama add draft tand change prefil off add mtp model and preduct 5-7 in draft 3 in mtp and enjoy 200 tps out at 100% recal on coding 4 workers 64K. abover 64K recal drops of is a wall atm im hitting. im at some stupid values trying to work out why atm im retraining

[-]

StupidityCanFly@reddit

This is interesting. In my case the model is solid with structured JSON with llama.cpp and vLLM on both ROCm and CUDA on contexts of up to ~200k tokens (average is 100k-ish tokens). My deployments were tested using NVFP4 (CUDA), and Q4K_M/AWQ-Int4 (ROCm) quants, with FP8 (CUDA) and Q8/Int8 (ROCm) KV cache. I stuck with vLLM in the end due to it being faster.

Did you try running the workloads without ollama?

[-]

sagiroth@reddit

What's your use case? If code, how do you find the Q4? I see lots of people recommend not going below Q6

[-]

StupidityCanFly@reddit

Not code. I’m running agents for CRO audits. Basically DAG pipelines with some decision making and SSR scoring & buyer persona simulation. Depending on the task I set different temps, but that’s it. That plus a lot of calibrated grounding in the code. The 4bit quants don’t show any negative impact vs. 8bit. Having the KV cache below 8 bits falls apart fast.

[-]