Recommended parameters for Qwen 3.6 35B A3B on an 8GB VRAM card and 24GB RAM?
Posted by FUS3N@reddit | LocalLLaMA | View on Reddit | 21 comments
I was running Q3_K_S with 90k context and was getting 21 tok/s, which drops to around 19.5 after a few messages (I am using mmproj-F16 as I need vision for some tasks) and keeps slowly decreasing. Is there any way to get a bit better performance while keeping the high context size, or is that not the issue?
My current params:
llama-server -m model --mmproj mmproj --jinja -fit on -c 90000 -b 4096 -ub 1024 -ngl 99 -ctk q8_0 -ctv q8_0 --flash-attn on --n-cpu-moe 38 --reasoning off --presence-penalty 1.5 --repeat-penalty 1.0 --temp 0.7 --top-p 0.95 --min-p 0.0 --top-k 20 --context-shift --keep 1024 -np 1 --mlock --split-mode layer --n-predict 32768 --parallel 2 --no-mmap
I only started using llama.cpp directly recently, so I still don't know all the params or what most of them even do (there are so many). I just looked up and gathered as many params as I could and mashed them together to make the above, so I don't even know if these are the right settings for my setup or if it could be better.
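For reference, here are my best guesses at what each flag in my command does (some of these I'm still unsure about, so corrections welcome):

```shell
# -m / --mmproj        model file and the vision projector that goes with it
# --jinja              use the chat template embedded in the model
# -fit on              (newer flag) fit unset settings to available VRAM
# -c 90000             context window size in tokens
# -b 4096 -ub 1024     logical / physical batch sizes for prompt processing
# -ngl 99              offload up to 99 layers to the GPU
# -ctk/-ctv q8_0       quantize the KV cache keys and values
# --flash-attn on      enable flash attention
# --n-cpu-moe 38       keep the MoE expert weights of 38 layers on the CPU
# --reasoning off      disable the model's "thinking" output (I think)
# --temp / --top-p / --min-p / --top-k / penalties   sampling settings
# --context-shift      when the context fills, shift it instead of stopping
# --keep 1024          keep the first 1024 tokens when shifting
# -np / --parallel     number of parallel slots; these are the same flag,
#                      so -np 1 --parallel 2 conflict and one should go
# --mlock / --no-mmap  lock the model in RAM / don't memory-map the file
# --split-mode layer   split the model across devices by whole layers
# --n-predict 32768    max tokens to generate per response
```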
DonkeyBonked@reddit
I unfortunately can't help with the RAM optimization at that level as I'm running a bit bigger version. So I don't know if my settings will really help you at all, but here's where I'm at right now.
Here's what I'm currently using. I'm not entirely sure I've optimized it well, but it has been testing reasonably well. (I'm still testing, and the one I'm testing is 43.6 GB.)
I currently have the context limit in Cline set to 262K, but I'm actually looking to see if I can push that. I noticed at ~200k I still had roughly 42 of 96 GB of VRAM available, so it's more just a question of how it performs.
My previous settings were:
With my old settings, I noticed some really poor decision-making with tools, such as deciding to delete its own resource folders that I had explicitly instructed it to treat as read-only and not modify, which it blamed on ridiculous stuff and basically called an accident. Some of my previously undeclared params, like -b and -ub, were automatically defaulted to lower values.
FUS3N@reddit (OP)
Thank you, I will mix and match and see what works best for me along with the other recommendations.
DonkeyBonked@reddit
No problem, not sure if you caught my edit, but:
For your setup, you should know that -b 4096 -ub 1024 raises VRAM pressure, which matters when you're operating under tight constraints.
Also: --fit only adjusts unset arguments, and it's specifically meant to adjust context, so --fit on allows it to adjust dynamically. In your case, with restricted VRAM, I would use --fit on and remove -b 4096, -ub 1024, and -c 90000.
I think with your VRAM it might be good to let it adjust dynamically.
At the very least I would test with and without against one another and see how they handle higher context. My concern would be especially around multi-modal processing close to context limits with the way you have it now.
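Roughly, the trimmed-down version I'd test first would look like this (untested on your hardware, so treat it as a sketch; everything else stays as you had it):

```shell
llama-server -m model --mmproj mmproj --jinja \
  --fit on \
  -ngl 99 -ctk q8_0 -ctv q8_0 --flash-attn on --n-cpu-moe 38 \
  --reasoning off --presence-penalty 1.5 --repeat-penalty 1.0 \
  --temp 0.7 --top-p 0.95 --min-p 0.0 --top-k 20 \
  --context-shift --keep 1024 -np 1 --mlock --split-mode layer \
  --n-predict 32768 --no-mmap
# dropped -b 4096, -ub 1024, and -c 90000 so --fit can size them to your VRAM;
# also kept just -np 1, since -np and --parallel set the same thing
```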
FUS3N@reddit (OP)
Wait, so --fit also adjusts context? And yeah, I guess I'll remove -b and -ub, better to let it handle those. Context size is kind of important for me; I've seen that with lower context I either hit the limit, or it just hallucinates if some other method was used.
DonkeyBonked@reddit
You should look up all the parameters for --fit; it can be used in a lot of ways, and yes, context is part of that: it can work within your VRAM and determine when you offload to the CPU.
Of course, I'm still working on a bit of that myself. I was testing Qwen 3.5 122B with TurboQuant and have found a lot of ways to crash my rig, so don't trust my settings there.
I know --fit isn't good at taking into account the stuff you pre-determine, but I'm not sure which of the ways of using it are reliable/stable.
It seems like --fit isn't always the most stable, and with declared parameters it becomes less stable. So if you have your settings mathed out perfectly, --fit may be a burden, but it's certainly helpful for dynamic adjustments and for figuring out what works.
ehiz88@reddit
Hey Donkey, where can we read about all the llama parameters? I'm trying to optimize my setup as well, it seems like no one really has the "best" parameters for qwen 3.6 models yet.
Long_comment_san@reddit
I would also try Qwen 3.5 9b dense. Could be a good competition.
EmoticonGuess@reddit
I'm using the "Uncensored" version (because it uses less memory). On my MacBook Pro with 24GB it manages 60 tok/s.
FUS3N@reddit (OP)
Would I get better speed though, beyond having more room for context? Is it generally better to run a dense model and offload to CPU, or a MoE like this model? I think I can still fit a Q4 of the 9B on my GPU, but I don't know if that's the best option considering how much context I'd need, or in terms of quality.
DunderSunder@reddit
I can fit Qwen3.5-9B-UD-Q4_K_XL with 70k context I think.
Long_comment_san@reddit
90k context is quite brutal on 8GB VRAM. Look into turbo/rotor quant models, like Gemma 4 26B (a Qwen 3.6 one should also be coming out as we speak).
Protopia@reddit
ik_llamacpp is supposed to be better for small VRAM and hybrid GPU/CPU inference.
FUS3N@reddit (OP)
After going through a lot of stuff I managed to build ik_llamacpp for CUDA on Windows, but it was slower for me; it went down to 17 t/s. I mostly used the same params. I couldn't use fit and n-cpu-moe together, but I tried both separately and still had the same issue. Not sure if I'm supposed to give it more specific params to make it better.
Song-Historical@reddit
I would also like to know what a good context window and workflow I could manage for coding on hardware like this.
DonkeyBonked@reddit
The context you need really depends on the code base you're working on.
I can tell you that with what I'm working on right now, the OP's context of 90k would not work for me.
My advice would be: if you're on restrictive hardware, then until the main fork of llama.cpp officially supports TurboQuant, use a fork that does support it and try TQ_3 to push the context as far as you can.
Context is really important as your code grows, because if it can't look at the whole chain of the code it's working on, then it has to hallucinate what it thinks that missing code does. Sometimes it has enough of the logic in memory to do okay, but it will always perform better the more of the code it can see.
I'm currently finding the 262k ceiling a bit too low for me and looking to test opening it up and pushing it further.
If you just want in-line code completion or occasional script writing, lower is fine. But if you want it to really contribute, let alone write larger code, you have to consider what you'll be looking at in the end, because a 10k-line app and a 50k or 100k+ line app are not the same challenge for AI to work with.
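As a rough back-of-envelope (the tokens-per-line figure here is just an assumption for illustration; real code varies a lot):

```shell
# assume ~13 tokens per line of code on average (assumption, not measured)
lines=10000
tokens_per_line=13
echo $(( lines * tokens_per_line ))   # 130000 tokens just to hold a 10k-line app
```

So even a 10k-line app on its own can blow past a 90k window before the model has written a single token, which is why the ceiling matters so much for larger projects.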
Song-Historical@reddit
Yeah this stuff is moving so fast it's getting hard to commit to a workflow. I didn't know those forks existed, a lot of the llama.cpp stuff I haven't attempted yet but I guess I have to.
Do you know if there's a guide somewhere you'd recommend for someone with some programming experience and a high-level understanding of what's going on, to really get the nomenclature down and get started fine-tuning models, working with different quantization/dequantization, doing text embedding, etc.? I started Karpathy's Zero to Hero course, but it'll take a while before I understand exactly what the limits are and what I should be trying.
DonkeyBonked@reddit
I would start with Unsloth, they have been the best for me, but I won't pretend I've been able to keep up with it all myself. As it is I'm spending hours every single day taking it all in, and while my understanding of what I take in is perfectly fine, there are so many things moving in so many directions that it's like dropping a bag of marbles on the ground and trying to keep track of which way they're all rolling. Some you can spot dead ends, but some of the paths are just surprising.
Like, I honestly thought when Nemotron 3 dropped that its architecture for context efficiency was going to dominate, and I'm still waiting to see where Nvidia takes that. But then Google dropped TurboQuant, which is frankly incredible from what I've been able to test, so now I'm not so sure. Meanwhile llama.cpp has been very careful, so they still haven't adopted TQ support into the official branch.
For training, I would 100% start with Unsloth. Start by attempting to fine-tune very small models, under 1B, because you'll learn from it, and you want that learning to be fast rather than waiting forever and finding out days later that what you did broke the model's ability to speak English or something like that.
https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
I personally find that using (q)(re)(si)LoRAs is probably a better start than outright fine-tuning, which you can also learn and do with Unsloth. Then I would go from LoRAs to fine-tunes and, depending on how familiar you are with these, look at bridging from there to things like MCP servers, RAG databases, and skills.
They all really interconnect, so the direction you want to go will be very use case specific.
But just as an example: I'm using my own custom database format for storing my training data, built specifically for my use case, with datasets I've curated from a mixture of data available on Hugging Face, my own personal code base, and engine-specific data. If you need general knowledge the model is bad at, fine-tuning may be the path. If you need the model to know when to use that knowledge, you may need to add experts to MoE models. When you need the model to follow certain rules more tightly, such as respecting an engine's version history or avoiding deprecated code, a properly built LoRA with the right meta and hard-structured data paths is typically better. When you need it to work with code that's too big or complicated for the model to ingest in one shot, or you're working at a scale unforgiving of models' normal error rates, it's a good idea to tie it into a RAG database. And if you want it to work with an evolving API specifically, then an MCP server is likely more appropriate.
This area, especially when you're using or customizing models for coding, is a bag of cats. There's so much to take in that you could just plug yourself into tech feeds like Neo learning the Matrix and you'd find it's never safe to come unplugged, because the moment you do and you focus your attention on one area, ten others just made path changes you now know nothing about.
Just life in the rapidly evolving landscape of world-changing tech. In some ways this is like the modern high-tech version of the .com boom, when the internet exploded, search was barely keeping up with the evolution of those exploiting it, web technologies were going in every direction, Java was a weapon, and Flash was a vulnerability people thought was going to be the future... but then, when you think about it and follow this all back to BERT, you realize this is really just the same thing; we're just following the ~2019 deviation of search technology.
Either way, it's one heck of a ride.
Longjumping_Virus_96@reddit
I would aim for 10 t/s with Q4 quantization.
AVX_Instructor@reddit
I'm using `Qwen3.6-35B-A3B-UD-IQ3_XXS` and it works pretty well on my RX 780M and 32GB RAM (I get 200 t/s for prompt processing and 20-25 t/s for output).
Icy-Degree6161@reddit
Curious what your experience with that quant is. I have the same specs on my mini PC, but I never went below IQ4_XS, even though it gets a lot slower.
AVX_Instructor@reddit
It can write simple Python/Bash scripts (I also use it as an agent in OpenCode), sort files, or search for information online. Tool calls work without problems in most cases (95% success rate, according to my tests). I can't run a higher quant on my system because processing speed drops by 30-40 percent for only a slight increase in quality.