Strix 4090 (24GB), 64GB RAM - what coder AND general-purpose LLM is best/newest for Ollama/OpenWebUI (Docker)?
Posted by AcePilot01@reddit | LocalLLaMA | 35 comments
Hello,
I was using Coder 2.5 but just decided to delete them all. I MAY move over to llama.cpp but I haven't yet, and frankly I prefer the GUI (although being in Docker sucks because of always having to log in lmfao, might undo that too).
I am looking at Qwen3 Coder Next, but not sure what others are thinking/using? Speed matters, but context is a close second, as are accuracy and "cleverness" so to speak, i.e. a good coder lol
The paid OpenAI one is fine, whatever their newest GPT is, but I'm not subbed right now and I WILL TELL YOU the free one is TRASH lol
NoShoulder69@reddit
localops.tech has a good list of what fits in 24GB - filter by your VRAM and it shows compatible models. Way easier than guessing.
pefman@reddit
I'm currently running this model on a 4090/14700K with 64GB.
Using a wrapper; my config is:
I'm getting about 28t/s and utilizing 22.8/23.9GB VRAM.
Any suggestions would be appreciated!
AcePilot01@reddit (OP)
Hmm, odd, wonder why mine is nearly half. Where do you get your measurements? For me, it's just the output at the bottom of the server-hosted web UI (not OpenWebUI), and when I am getting a response I see around 14.5t/s.
Just want to make sure we're comparing apples to apples.
pefman@reddit
I'm also running at very low wattage, like 100W... that might be a good thing though.
pefman@reddit
Ahh well, I used the chat UI and asked it to write me a book. Figured that would be as good a measurement as any :D
Trick-Force11@reddit
Qwen3 Coder Next with the unsloth dynamic Q4_K_XL GGUF is your best bet here. You will have to offload, but I'm sure you're fine with that, as it will still give good speeds as an 80B A3B model.
AcePilot01@reddit (OP)
A3B? I can never keep track of all the differently named versions of the 10,000 different models out there lol.
I'm a bit lost on the "with the unsloth dynamic Q4..." part.
_aelius@reddit
80B A3B - this means it's an 80 billion total parameter model, but only about 3 billion parameters are active per token. This model architecture is called "mixture of experts", or MoE for short. It's relatively new and it allows larger models to perform better (or at all) on consumer hardware.
Unsloth is a team/company that does fine-tuning and quantization of many popular models, like Qwen3 Coder Next. Whenever a new model comes out, many people eagerly wait for unsloth to release their GGUF variants.
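(If you want to grab one of those GGUFs without a GUI, it's usually a single huggingface-cli call. This is just a sketch - the repo and file names below are illustrative, check the actual unsloth page for the exact ones:)
huggingface-cli download unsloth/Qwen3-Coder-Next-GGUF \
Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--local-dir ~/llama_models/qwen3-coder-next-80b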
AcePilot01@reddit (OP)
OH ok, I almost grabbed the 30B then haha. I SEE. Ok, but then maybe I'm confused, maybe that's the 30B one?
What's the 80B one then? How would running the Q4 or Q5 of the 30B fully on the card compare vs the 80B slightly offloaded?
_aelius@reddit
The 30b is based on Qwen3 and the 80b is based on Qwen3 NEXT, Qwen's latest model architecture.
Honestly, I'd try them both.
With the 80b model try experimenting with the `--n-cpu-moe` flag in llama.cpp.
I can't speak to the differences between those quants. If it's the difference between fitting a model 100% on GPU or not, it's probably a big deal. I think people consider Q4 to be the sweet spot between accuracy and performance.
AcePilot01@reddit (OP)
-n as in a number, or that command exactly?
GGML_CUDA_GRAPH_OPT=1 ~/llama.cpp/build/bin/llama-cli \
-m ~/llama_models/qwen3-coder-next-80b/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
-ngl 24 \
--threads 26 \
--n-cpu-moe 28 \
-fa on \
--temp 0 \
--cache-ram 0 \
--color on
is my current command; I just bumped both -ngl and --threads up by 2.
tmvr@reddit
Remove the -ngl and the --n-cpu-moe parameters and use --fit-ctx instead with a value for the context you need. For example --fit-ctx 32768 for 32K context etc. That will distribute the layers, KV and context the optimal way between VRAM and system RAM.
AcePilot01@reddit (OP)
Oddly, a guy on here (on Windows at least) had -ngl 99 somehow and was getting faster speeds:
https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
Sure, that's for Windows, but converting that over for my llama.cpp on Linux - btw, how can I "configure" it? Is it just those command-line flags, or can I do any other config? What's the default context? How can I compare that context to, say, the paid GPT one, or its default?
tmvr@reddit
Not sure what you mean by "getting faster speeds". The -ngl 99 simply means pack everything into VRAM. In the case of a 24GB VRAM card this will spill over into system RAM (on Windows automatically). The -fit and --fit-ctx parameters prioritize the content that will benefit more from being in VRAM. As for how to use it, just modify the command as I said so it will look like this:
GGML_CUDA_GRAPH_OPT=1 ~/llama.cpp/build/bin/llama-cli \
-m ~/llama_models/qwen3-coder-next-80b/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--threads 26 \
--fit-ctx 32768 \
-fa on \
--temp 0 \
--cache-ram 0 \
--color on
That 32K for context was an example; the model supports up to 256K (262144), so you can try various values depending on what fits your needs. Of course, the more context you use, the more VRAM will be used and the more expert layers will be pushed to system RAM, resulting in slower token generation.
I don't know what context size GPT has tbh, never cared that much, but if you use Claude Code that defaults to 200K, so that is what you get with Sonnet or Opus.
AcePilot01@reddit (OP)
No reason to use a higher -ngl? Or without it will it do the same thing? How can I easily check how much VRAM I'd need for a given context?
Lastly, any reason for not using llama-server? (I heard it manages the RAM better?)
tmvr@reddit
There is no reason to use the -ngl switch at all if you are using -fit or --fit-ctx.
Didn't even notice you are using the CLI; I'm using llama-server to access the API endpoints remotely as well. For quick tests it also has a built-in web UI.
There is a formula to calculate it, but based on the level of questions I don't think you should worry about it. You can also have a quick look in LM Studio: when you load a model you have the option to manually adjust the parameters, and at the top it shows you the memory requirements, so you can just pull the context slider and see what the difference is as you go higher.
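(Rough back-of-envelope version, for the curious: for a plain transformer the KV cache is about 2 x layers x KV heads x head dim x bytes per element x context length. With illustrative numbers - 48 layers, 8 KV heads, head dim 128, f16 at 2 bytes, 32768 tokens - that comes out to 2 x 48 x 8 x 128 x 2 x 32768 bytes, roughly 6.4 GB, and a q8_0 KV cache roughly halves that. Qwen3 Next uses a hybrid attention design, so its real KV footprint is smaller than this naive formula suggests; treat it as an upper-bound estimate, not exact numbers for this model.)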
AcePilot01@reddit (OP)
So I actually found something interesting: my tokens/s went up when I lowered my threads to 18, which apparently is a sweet spot. May be a bug in llama.cpp. I'm at about 15t/s when generating at least simple code.
GGML_CUDA_GRAPH_OPT=1 ~/llama.cpp/build/bin/llama-server \
-m ~/llama_models/qwen3-coder-next-80b/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
-c 32768 \
-fa on \
-ngl 23 \
--threads 18 \
--temp 0 \
--cache-ram 0 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--host 172.17.0.1 \
--port 8080
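(Side note: once llama-server is up, a quick way to sanity-check it from another machine is a plain HTTP call to its OpenAI-compatible endpoint - something roughly like this, using the host/port from the command above; the prompt and fields here are just an example:)
curl http://172.17.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'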
With Discord and Firefox (a few tabs at least) open, I can't get to 24. I bet if I closed them I would get a few more layers on there, WHICH, eh, I may be ok giving up for that, just so I can DO something while it's generating, as a few more tokens a sec isn't going to make or break it I guess.
I do wonder if specifying the cache type is helpful or not. I may test that later. So for now I'll stick with that lower context, but I may bump it up now and see; I haven't even used a full 32K yet, so I'll increase it as I need to and see if it causes issues when I get to that size.
AcePilot01@reddit (OP)
Btw, the level of questions was mostly me being pulled in a few directions at the time, tired and frankly a tad exhausted haha, so they DEF came out a bit "smooth-brained" lol.
I still like to try to be effective and efficient here, so I would be curious how that formula works. By no means am I expecting you to explain it, but finding the best params would be ideal I think, and also from a curiosity standpoint (sometimes diving DEEEEEP into things, even though overwhelming at times, is more exciting and motivating to sift through a bit and learn some). ADHD and all.
Look_0ver_There@reddit
There's an older Qwen-Coder model that is 30B, released mid-2025-ish, and then there's a newer Qwen-Coder-Next that's 80B and was released about 2 weeks ago.
AcePilot01@reddit (OP)
Ok, got it installed and still working on which commands/methods to run it with (the 4-bit 80B).
AcePilot01@reddit (OP)
On their Hugging Face I only see the Qwen3-Coder-Next-GGUF.
p_235615@reddit
For 24GB you want to look at magistral:24b, devstral2:24b, qwen3-coder:30b and glm4.7-flash:30b, all of them with q4 quantization.
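(Those are Ollama-style tags, so if you're staying on Ollama, grabbing one should just be something along the lines of:)
ollama pull qwen3-coder:30b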
ABLPHA@reddit
Why is everyone trying to cram A3B models fully into VRAM? Qwen3 Coder Next runs at 20t/s at UD-Q6_K_XL with the experts on CPU, consuming a mere 11GB of VRAM with a full-precision KV cache at 262144 tokens of context.
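(For reference, an experts-on-CPU setup like that would look roughly like this in llama.cpp; the file name and context value below are illustrative, not a drop-in config:)
~/llama.cpp/build/bin/llama-server \
-m ~/llama_models/qwen3-coder-next-80b/Qwen3-Coder-Next-UD-Q6_K_XL.gguf \
-ngl 99 \
--n-cpu-moe 99 \
-c 262144 \
-fa on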
p_235615@reddit
Because there are significant speed penalties if you don't. For example, gpt-oss:20b usually fits fine in my 16GB VRAM and does ~80t/s on my HW. When I also loaded a whisper model first and just 2 layers of gpt-oss:20b went to RAM, I got only 23t/s. That's a drop of almost 3/4 of the inference speed. Is it usable? Sure, but the wait times got quite annoying.
My server is still on 2666MT DDR4 + an older CPU, and such larger 80B MoE models usually drop to <10t/s. And that is totally useless for anything interactive.
tmvr@reddit
Yes, the speed drops considerably when most of the expert layers are in system RAM, but getting 25-40 tok/s from that 80B model (or also from gpt-oss 120B) is still a far cry from getting low single-digit tok/s from dense models that spill over into system RAM.
p_235615@reddit
Well, on my home "server" with a Ryzen 3600, 64GB of ECC 2666MT RAM and an RX 9060 XT 16GB, it's down to single digits with qwen3-coder-next:80B, despite it fitting in RAM+VRAM with no swapping...
On a more recent system it could be faster, but you're still taking a severe speed hit. I have access to a workstation with an Intel 285K, 128GB RAM and an RTX 6000 PRO 96GB, where you can load the full gpt-oss:120b; it does 182t/s, and qwen3-coder-next does 115t/s. So at 25-40t/s you are still getting 1/4 of the speed or less. I tried some 200B+ MoE models there, but they are also down in the ~20t/s range, which is fine for a single non-interactive user, but that system is serving multiple users, so the inference speed has to be quite high so it's not a pain to use.
tmvr@reddit
I've just tried the new llama.cpp build:
https://www.reddit.com/r/LocalLLaMA/comments/1r4hx24/models_optimizing_qwen3next_graph_by_ggerganov/
The improvements are nice. It gets 43-46 tok/s with a 4090 and DDR5-4800 RAM depending on the context size. Starts off at 43 tok/s with 128K context.
I have a second machine with 2x 5060Ti 16GB, but I can't replicate your config unfortunately even with limiting CUDA to a single device because I only have 32GB RAM and that's not enough for the Q4_K_XL version. I'd have more VRAM bandwidth (448 vs 322 GB/s) but lower system RAM bandwidth (2133 vs 2666 MT/s RAM), but I would still expect around 20 tok/s performance there.
I don't do multi-user so the performance is just for me, and yes, it is very easy to get used to the 180-200 tok/s with Qwen3 Coder 30B, but I still find it OK to use gpt-oss 120B with 25 tok/s and that one has thinking. At least with Qwen3 Coder Next it does not "waste" time/tokens on thinking so that 43-46 tok/s is even better than it would be with gpt-oss 120B for example.
AcePilot01@reddit (OP)
How can I check if I have the new build? I just installed llama.cpp, but I think it was from a repo.
tmvr@reddit
Just download the release you want from here:
https://github.com/ggml-org/llama.cpp/releases
b8853 is the one that has the speed improvements for Q3 Coder Next.
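(To check which build you actually have, the binaries can report it themselves; something like this should print the build number:)
~/llama.cpp/build/bin/llama-server --version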
AcePilot01@reddit (OP)
That's fine if you are just talking to it, but the moment you have it parse and then actually code, it can take a few minutes to generate a few hundred lines... fast for a one-time thing here or there, but if you are making edits, etc., that's going to add up to a notable % of your time tbh.
Just ask it to "make a game" and see how long it takes to get the full code out.
AcePilot01@reddit (OP)
Actually it seems to be running fine tbh, at least for day-to-day stuff (I'm not a coder, so it isn't work either), although no real comparison yet.
Also haven't tweaked how I'm running it.
Whiz_Markie@reddit
What tokens per second are you getting?
AcePilot01@reddit (OP)
What's the best way to check, do you think? It was fluctuating a bit based on what I asked it; the prompt could go to 300 or less, but the reply seemed to be around 10? Didn't test too hard yet though, also have not optimized anything yet.
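(One option: llama.cpp ships a llama-bench tool that gives repeatable prompt-processing and generation tok/s numbers - roughly like this, with whatever offload settings you normally run:)
~/llama.cpp/build/bin/llama-bench \
-m ~/llama_models/qwen3-coder-next-80b/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
-ngl 23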
p_235615@reddit
Since you mentioned qwen2.5-coder in the text, I assumed it's mostly for coding... For chatting and general stuff you should probably go with the qwen3-next non-coder version... It's a bit better at general conversation stuff.
zpirx@reddit
With a Q4 quant it runs fine on a 4090 (~30 t/s). Haven’t tested the latest llama.cpp build yet but it should be 10-15% faster for Qwen3 Next. And right now it’s easily one of the strongest models for coding and general use.