125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar

[-]

PairOfRussels@reddit

I've been trying to get a project going that would actually build something useful. Even with qwen 27b q5 it still struggles to complete a simple Android app and server. And trying to build a tdd continuous delivery pipeline is also not working out. So the output is half cooked and the process is sloppy weeks later.

So if your setup is building something useful can you tell that story? I'd even take 10tps if it built something well.

[-]

miversen33@reddit

Depending on your hardware, I've going that Q6 is pretty solid with 27B. I'm personally using Q8 with Q8 KV cache to build, and Gemma 4 26B Q8 to plan. For "useful project", those 2 together (with me guiding of course) have have converted my entire disparate (and manually managed) infrastructure to IaC (though Claude built the terraform pipe itself). As of just this morning I had it build and configure a brand new forgejo runner for my forgejo stack (which it also built out and automatically integrated with both my Authentik SSO layer and my reverse proxy).

You can do cool things with the higher quant models :) but it does seem to require quite a bit of scaffolding to get there, which I had to build with Claude.

I tried to release this same stack which works well with my IaC project it built, on a project I'm building (that I started before I had any of this running) and it can work but it struggles a lot with missing context and more fully understanding everything.

I know that's a me problem, but Claude has been able to help quite a bit from the same starting point so I still need to identify the missing holes.

In any case, Qwen 3.6 27B is an amazing model, especially mixed with MTP and higher quants

[-]

r00x@reddit

What hardware are you running those models on though? I can't get away with much more than Q5 anything in the 27-35B range before I'm running out of VRAM.

[-]

miversen33@reddit

For Q8 at 128K Q8 KV cache, I was able to comfortably run on dual 7900XTXs. I am currently running Q8 27B at 256K Q8 KV cache across 3 7900XTXs right now.

My setup is designed to only run 1 model at a time and only run one slot at a time. So I do need to hot load models if I need to use Gemma 4 along side which adds some latency (currently loading up a model plus it's context takes a few minutes). I've got a few P40s that I would really love to use as well but they kinda suck (in relation to the 7900XTX) so they've been relegated to tasks where performance doesn't made (or other GPU tasks entirely)

[-]

Glittering-Call8746@reddit

Which backend are u using ? Vllm ? Settings pls

[-]

miversen33@reddit

llama.cpp. I have done a reasonable amount of testing to end up where I am currently. Below is my dockerfile (running a custom version of llama.cpp called Lemonade), docker compose service and ini file. May be useful, may not. Either way, enjoy lol

Dockerfile
FROM ubuntu:24.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    gcc \
    unzip \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Create video and render groups to match host-side GIDs for GPU device access
RUN groupadd -g 44 video || true \
    && groupadd -g 104 render || true

WORKDIR /opt/llama-cpp

RUN wget -O /tmp/llama-rocm.zip "https://github.com/lemonade-sdk/llamacpp-rocm/releases/download/b1269/llama-b1269-ubuntu-rocm-gfx110X-x64.zip" \
    && unzip -o /tmp/llama-rocm.zip -d /opt/llama-cpp \
    && rm /tmp/llama-rocm.zip \
    && chmod +x /opt/llama-cpp/llama-bench \
    && chmod +x /opt/llama-cpp/llama-cli \
    && chmod +x /opt/llama-cpp/llama-server

ENV LD_LIBRARY_PATH=/opt/llama-cpp:$LD_LIBRARY_PATH

ENTRYPOINT ["/opt/llama-cpp/llama-server"]


Docker Compose
services:
  llama-rocm:
    build:
      context: .
      dockerfile: Dockerfile.llama-rocm
    image: llama-lemonade-custom:rocm-b1269
    container_name: llama-rocm
    restart: unless-stopped
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    ports:
      - "8080:8080"
    volumes:
      - type: bind
        source: /opt/models
        target: /opt/models
        bind:
          propagation: shared
      - type: bind
        source: /opt/llama-cpp/models-rocm.ini
        target: /opt/llama-cpp/models-rocm.ini
        read_only: true
      - /var/llama-cpp/rocm:/save
      - type: bind
        source: /opt/local-llms/models
        target: /opt/local-llms/models
        read_only: true
    environment:
      HSA_OVERRIDE_GFX_VERSION: "11.0.0"
    mem_limit: 8g
    memswap_limit: 8g
    command: >
      -ngl auto
      --sleep-idle-seconds 3600
      --host 0.0.0.0
      --port 8080
      --reasoning on
      --kv-offload
      --slots
      --metrics
      --slot-save-path /save
      --models-preset /opt/llama-cpp/models-rocm.ini
      --models-max 1


Models Rocm
#SPDX-License-Identifier: MIT-0
[*]
jinja = on
ctx-size = 16384
batch-size = 2048
ubatch-size = 2048
cache-type-k = q4_0
cache-type-v = q4_0
flash-attn = on
cache-prompt = true
threads = 16
#checkpoint-every-n-tokens = 8192
#ctx-checkpoints = 128

# ---------------------------------------------------------------------------
# Qwen3.6 - Dense general chat, always 2 GPUs (22 GB weights)
# ---------------------------------------------------------------------------
[Qwen3.6-Dense-MTP:Rocm]
model = /opt/local-llms/models/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q8_K_XL.gguf
ctx-size = 262144
tensor-split = 1,1,1
batch-size = 12288
ubatch-size = 512
parallel = 1
temp = 0.6
min-p = 0.0
top-p = 0.95
top-k = 20
repeat-penalty = 1.0
presence-penalty = 0.0
reasoning-budget = 5120
spec-type = draft-mtp
spec-draft-n-max = 2
cache-type-k = q8_0
cache-type-v = q8_0

[-]

GroundbreakingTea195@reddit

I have the exact same setup and Vulkan is working way better for me. Any experience with that?

[-]

miversen33@reddit

I found that lemonade performed ~60% better than llama.cpp (ROCm). In my same testing I found that llama.cpp (ROCm) was slightly more performance than llama.cpp (vulkan).

I should try it again though

[-]

GroundbreakingTea195@reddit

Thank you! I tested your configuration:

ROCm looks clearly faster than Vulkan for prompt processing on my 2x RX 7900 XTX setup with Qwen 27B Q4_K.

In "llama-bench", ROCm did around 541 t/s at pp128, 780 t/s at pp512, 846 t/s at pp2048, and 1081 t/s at pp4096. Vulkan was around 432, 588, 606, and 593 t/s for the same prompt-processing tests.

So for context ingestion, ROCm is roughly 25–80% faster depending on prompt size, with the gap getting much bigger at larger prompts.

For token generation, my earlier "llama-server" test had both backends much closer: Vulkan was around 38.3 t/s and ROCm around 38.0 t/s on a 500-token generation. So generation speed seems basically tied so far, while ROCm clearly wins on prompt processing.

I still want to do more proper testing, especially with longer contexts and cleaner token-generation benchmarks, but I didn’t have enough time yet.

[-]

miversen33@reddit

Awesome! To be clear, the perf uplift here is from lemonade which is a (as I understand it) a nightly build of llama.cpp with specific amd optimizations. It is at least partially maintained by AMD employees. Why they haven't up streamed the optimizations to llama.cpp directly, I am unsure.

But ya I found the biggest uplift was on prompt processing. In generation there wasn't a huge difference but that's fine because prompt processing is where the pain is anyway

[-]

ManySugar5156@reddit

3x 7900XTXs is kinda insane ngl. hot loading models sounds annoying as hell tho

[-]

miversen33@reddit

Lol 48gb of VRAM (2 24Gb cards) is certainly enough to run a single Q8 27B if you accept 128K kv cache as your limit.

I can't speak to the hardware you specifically have but it seems most people shoot for 24Gb of vram per card they are using.

[-]

Tartooth@reddit

You trying to have it do it in a one shot or you spec'ing out the project and feeding it peicemeal

[-]

PairOfRussels@reddit

I gave it simple high level specs then started filling in details in small incremental stories. It doesn't do well with workflow even with subjects and all that in opencode. it did better when trying to do large segments as one shots, but when trying to get it to do tdd or work via a CI pipeline it fell apart.

[-]

TheDukeDaniel@reddit

to me it doesn't make sense to buy hardware for Qwen3.6 27B when deepseek v4 flash has better coding bench and 1m context window for basically .42 cents aggregate 1M tokens input/output. over 30 days with ive only spent about 26$ mix usage between pro and flash. lets say you buy a 3090 24gb Vram 1300$ thats takes you 50 months to recoup yours costs. If you have older hardware the math is different but just doesnt make sense to me to use 27B anymore and struggle with low 30-50 TPS

[-]

Confident_Ideal_5385@reddit

Depends on your usecase to a degree, too. With an API model - you're locked into their chat format/parser - you have minimal control over sampling - you have no ability to attach low rank adapters or otherwise adjust the weights - you have no ability to do grammar constrained generation (which i guess is just a special case of sampling, but eh)

If you're fine with these constraints and don't mind the API provider training on your input data, APIs may make sense.

[-]

miversen33@reddit

The hardware required to run Qwen3.6 27B vs Deepseek V4 flash are extremely different.

Your argument is basically "why self host when you can run in the cloud?". And ya, it's a valid argument, but not one that will get much support on a subreddit dedicated to running models locally

[-]

TheDukeDaniel@reddit

very true, i'm not one to talk I have a MS-S1 MAX 128gb with Aoostar AG3 TB5 + R9700................. so i could be called a hypocrite lmao

[-]

PairOfRussels@reddit

2 things...
1 - my usage patterns change when you're paying by the "token" vs unlimited usage.
2- I guess I just don't like being the product. Your data and work becomes part of that host's IP and future model training with that option, doesn't it?

[-]

Chuyito@reddit (OP)

> Your data and work becomes part of that host's IP

+1. The reason why many choose or need to self-host. For 50-80% of my day to day quant coding, qwen 3.6 + open webui replaced the need to go to a frontier model for the month of may.

This isnt so much about getting hermes to play snake or tetris without bruning $500M in tokens, sure thats fun and all.. its about private but useful LLM for more sensitive IP.

[-]

TheDukeDaniel@reddit

yea privacy is a big part I agree, but does that mean deepseek has copyright claim to all info used with the API. I doubt it

[-]

NineThreeTilNow@reddit

Your data and work becomes part of that host's IP and future model training with that option, doesn't it?

This depends on who serves the model.

Via API most Western labs will not train on data. This is for a large number of reasons.

Kimi explicitly says they will train on data in their privacy policy.

If Kimi is delivered via a 3rd party that doesn't train on data, that's different.

I don't know the specifics of Deepseek but it's the same if hosted by a 3rd party typically.

These 3rd parties normally host at a MASSIVE cost increase though.

People using Gemma 4 31b can sign up with Google directly to use it via API without training from what I remember. You might double check their data policy but I'm fairly certain API doesn't apply. Their "free" usage or usage with a credit card on file are pretty generous. Then it's not quantized and runs fast off Google's servers.

One reason a lot of these labs have turned off training on API is benchmarking issues. There are a lot of 3rd parties benchmarking these models and they need assurance that the benchmark isn't in training.

For benchmarking the open models, the people doing the benchmarking typically use the other providers mentioned.

[-]

TheDukeDaniel@reddit

I guess but to be fair unlimited usage and .48cents is almost the same thing. But I agree about being the product. If you have to keep things offline for privacys sake. I would go qwopus 3.6 35B + mtp. As long as you have memory bandwidth of 400GB/s you'll get at least 100 TPS. And the swe bench is the same as qwen 27b Q4 i believe. Then have qwen 27b Q6 quality check the code afterwards over night. But you would need a B70 or R9700 which puts u back in the same money situation position of performance/dallors

[-]

baseketball@reddit

what coding harness are you using with deepseek

[-]

TheDukeDaniel@reddit

Claude desktop and hermes

[-]

Evanisnotmyname@reddit

Don’t you need a subscription to use Claude desktop?

[-]

TheDukeDaniel@reddit

well yes and no, you can alter claude desktop to use a local LLM or use an API. you just have to change the environment variable in the config.json. it takes quite a bit + developer mode but i just linked a couple githubs to hermes and told it where the application folder was and it did it for me.

[-]

tamerlanOne@reddit

Se la privacy non è un problema può essere una scelta vincente

[-]

LORD_CMDR_INTERNET@reddit

At least Q6 and no quant kv cache and the output is extremely high quality. otherwise you’ll be fighting with it constantly

[-]

PairOfRussels@reddit

Ok. I was running on a single P40 with Q5 and barely having any room for context... Q6 wasn't an option really that way. I am going to give it a shot now with Q6 tensor-split between my 3080 and P40. I should get only 12t/s generation but if it's QUALITY then I'll live with it. Once my app earns $5000 I'll buy better hardware...

[-]

TheDukeDaniel@reddit

That's the thing is the roi on llms is wicked low. The chances of making back your funds is bad. You almost have to have a business plan set up first before you dive in...... Or just say F it and spend all the moneys lol

[-]

Client_Hello@reddit

27b Q6_K and f16 kv, 96k context, claude code

Built a markdown to docx conversion utility to help with publishing things to sharepoint.

Created lots of scripts for automating parts of workflows. A lot of modernization things, like replacing file copies or smtp with API calls, adding logging, securing passwords.

Created an API reference guide entirely out of python scripts for an app with poorly documented API, then used that to make a new app.

[-]

Ok-Measurement-1575@reddit

Quanting kv cache?

[-]

jaybsuave@reddit

ngl i use gemmae4b and its really really good

[-]

Chuyito@reddit (OP)

There's a ton of data that I work with for my tiny startup that I cant just drop into a frontier model.

For me I'd say from llama up to qwen 3.5 was more of a tinkering hobby.. but not productive useful.

For all of May I used qwen as my daily driver, and fell back to gpt/grok for maybe 2/20 services that I had to patch... E.g. things like schema evolution, where you have some python script reading some API.. and they change the response. local llms can finally handle it.

And I dont mean hermes or claw, i wouldnt let those near prod with a 10 foot pole. I mean being able to actually have an offline llm that is useful

[-]

Nnyan@reddit

one of the reasons I haven't build my own LLM server is the contrast between hardware performance. I figure I'll build something small first as a POC and something to learn how to put one together (software not the hardware). I was thinking dual 5060Ti 16gb or dual V100 16gb pcie (or even a single V100 32gb pcie). While I know that the V100 has limited support for modern options and is older hardware it still seems to perform pretty well against lower more modern hardware at the lower end of things. I'm not doing any rocket science just some coding and RAG.

[-]

tmvr@reddit

You should just go for the dual 5060Ti 16GB if you find some for good price and have a board to put them in. The whole maturity "debate" for Blackwell is for the NVFP4 support which is realistically irrelevant for you. It already worked fine with the older llama setup and GGUFs, but you didn't really make use of the dual compute due to the lack of tensor parallel processing in llamacpp, which is not fixed. You could do that as well before using vLLM, but llamacpp is much more user friendly tbh.

[-]

Nnyan@reddit

Thanks! I'll keep my eyes out for any decent deals on the 5060ti.

[-]

kiwibonga@reddit

I have 2x 5060Ti so similar boat, and I just got the 610 drivers with 13.3 working, meaning I should finally, theoretically, have NVFP4 and MTP. And right now I want to murder my computer. Every path still has a bug. I have the most vanilla shit ubuntu distro and the most vanilla current gen GPUs. HOW IS IT ALL STILL SO BROKEN.

[-]

LORD_CMDR_INTERNET@reddit

It’s your quant. Q6+ for this model and don’t quant your cache. It performs extremely well but severely degrades at less than that

[-]

Force88@reddit

Even q8_0 cache is still bad?

[-]

LORD_CMDR_INTERNET@reddit

yes, any kv cache quant is a severe degradation

[-]

tmvr@reddit

Well, don't quant the KV for Gemma, but for Qwen the q8_0 for both is a nonissue.

[-]

Long_comment_san@reddit

Thats actually amazing way to phrase it.

[-]

kiwibonga@reddit

Wish I could get far enough to see a token but still-not-working vllm has decided it needs more than 32 GB of RAM and 32 GB of VRAM and all my swap to serve a 19.6 GB model on 2 GPUs.

[-]

FullstackSensei@reddit

I find it funny how so many people cite NVFP4 as a reason to pay a premium for Blackwell over much cheaper Ampere cards, yet ignore how broken NVFP4 support is everywhere, including CUDA.

Your setup is not vanilla in the sense that consumer Blackwell is currently very much an afterthought for Nvidia. Pascal, Turing, Ampere and Ada are way more vanilla, because all the bugs were ironed out years ago, when consumers still mattered.

[-]

Long_comment_san@reddit

NVFP4 has been the AI waifu all along

[-]

Fit_Split_9933@reddit

nvfp4 is OK. When I used qwen3.6-27b-nvfp4, the speed of PP increased by about 60%, while the speed of TG increased about 5%. Hopefully there will be optimizations in the future.

[-]

Xp_12@reddit

nvfp4 is fine. it's mostly quantization recipe issues affecting kld and acceleration not being fully supported on sm_120. good luck finding a good nvfp4 w4a4. might find some okay w4a16.

[-]

jtjstock@reddit

Nvfp4 is dogshit on llama right now

[-]

10F1@reddit

I'm getting 110tps on 7900xtx.

[-]

migsperez@reddit

Q4 and Mtp, does it generate decent output? Is the quality good?

[-]

ea_man@reddit

MTP has nothing to do with quality.

[-]

shrug_hellifino@reddit

Unless... it forces you to go to a lower quant due to a constrained system. So indirectly I can, but it is still 🤌

[-]

ea_man@reddit

MTP doesn't force you, you implement that.

MTP generates cheaper token that must be validate by the main model: if tose are the SAME they go in if they are not they stay OUT. So you see there's zero difference in the output the LLM will generate.

[-]

ohhi23021@reddit

Mtp uses more vram, meaning lower quant if it dont fit…

[-]

ea_man@reddit

Again: that is not MTP, that is you tuning down the model because your original model was close to the limit of your VRAM and to use MTP you need extra room for heads and compute.

What if I use MTP on a 4B model that leaves extra 4GB of VRAM? Does MTP forces you to lower quants?

What if you place the draft model and its cache on an other GPU?

So show me any benchmark where the model gets less quality because you turn draft-mtp instead that none.

[-]

guigouz@reddit

I gave up on q4 because of the quality, q6 is a bit slower (35tps vs 50tps in my 4060ti), but responses are much better and is way less likely to get into loops

[-]

sn2006gy@reddit

Nice. I get 135 tps at 70 watts with asus dgx spark using int4 auto round so you should be able to squeeze more

[-]

Fluxx1001@reddit

What about prompt prefill with your setup? How long does it usually take to first token?

[-]

Ok-Measurement-1575@reddit

Are the outputs ok, though?

[-]

havnar-@reddit

Probably not, but people in here love to post about their speeds at shitty quants

[-]

Available_Hornet3538@reddit

I don't think will be good coder

[-]

FullstackSensei@reddit

I don't know, I'd rather get a single V100 32GB for under $k if you're running llama.cpp.

More than double the memory bandwidth. Idle power might be lower on two 4060Ti, but V100 will have significantly lower power consumption under load.

[-]

Chuyito@reddit (OP)

I used to roll with vllm for years for dual GPU since llamacpp had layer and row split.

Recently tensor split got MUCH more mature on llamacpp which brought it to par with vllm for multi-gpu.

[-]

miversen33@reddit

I'm really curious about tensor split but when I am able to use it, the perf is just no where near as good as basic layer split. I'm using AMD which I suspect is part of the issue but I'd love to hear a bit about your configuration to see if I can get tensor split working well

[-]

ai_without_borders@reddit

tensor split in llama.cpp fixing the layer split overhead changed this whole calculus was skeptical of dual mid-range for a while since you used to lose so much bandwidth efficiency to layer routing overhead. but now you actually get close to linear scaling on inference. for a startup running internal tooling this is a much easier argument than waiting months for your single high-end card to die with no spare in stock

[-]

Chuyito@reddit (OP)

100% this on the startup running local tooling. I think the big thing for me was that Q2 2026 models became useful enough as a daily-driver for certain work tasks, and the inference tools got sped up to make homelab infra actually feasible.

It feels like it's been one compounding imprivement/fix after another:

- Tensor support llamacpp

- llama server built in api to toggle models quickly: \~15s to change between 27b dense and 35b3a whereas months ago that would have been minutes

- MTP and whatever the latest version of speculative computing they did without losing accuracy

- Whatever the podman/nvidia guys did to make container gpu stable

Open source has been cooking.

[-]

gtrak@reddit

35b is too dumb for the coding I do. 70 tok/s on 27b q6_k with 2x5060ti.

[-]

Client_Hello@reddit

Could you share your llama build and full launch command?

How did you get parralel tensor and quantized kv, I thought that was not yet supported?

I can only fit 96k context with 27b q6_k with MTP and sm tensor.

[-]

gtrak@reddit

Sure, you need to build from this PR to have quantized kv-cache: https://github.com/ggml-org/llama.cpp/pull/23792

./llama-server \
      --port 1234 \
      --host 0.0.0.0 \
      --model "models/Qwopus3.6-27B-v2-MTP-Q6_K.gguf" \
      --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 1.5 \
      -fa on -t 12 \
      --fit-target 64 \
      -ctk q8_0 -ctv q8_0 \
      --split-mode tensor \
      --spec-type draft-mtp --spec-draft-n-max 3 \
      -fit off \
      --ctx-size 180000 \
      -b 1024 -ub 512 \
      -lv 4 -ngl 999 \
      -kvu \
      --no-mmap \
      --parallel 1 \
      --cache-ram 24000 \
      --chat-template-file "models/chat_template.jinja" \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --seed 3407 \
      --jinja

[-]

Client_Hello@reddit

Thanks, will try that. Any build flags beyond the usual cuda stuff?

[-]

gtrak@reddit

nothing special:

cmake -B build \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_NATIVE=OFF -DCMAKE_CUDA_ARCHITECTURES=120a \
  -DCMAKE_BUILD_TYPE=Release

[-]

Dandz@reddit

How? Sm tensor? I get about 30 tps

[-]

gtrak@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1tryp2q/comment/ooslmzg/

[-]

Fair_Ad_1344@reddit

I did some actual A/B comparison between 27B and 35B-A3B in SQL generation last night, in a pipeline that gives the LLM a lot of instruction and the ability to retrieve full table schemas and semantic hints. Running both at Q4 and MTP enabled, given the exact same query, 27B did significantly worse, continuing to hallucinate column names despite schema access and prior examples. 35B-A3B had zero hallucinations and produced no technical SQL issues, such as GROUP BY errors. It was reproducible.

Also, llama.cpp has come a long way in supporting dual GPUs and handles MXFP4 along with NVFP4 quants on Blackwell cards just fine as long as you compile with support for Blackwell specifically. I have far too many hours logged trying to get vLLM works on dual 5060Tis, and llama.cpp is producing far more than acceptable performance with dual GPUs.

[-]

gtrak@reddit

I mostly do rust or clojure. I don't have hallucinations like that. 27b can one shot small to medium tasks as a subagent with another model orchestrating like opus or kimi. If I'm just exploring, I'll have it act as orchestrator, too. 35b devolves into paren counting faster and can't recover, or is just worse at reasoning over nontrivial codebases so just wastes time.

[-]

Dandz@reddit

What's it mean to compile llama.cpp for blackwell specifically? Is there a different flag or something?

[-]

overand@reddit

Oh damn, I assumed they were on 27b and was trying to figure out how I could get closer to numbers like that on my dual 3090 setup. But yeah, 35B, that makes sense.

I do remember it looking decent on coding benchmarks, but with 27b as an option, yeah...

[-]

PigSlam@reddit

You seem to be getting similar performance to the Radeon Pro AI R9700 32GB I just got. You're using two PCIe slots to do it, but it costs less than my ~$1500 GPU.

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

rdkilla@reddit

have you tried different split modes?

[-]

jaybsuave@reddit

wow this is impressive ngl

[-]

Cultural-BookReadeR@reddit

Almost same setup, but can't even start llama.cpp with your params for dense model at mtp with tensor split. Can you share full config, please?

[-]

Chuyito@reddit (OP)

Sure,

podman run -d \
  --name llama-qwen36-router \
  --device nvidia.com/gpu=all \
  -v /data/models:/root/.cache/huggingface:ro \
  -v /data/vllm_cache:/cache:Z \
  -v /data/llama_presets:/presets:ro \
  -p 8001:8080 \
  --env NVIDIA_VISIBLE_DEVICES=all \
  --env LD_LIBRARY_PATH=/app:/usr/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 \
  --ipc=host \
  --restart=unless-stopped \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  --models-preset /presets/qwen36-models.ini \
  --models-max 1 \
  --host 0.0.0.0 \
  --port 8080

And qwen36-models.ini:

version = 1

[*]
n-gpu-layers = all
host = 0.0.0.0
port = 8080

ctx-checkpoints = -1
mmap = false
flash-attn = on

; threads = 16
; threads-batch = 20
; n-cpu-moe = 80
cache-ram = 2048
parallel = 1
batch-size = 2048
ubatch-size = 1024

jinja = true
reasoning = on
reasoning-budget = 1000
metrics = true
load-on-startup = false


[qwen36-27b-mtp-tensor]
hf-repo = unsloth/Qwen3.6-27B-MTP-GGUF
hf-file = Qwen3.6-27B-UD-Q4_K_XL.gguf

split-mode = tensor
tensor-split = 0.95,0.95
ctx-size = 100000 
spec-type = draft-mtp
spec-draft-n-max = 2


[qwen36-35b-a3b-mtp-q4xl-mtpOn-Tensor]
hf-repo = unsloth/Qwen3.6-35B-A3B-MTP-GGUF
hf-file = Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

split-mode = tensor
tensor-split = 0.97,0.97
ctx-size = 125000 
spec-type = draft-mtp
spec-draft-n-max = 2

If you are still running llamacpp per model instead of with the server, it would be

podman run -d \
  --name llama-qwen36-35b-a3b-mtp-gguf \
  --device nvidia.com/gpu=all \
  -v /data/models:/root/.cache/huggingface:ro \
  -p 8001:8080 \
  --env NVIDIA_VISIBLE_DEVICES=all \
  --env LD_LIBRARY_PATH=/app:/usr/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 \
  --ipc=host \
  --restart=unless-stopped \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  --hf-repo unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers all \
  --ctx-size 125000 \
  --ctx-checkpoints -1 \
  --batch-size 2048 \
  --mmap \
  --ubatch-size 512 \
  --flash-attn on \
  --split-mode tensor \
  --tensor-split 0.97,0.97 \
  --threads 16 \
  --threads-batch 20 \
  --cache-ram 2048 \
  --parallel 1 \
  --jinja \
  --reasoning on \
  --reasoning-budget 1000 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --metrics

[-]