Got hipfire running in Docker on my RX 7900 XTX alongside llamacpp
Posted by AgentErgoloid@reddit | LocalLLaMA | 13 comments
Been dealing with long context failures on Qwen3.6 27B and stumbled onto hipfire. Spent an evening dockerizing it so it runs alongside an existing llamacpp stack without touching anything.
Running Qwen3.6 27B MQ4 on a 7900 XTX. The TriAttention sidecar and DFlash draft both load correctly per the logs. ~40 tok/s AR, haven't confirmed DFlash is actually engaging yet. Still early but it responds correctly and the API is clean.
One thing that tripped me up: hipfire isn't a single binary you just run. The CLI is a Bun/TypeScript HTTP server that spawns the engine as a subprocess. Relevant if you're trying to dockerize it.
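If it helps to picture it, the shape is roughly this: one long-lived engine child process behind a small Bun HTTP front end. This is my own sketch, not hipfire's code, and the binary name, flag, and line protocol are invented, but it's why the image needs both the Bun runtime and the engine binary, with the CLI as the container entrypoint.

```ts
// Minimal sketch of the shape described above, not hipfire's actual code:
// a Bun HTTP server that owns the engine as a child process and talks to it
// over stdin/stdout. Binary name, flag, and line protocol are made up.
const engine = Bun.spawn(["./hipfire-engine", "--model", "/models/qwen-27b.mq4"], {
  stdin: "pipe",   // CLI writes requests here
  stdout: "pipe",  // engine writes responses here
});
const engineOut = engine.stdout.getReader();
const decoder = new TextDecoder();

Bun.serve({
  port: 8080,
  async fetch(req) {
    // Forward the HTTP body to the engine as a single line...
    engine.stdin.write((await req.text()) + "\n");
    engine.stdin.flush();
    // ...and return whatever the engine emits next (no real framing, for brevity).
    const { value } = await engineOut.read();
    return new Response(value ? decoder.decode(value) : "", {
      headers: { "content-type": "application/json" },
    });
  },
});
```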
If there's interest I'll put the Dockerfile and compose setup on GitHub tomorrow. Happy to answer questions in the meantime.
Optimal_Guava5390@reddit
Does this work on MoE models??
AgentErgoloid@reddit (OP)
Haven't tried it personally, but it's supported. With caveats. DFlash is disabled by default on A3B and there's a known thinking-loop bug.
Optimal_Guava5390@reddit
Just tried it. I've only got a 7900 XT so I don't have the room for the MoE in hipfire anyway, and the 27B is pretty much equal to my Vulkan setup. Gutted, it's a very cool setup they've got, but the VRAM requirement pushes me further into the 7900 XTX regret I've had lately 🤦🏼♂️
ROS_SDN@reddit
I'm a bit confused, can you dumb this down for me?
I have two 7900 XTXs. Is this like an added inference backend I bolt onto llama.cpp to increase decode? Does it help prefill?
How should I be implementing this? They compare it to ollama, not llama.cpp?
I see it's in alpha so I'm skeptical, and I see it's ROCm 6.x which makes sense, but does it help make ROCm 7+ viable on a 7900 XTX?
I'd love better decode, and better prefill, but I'm not sure how it bolts in.
I'm assuming it's standalone and I'm still beholden to ROCm 6.4-ish on the 7900 XTX, because ROCm is the issue, not llama.cpp, for gfx1100 support at newer builds.
AgentErgoloid@reddit (OP)
Not a llamacpp addon, it's a completely separate inference engine. You swap llamacpp out entirely and point your client (opencode, continue, whatever) at hipfire's OpenAI-compatible API instead.
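Concretely, "pointing your client at it" just means overriding the base URL. Rough sketch using the openai npm package as a stand-in for opencode/continue; the port and model id are guesses about a local setup, not anything hipfire dictates.

```ts
// Hedged sketch: any client that accepts an OpenAI base URL override should work.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",  // hipfire's OpenAI-compatible endpoint (port is an assumption)
  apiKey: "not-needed",                 // local servers generally ignore this
});

const reply = await client.chat.completions.create({
  model: "qwen3.6-27b-mq4",             // placeholder model id
  messages: [{ role: "user", content: "Write a binary search in TypeScript." }],
});
console.log(reply.choices[0].message.content);
```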
Decode: yes, big gains. ~40 tok/s AR on 27B, up to 185 tok/s on code with DFlash speculative decoding. Prefill is actually a weak spot vs llamacpp right now, there's an open issue tracking the regression.
Two 7900 XTX: not yet. Multi-GPU isn't implemented. Single card only for now.
ROCm: it ships its own HIP runtime via dlopen so you don't need the full ROCm stack installed, just libamdhip64.so. Doesn't help or hurt ROCm 7 viability, it sidesteps the question entirely.
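Side note if you end up dockerizing it: this is the kind of sanity check I'd run inside the container to confirm that one library is actually visible. It's my own probe via Bun's FFI, not something hipfire ships; hipGetDeviceCount is the real HIP call, but adjust the library path for your system.

```ts
import { dlopen, FFIType, ptr } from "bun:ffi";

// Load just the HIP runtime, the single dependency hipfire needs.
// Adjust the path if your libamdhip64.so lives elsewhere (e.g. /opt/rocm/lib/).
const hip = dlopen("libamdhip64.so", {
  hipGetDeviceCount: { args: [FFIType.ptr], returns: FFIType.i32 },
});

const count = new Int32Array(1);
const status = hip.symbols.hipGetDeviceCount(ptr(count)); // 0 === hipSuccess
console.log(status === 0 ? `HIP sees ${count[0]} device(s)` : `HIP error code ${status}`);
```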
Alpha caveat is real. Tool calling is broken, long outputs loop, thinking mode is unstable. Short code generation is where it shines right now. Worth running alongside llamacpp rather than replacing it until it matures.
mbrodie@reddit
This lines up with my hipfire testing…
I have 2x 7900 XTX in my server. hipfire only runs single cards, you can't offload between the two, so the whole model and cache needs to fit on one card. It's probably the highest-performing engine for RDNA 3 right now, but there are trade-offs.
You're using an MQ4 or MQ6 at most, and I'm not sure how much control there is compared to other platforms… like in my benchmarks they wasted all their tokens on thinking and did zero output compared to the GGUFs I was using for 8-bit quants etc…
Definitely promising and worth keeping an eye on though!
AgentErgoloid@reddit (OP)
Single card here too, so same constraint.
The thinking token issue is real. I hit exactly that, 512 tokens consumed with empty output. Partially addressed by bumping repeat_penalty to 1.1 and dropping temperature to 0.6, but long-form structured output still loops. Feels like an engine-level issue rather than config.
Tool calling is also unreliable. Running it through pi as an agentic harness, the model occasionally produces malformed JSON in tool calls (dropped quotes etc.) which hangs the agent loop. Same prompts work fine on the GGUF via llamacpp. Probably MQ4 quantization affecting output stability on structured formats.
Short responses and Q&A are solid and it feels fast for interactive use.
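For reference, the exact repeat_penalty / temperature values above, as they'd sit in a /v1/chat/completions request body. Worth noting repeat_penalty is a llama.cpp-style extension rather than OpenAI spec, so whether hipfire reads it per-request (vs. only from server config) is something I'd verify rather than assume.

```ts
// Sampling overrides that partially tamed the thinking loop for me.
// repeat_penalty is a llama.cpp-style extension, not OpenAI spec; verify hipfire honors it.
const samplingOverrides = {
  temperature: 0.6,     // dropped from the default; fewer runaway continuations
  repeat_penalty: 1.1,  // discourages re-emitting the same span over and over
};
```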
RoomyRoots@reddit
Very tangential to your approach. I am using Incus instead of Docker for my images. I found it much easier to make a base install and generate images from it. Worth checking out if you have issues with docker.
AgentErgoloid@reddit (OP)
Interesting, hadn't looked at Incus. My main motivation for Docker was keeping hipfire isolated from my existing llamacpp stack and having a reproducible build. Does Incus handle that use case well? Curious if you've run ROCm workloads through it.
RoomyRoots@reddit
Yeah, it does. I have nothing running locally outside of images; there is no way I am doing anything AI without at least some sandboxing. I pass my GPUs and NVMe through to images when I need to, as I can do it dynamically without needing to restart anything.
I tested hipfire the other day, but since offloading doesn't work I couldn't run the 27B, so I am back to llamacpp and vLLM.
Puzzleheaded-Drama-8@reddit
That's quite a small improvement, I get 36 tok/s on regular llamacpp on Vulkan. How about prompt processing?
AgentErgoloid@reddit (OP)
The 40 tok/s figure is also calculated, not directly observed. hipfire's architecture is a Bun/TypeScript HTTP server that spawns the inference engine as a subprocess over stdin/stdout IPC. The engine stats (tok/s, spec accept rate) are consumed internally by the Bun layer and never surfaced via the API or logs. So there's no way to read them from outside without patching the CLI. I timed responses and divided by token count.
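For transparency, the "measurement" is literally just this, with the same endpoint/port/model assumptions as earlier in the thread. If hipfire doesn't populate usage.completion_tokens in its responses, the fallback here is only a rough character-based estimate.

```ts
// Wall-clock tok/s: time one request, divide completion tokens by elapsed seconds.
const start = performance.now();
const res = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    model: "qwen3.6-27b-mq4",  // placeholder model id
    messages: [{ role: "user", content: "Explain mutexes in one paragraph." }],
    max_tokens: 256,
  }),
});
const data = await res.json();
const seconds = (performance.now() - start) / 1000;
const tokens = data.usage?.completion_tokens
  ?? Math.round(data.choices[0].message.content.length / 4);  // rough fallback: ~4 chars/token
console.log(`${(tokens / seconds).toFixed(1)} tok/s wall clock (includes prefill + TTFT)`);
```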
In practice it feels faster than that number suggests. First token appears almost instantly, then it gets stuck mid-sentence on longer outputs. So the perceived speed is good but reliability on long-form generation is still rough.
XccesSv2@reddit
I get with hipfire:

| Prefill test | mean tok/s | min | max | stdev | total ms |
| --- | ---: | ---: | ---: | ---: | ---: |
| pp128 | 451.2 | 449.8 | 452.5 | 1.0 | 283.7 |
| pp512 | 453.4 | 452.7 | 454.2 | 0.5 | 1129.2 |

| Metric | mean | min | max | stdev |
| --- | ---: | ---: | ---: | ---: |
| Prefill tok/s (user prompt, 20 tok) | 262.0 | 261.5 | 262.4 | 0.3 |
| TTFT ms | 76.4 | 76.2 | 76.5 | 0.1 |
| Decode tok/s | 43.0 | 42.7 | 43.1 | 0.2 |
| Wall tok/s | 41.9 | 41.6 | 42.0 | 0.2 |
And with llama.cpp ROCm with the unsloth Q4_0 quant:
| model | size | params | backend | ngl | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| qwen35 27B Q4_0 | 14.70 GiB | 26.90 B | ROCm | 999 | ROCm0 | pp2048 | 957.37 ± 1.48 |
| qwen35 27B Q4_0 | 14.70 GiB | 26.90 B | ROCm | 999 | ROCm0 | tg128 | 35.06 ± 0.03 |
So prompt processing is very slow, but token generation (with the draft model) is higher.