Qwopus3.5 V3 is awesome for a local LLM
Posted by chocofoxy@reddit | LocalLLaMA | View on Reddit | 13 comments
I tried Qwopus3.5 by Jackrong and it's very powerful. It's more stable and smarter than base Qwen3.5. I tried the GGUF 9B version and it surprised me, because I never managed to use Qwen3.5 9B by linking it to Qwen Code or Continue: it would always hang and the client would disconnect after 2 messages. But this model is just a beast, it's enhanced by Opus 4.6. Did anyone else try it?
SimilarManagement414@reddit
I have an RTX 4090 GPU with 24GB.
I tried to run this model using vLLM but always got errors like these:
# (APIServer pid=1799219) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
# TokenizersBackend not found
Can you help ?
chocofoxy@reddit (OP)
Are you using vLLM? Because I tried it on vLLM and LM Studio.
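Not sure this is it, but that `Tokenizer class ... does not exist` error usually means the installed `transformers` is too old to know the tokenizer class named in the model's `tokenizer_config.json`. Worth trying an upgrade first; this is just a sketch (`<model>` is a placeholder, exact versions needed are a guess on my part):

```shell
# Upgrade the tokenizer stack, then retry; if it still fails,
# fall back to the slow (Python) tokenizer path in vLLM
pip install -U vllm transformers tokenizers
vllm serve <model> --tokenizer-mode slow
```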
Worried_Drama151@reddit
It's already outdated now that Gemma 4 is out.
AppealSame4367@reddit
Gemma 4 is a great conversationalist, but not a super strong reasoner over big contexts.
Tricky-Contact-5028@reddit
I tested Gemma4 27b4A - it struggled with tool calling sometimes, and for coding I didn't get much out of it either.
grumd@reddit
It's clear you haven't actually used them and compared their performance. Qwen is still much better for coding, for example.
Finanzamt_Endgegner@reddit
Gemma 4 and Qwen3.5 are similar in performance; in some cases Gemma might be better, in others Qwen3.5. I wouldn't say that makes the Qwen3.5 models outdated lol
apollo_mg@reddit
This is a really nice model for local agentic orchestration. I'm still in the first couple of days of testing with open-mult-agent, but so far I really like it. Competent coding skills too. Using v2 k2_k, 16GB VRAM, turboquant3, 65k context. I'm having to use some stability hacks at the moment on my 9070 XT, but getting roughly 25 tps.
apollo_mg@reddit
# --- AMD stability workarounds (9070 XT) ---
export HSA_OVERRIDE_GFX_VERSION=12.0.1  # override the ROCm GFX target
export HSA_ENABLE_SDMA=0                # disable the SDMA copy engines
export AMDGPU_CWSR_ENABLE=0             # disable compute wave save/restore
export HSA_XNACK=0                      # disable XNACK page-fault retry
# --- Launch Server ---
# Utilizing the TurboQuant Asymmetric KV Caching (-ctk q8_0 -ctv turbo3)
$SERVER -m "$MODEL" \
-c 65536 \
-b 512 \
-ctk q8_0 \
-ctv turbo3 \
-cb \
-fa on \
-np 1 \
-ngl 99 \
--cache-ram 0 \
--port 8082 \
--host 0.0.0.0 \
--jinja \
--chat-template-kwargs '{"enable_thinking":true}'
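Once it's up, a quick smoke test against that server (assuming `$SERVER` is llama-server, which exposes an OpenAI-compatible endpoint on the port configured above):

```shell
# Hit the chat completions endpoint on port 8082
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'
```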
chocofoxy@reddit (OP)
For some reason vLLM kept breaking on me because of the Qwen 3.5 architecture, or else it couldn't use KV cache offloading, since I only have a 16GB GPU and 32GB of DRAM.
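Rough math on why a long context blows past 16GB for me. This is a back-of-envelope sketch only; the per-model numbers below (layers, KV heads, head dim) are made up for illustration, not the real Qwopus config:

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

# Hypothetical 9B-class config with a fp16 cache at 65k context
size = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128,
                      context=65536, bytes_per_elem=2)
print(f"{size / 2**30:.1f} GiB")  # prints "9.0 GiB"; a 1-byte q8_0 cache halves it
```

So the cache alone can eat most of the card before weights even load, which is why offloading matters here.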
eki78@reddit
I'm getting 25 tok/sec with a 10-year-old GPU! Thank you Qwopus!
Tricky-Contact-5028@reddit
Trying it today, let's see.
chocofoxy@reddit (OP)
Try it and tell me what you think.