Qwopus3.5 V3 is awesome for a local LLM
Posted by chocofoxy@reddit | LocalLLaMA | View on Reddit | 13 comments
I tried Qwopus3.5 by Jackrong and it's very powerful. It's more stable and smarter than base Qwen3.5. I tried the GGUF 9B version and it surprised me, because I never managed to use Qwen3.5 9B by linking it to Qwen Code or Continue: it would always hang and the client would disconnect after 2 messages. But this model is just a beast, it's enhanced by Opus 4.6. Did anyone else try it?
SimilarManagement414@reddit
I have an RTX 4090 GPU with 24GB.
I tried to run this model using vLLM but always got errors like these:
# (APIServer pid=1799219) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
# TokenizersBackend not found
Can you help ?
chocofoxy@reddit (OP)
Are you using vLLM? Because I tried it on vLLM and LM Studio.
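Not sure this is it, but that `Tokenizer class ... does not exist` error usually means the installed `transformers` is too old to know the tokenizer class named in the model's `tokenizer_config.json`. Worth trying an upgrade first; this is just a sketch (`<model>` is a placeholder, exact versions needed are a guess on my part):

```shell
# Upgrade the tokenizer stack, then retry; if it still fails,
# fall back to the slow (Python) tokenizer path in vLLM
pip install -U vllm transformers tokenizers
vllm serve <model> --tokenizer-mode slow
```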
Worried_Drama151@reddit
It's already outdated now that Gemma 4 is out.
AppealSame4367@reddit
Gemma 4 is a great conversationalist, but not a super strong reasoner over big contexts.
Tricky-Contact-5028@reddit
I tested Gemma4 27b4A - it struggled with tool calling sometimes, and for coding I didn't get much out of it either.
grumd@reddit
It's clear you haven't actually used them and compared their performance. Qwen is still much better for coding, for example.
Finanzamt_Endgegner@reddit
Gemma 4 and Qwen3.5 are similar in performance; in some cases Gemma might be better, in others Qwen3.5. I wouldn't say that makes the Qwen3.5 models outdated lol
apollo_mg@reddit
This is a really nice model for local agentic orchestration. I'm still in the first couple of days of testing with open-mult-agent, but so far I really like it. Competent coding skills too. Using v2 k2_k, 16GB VRAM, turboquant3, 65k context. I'm having to use some stability hacks at the moment on my 9070 XT, but getting roughly 25 tps.
apollo_mg@reddit
# --- AMD stability workarounds (9070 XT) ---
export HSA_OVERRIDE_GFX_VERSION=12.0.1  # override the ROCm GFX target
export HSA_ENABLE_SDMA=0                # disable the SDMA copy engines
export AMDGPU_CWSR_ENABLE=0             # disable compute wave save/restore
export HSA_XNACK=0                      # disable XNACK page-fault retry
# --- Launch Server ---
# Utilizing the TurboQuant Asymmetric KV Caching (-ctk q8_0 -ctv turbo3)
$SERVER -m "$MODEL" \
-c 65536 \
-b 512 \
-ctk q8_0 \
-ctv turbo3 \
-cb \
-fa on \
-np 1 \
-ngl 99 \
--cache-ram 0 \
--port 8082 \
--host 0.0.0.0 \
--jinja \
--chat-template-kwargs '{"enable_thinking":true}'
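Once it's up, a quick smoke test against that server (assuming `$SERVER` is llama-server, which exposes an OpenAI-compatible endpoint on the port configured above):

```shell
# Hit the chat completions endpoint on port 8082
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'
```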
chocofoxy@reddit (OP)
For some reason vLLM kept breaking on me because of the Qwen 3.5 architecture, or else it couldn't use KV cache offloading, since I only have a 16GB GPU and 32GB of DRAM.
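Rough math on why a long context blows past 16GB for me. This is a back-of-envelope sketch only; the per-model numbers below (layers, KV heads, head dim) are made up for illustration, not the real Qwopus config:

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

# Hypothetical 9B-class config with a fp16 cache at 65k context
size = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128,
                      context=65536, bytes_per_elem=2)
print(f"{size / 2**30:.1f} GiB")  # prints "9.0 GiB"; a 1-byte q8_0 cache halves it
```

So the cache alone can eat most of the card before weights even load, which is why offloading matters here.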
eki78@reddit
I'm getting 25 tok/sec with a 10-year-old GPU! Thank you Qwopus!
Tricky-Contact-5028@reddit
Trying it today, let's see.
chocofoxy@reddit (OP)
Try it and tell me what you think.