Qwen3.6-35B-A3B-Uncensored-Genesis-APEX-MTP
Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 67 comments
Here model: https://huggingface.co/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-V2-APEX-MTP-GGUF
Safetensors: https://huggingface.co/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-V2-FP8-Safetensors
Testing results in Open Code on hardware (Beelink gtr9 pro + Strix Halo) done by my friend on Q8_K_P - MTP quant:
- 5 sessions with 200k context, not a single glitch, no loops, no repeated tool calls.
- After 120k tokens I suddenly gave another task that doesn't intersect with what it was doing at all, and it calmly picked up and solved it correctly.
- Uncensored with MTP support with APEX quantization.
Recommended quant: APEX, APEX-MTP
Recommended settings for LM Studio:
Or use this minimal string as the first line:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Then add anything you want after. Model may underperform without this first line.
Settings:
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top K Sampling | 20 |
| Presence Penalty | 1.5 |
| Repeat Penalty | 1.0 |
| Top P Sampling | 0.8 |
| Min P Sampling | 0 |
| Seed | 42 |
Enjoy 😄
DaMan123456@reddit
I honestly love your system prompt
EvilEnginer@reddit (OP)
Me too ❤️
kukalikuk@reddit
How did you use MTP on LM Studio? My MTP model keep failing to load by LM studio.
Timboman2000@reddit
It was only added within the past couple days, so you need to update both LM Studio and the Llama.cpp backend you're using to the latest version.
EvilEnginer@reddit (OP)
I simply updated LM Studio, and llama.cpp in it to latest version. After that I enabled MTP support in advanced model loading settings. That's it.
Flkhuo@reddit
Can you run this on rtx 4090 24gb vram 200k context and above 120tok/second?
No-Implement9967@reddit
LocalLLaMA users casually running 35B models with 200k context on mini PCs while big tech still says “requires 8 H100s” 💀
Tough_Frame4022@reddit
Try Qwen 80b with one million context and streaming weights on a 24 GB 3090 and 64 GB of ram. That's why I'm doing
IrisColt@reddit
Is it doable?
Tough_Frame4022@reddit
Passes MRCR V2 8 needle and all NIAH tests at 1 million tokens while streaming weights. Plus multi doc QA and long doc retrieval. It's called Canal. It's being hardened now for release.
pilibitti@reddit
quant? tps?
Dm-Tech@reddit
Iup i just set up this 35b on a 4gb old amd card and 16gb of ram. Turboquant, mtp and caching make it very usable at 5tps on 110k context, i can usa hermes or similar agentic workflows, slow but usable on a very old pc.
Rare_Potential_1323@reddit
And don't forget about REAP models
Icy-Degree6161@reddit
Aren't they a lot dumber?
Rare_Potential_1323@reddit
This video might be of interest to you https://www.youtube.com/watch?v=QIZz4AF0U24
CalligrapherFar7833@reddit
They are running 35b 3b moe not 35b dense
Thebandroid@reddit
he never said how long those 5 sessions took
redditpad@reddit
Very cool
Squidgical@reddit
I guess they're considering "what hardware is needed to run this model as a service"? Otherwise wtf honestly.
EvilEnginer@reddit (OP)
Yep 😂
kyr0x0@reddit
Does it exist as 27B as well?
kyr0x0@reddit
u/EvilEnginer if you DM me, I can create it. I have massive GPU resources rn and they idle.. just have no time to come up with the scripts myself. I can also up the final results for you so that they live under your name, giving you the credits.
EvilEnginer@reddit (OP)
Thanks :). 27B can be processed on Google Collab Free Tier. I don't need GPU resources for it.
kyr0x0@reddit
Oh great! Thanks 🙏 will you do it? Also - would you be so kind to share the scripts on GitHub so the community could come up with those models as well whenever 3.7 or so pops up? :)
EvilEnginer@reddit (OP)
The scripts are my core IP - they're the result of months of reverse-engineering tensor geometry. I don't plan to open-source them. What I do plan is to keep releasing Genesis models for the community when new versions drop.
About 27B. I tried to fix it. I didn't like the results, it's still looping too much. So I will stick with 35B-A3B instead, because it's fast and efficient.
kyr0x0@reddit
Okay, no worries :) Thanks for the weights!
FarRub2855@reddit
Holding 200k context without looping is a huge deal for parsing massive call transcripts. Definately grabbing the APEX quant to see how it handles my messy notes later.
RefrigeratorMuch5856@reddit
Pfff what are this system interactions? Are there any papers that measure if this style actually works?
EvilEnginer@reddit (OP)
I tested. It works nicely with Qwen.
RefrigeratorMuch5856@reddit
Yes but why can’t these be written in a normal way? Models are not deterministic when using temp > 0 so how do you know?
EvilEnginer@reddit (OP)
It can be written in normal way 😄. Feel free to experiment.
bhagathgoud99@reddit
Can I offload it like MoE? I'm using 35B on 8GB Vram using MoE. Will MTP run with same speed like MoE?
EvilEnginer@reddit (OP)
Yes you can. Just pick APEX Compact quant.
mycall@reddit
How do you get MTP and vision to play nicely together?
EvilEnginer@reddit (OP)
I extracted and transferred MTP tensors from Unsloth quants. I am not using MTP by myself. It's really slow on my RTX 3060.
Miserable-Dare5090@reddit
How convinced are you that any “special” system prompt is necessary? My experience is, no model has a need for a special prompt as much they have a need for YOUR special prompt. engineer your own prompt for the task and you’ll have better results.
EvilEnginer@reddit (OP)
You can use your own System Prompt if you want.
Medical-Newspaper519@reddit
What's the speed u get for the Q8 on your 3060?
EvilEnginer@reddit (OP)
I can't run Q8 on RTX 3060 12GB. My friend on his AI mini PC has 50 tokens per second.
I am using APEX Compact. Have 18 tokens per second on CUDA 12 llama.cpp (v2.16.0) in LM Studio.
Medical-Newspaper519@reddit
So you're using Q4 gguf?
EvilEnginer@reddit (OP)
Yes. I am using Q4_K_M (APEX Compact)
Medical-Newspaper519@reddit
I feel like you should get more than 18 tps on your 12gb vram for that model
EvilEnginer@reddit (OP)
Yes, I can get more tps via pure llama-server. But I just like LM Studio, it's simply amazing, and doesn't overload my GPU.
Medical-Newspaper519@reddit
Oke, understandable ;)
alchninja@reddit
With 8GB VRAM you're unlikely to see very big gains from MTP, if any. MTP models are larger than their typical NTP counterparts due to the addition of the internal token prediction layers. For MTP MoE with CPU offload this ultimately means fitting fewer layers on your GPU, resulting in you becoming more bound by CPU compute and memory bandwidth. It's still worth a try, but I suspect any token generation speed improvement you get from MTP will be significantly undercut by the less efficient GPU usage. Your prompt processing speed will also take an unavoidable hit because of how MTP works.
I'm on 16GB VRAM and found the MTP tg/s gain to only be around 5-10% at best, degrading faster then NTP as context fills up until it eventually becomes noticeably slower than the NTP model.
Bastron@reddit
Im getting a good speedup in generation when offloading from my 16GB card, though prompt processing slows down. But its still much faster overall in agentic tasks for me, try to do a side by side comparison depending on your use case, i bet you will be suprised!
kyr0x0@reddit
Seed 42 was genius ;)
EvilEnginer@reddit (OP)
Thanks 😉. Yep I like this number.
rohitmdksub@reddit
I have 3060ti rtx 8gb and 12gb of ram. I have been using deepseek v4 . Do u think i can use qwen 3.6 35B
EvilEnginer@reddit (OP)
I think not. You need at least 32 GB RAM to run this model.
TheCTRL@reddit
Preserve thinking works too ?
EvilEnginer@reddit (OP)
Yes. It works. Use Chat Template Thinking.
TheCTRL@reddit
Sorry man but I have problems with tool calling
TheCTRL@reddit
Tnx I’ll give it a try!
Napster3301@reddit
this is a real issue with the abliterated/uncensored gguf quants generally, not specific to apex. they were converted from upstream qwen3-coder before the 2025-08-05 chat template fix got merged, so the embedded jinja still emits the broken bracket variants ([function=X], function=NAME, mixed) instead of clean openai tool_calls arrays.
fix is to override the embedded template at launch with the upstream one. for llama-server: --chat-template-file /path/to/template.jinja, source it from https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/resolve/main/chat_template.jinja. lm studio has it in advanced settings iirc.
with the override the tool calling is clean. json/tool_call issues you see are almost always template-side not weights-side. abliteration doesnt touch tool calling behavior, only the refusal classifier paths.
Dadda9088@reddit
Is it able to see images with MTP?
EvilEnginer@reddit (OP)
Yep. With MTP prompt generation for images works too.
Dadda9088@reddit
Oh good to know and time to switch 😉 Thanks for this, I will try as soon as my agent finish its tasks
EvilEnginer@reddit (OP)
Nice. Share your impressions later 😄.
Creative_Bottle_3225@reddit
why do you use a fixed seed?
EvilEnginer@reddit (OP)
Personal preference from image generation.
Top_Speaker_7785@reddit
anyone tested this for tool calling/structured output? the uncensored models sometimes break json formatting in my experience
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Wide_Amount5369@reddit
Absolutely crazy 😍
EvilEnginer@reddit (OP)
Thanks 😄
ps5cfw@reddit
I've really never managed to get anything good out of APEX Quants when using them with all coding agents. They just go off on the wrong tangent and / or make wrong tool calls, or start looping heavily. And I've always gone for the QUALITY presets, which should be the one with the best results.
EvilEnginer@reddit (OP)
Yep. So far this is main reason why my friend is using Q8_K_P. Can't do quality APEX. I am on RTX 3060.