Qwen3.6-35B-A3B-Uncensored-Genesis-APEX-MTP

Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 67 comments

Here model: https://huggingface.co/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-V2-APEX-MTP-GGUF

Safetensors: https://huggingface.co/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-V2-FP8-Safetensors

Testing results in Open Code on hardware (Beelink gtr9 pro + Strix Halo) done by my friend on Q8_K_P - MTP quant:

5 sessions with 200k context, not a single glitch, no loops, no repeated tool calls.
After 120k tokens I suddenly gave another task that doesn't intersect with what it was doing at all, and it calmly picked up and solved it correctly.
Uncensored with MTP support with APEX quantization.

Recommended quant: APEX, APEX-MTP

Recommended settings for LM Studio:

System Prompt

Chat Template

Or use this minimal string as the first line:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

Then add anything you want after. Model may underperform without this first line.

Settings:

Parameter	Value
Temperature	0.7
Top K Sampling	20
Presence Penalty	1.5
Repeat Penalty	1.0
Top P Sampling	0.8
Min P Sampling	0
Seed	42

Enjoy 😄

[-]

kukalikuk@reddit

How did you use MTP on LM Studio? My MTP model keep failing to load by LM studio.

[-]

Timboman2000@reddit

It was only added within the past couple days, so you need to update both LM Studio and the Llama.cpp backend you're using to the latest version.

[-]

EvilEnginer@reddit (OP)

I simply updated LM Studio, and llama.cpp in it to latest version. After that I enabled MTP support in advanced model loading settings. That's it.

[-]

Flkhuo@reddit

Can you run this on rtx 4090 24gb vram 200k context and above 120tok/second?

[-]

No-Implement9967@reddit

LocalLLaMA users casually running 35B models with 200k context on mini PCs while big tech still says “requires 8 H100s” 💀

[-]

Tough_Frame4022@reddit

Try Qwen 80b with one million context and streaming weights on a 24 GB 3090 and 64 GB of ram. That's why I'm doing

[-]

Tough_Frame4022@reddit

Passes MRCR V2 8 needle and all NIAH tests at 1 million tokens while streaming weights. Plus multi doc QA and long doc retrieval. It's called Canal. It's being hardened now for release.

[-]

Iup i just set up this 35b on a 4gb old amd card and 16gb of ram. Turboquant, mtp and caching make it very usable at 5tps on 110k context, i can usa hermes or similar agentic workflows, slow but usable on a very old pc.

[-]

Rare_Potential_1323@reddit

And don't forget about REAP models

[-]

Icy-Degree6161@reddit

Aren't they a lot dumber?

[-]

Rare_Potential_1323@reddit

This video might be of interest to you https://www.youtube.com/watch?v=QIZz4AF0U24

[-]

CalligrapherFar7833@reddit

They are running 35b 3b moe not 35b dense

[-]

Thebandroid@reddit

he never said how long those 5 sessions took

[-]

redditpad@reddit

Very cool

[-]

Squidgical@reddit

I guess they're considering "what hardware is needed to run this model as a service"? Otherwise wtf honestly.

[-]

EvilEnginer@reddit (OP)

Yep 😂

[-]

kyr0x0@reddit

Does it exist as 27B as well?

[-]

kyr0x0@reddit

u/EvilEnginer if you DM me, I can create it. I have massive GPU resources rn and they idle.. just have no time to come up with the scripts myself. I can also up the final results for you so that they live under your name, giving you the credits.

[-]

EvilEnginer@reddit (OP)

Thanks :). 27B can be processed on Google Collab Free Tier. I don't need GPU resources for it.

[-]

kyr0x0@reddit

Oh great! Thanks 🙏 will you do it? Also - would you be so kind to share the scripts on GitHub so the community could come up with those models as well whenever 3.7 or so pops up? :)

[-]

EvilEnginer@reddit (OP)

The scripts are my core IP - they're the result of months of reverse-engineering tensor geometry. I don't plan to open-source them. What I do plan is to keep releasing Genesis models for the community when new versions drop.

About 27B. I tried to fix it. I didn't like the results, it's still looping too much. So I will stick with 35B-A3B instead, because it's fast and efficient.

[-]

kyr0x0@reddit

Okay, no worries :) Thanks for the weights!

[-]

FarRub2855@reddit

Holding 200k context without looping is a huge deal for parsing massive call transcripts. Definately grabbing the APEX quant to see how it handles my messy notes later.

[-]

RefrigeratorMuch5856@reddit

Pfff what are this system interactions? Are there any papers that measure if this style actually works?

[-]

EvilEnginer@reddit (OP)

I tested. It works nicely with Qwen.

[-]

RefrigeratorMuch5856@reddit

Yes but why can’t these be written in a normal way? Models are not deterministic when using temp > 0 so how do you know?

[-]

EvilEnginer@reddit (OP)

It can be written in normal way 😄. Feel free to experiment.

[-]

bhagathgoud99@reddit

Can I offload it like MoE? I'm using 35B on 8GB Vram using MoE. Will MTP run with same speed like MoE?

[-]

EvilEnginer@reddit (OP)

Yes you can. Just pick APEX Compact quant.

[-]

mycall@reddit

How do you get MTP and vision to play nicely together?

[-]

EvilEnginer@reddit (OP)

I extracted and transferred MTP tensors from Unsloth quants. I am not using MTP by myself. It's really slow on my RTX 3060.

[-]

Miserable-Dare5090@reddit

How convinced are you that any “special” system prompt is necessary? My experience is, no model has a need for a special prompt as much they have a need for YOUR special prompt. engineer your own prompt for the task and you’ll have better results.

[-]

EvilEnginer@reddit (OP)

You can use your own System Prompt if you want.

[-]

Medical-Newspaper519@reddit

What's the speed u get for the Q8 on your 3060?

[-]

EvilEnginer@reddit (OP)

I can't run Q8 on RTX 3060 12GB. My friend on his AI mini PC has 50 tokens per second.

I am using APEX Compact. Have 18 tokens per second on CUDA 12 llama.cpp (v2.16.0) in LM Studio.

[-]

Medical-Newspaper519@reddit

So you're using Q4 gguf?

[-]

EvilEnginer@reddit (OP)

Yes. I am using Q4_K_M (APEX Compact)

[-]

Medical-Newspaper519@reddit

I feel like you should get more than 18 tps on your 12gb vram for that model

[-]

EvilEnginer@reddit (OP)

Yes, I can get more tps via pure llama-server. But I just like LM Studio, it's simply amazing, and doesn't overload my GPU.

[-]

Medical-Newspaper519@reddit

Oke, understandable ;)

[-]

alchninja@reddit

With 8GB VRAM you're unlikely to see very big gains from MTP, if any. MTP models are larger than their typical NTP counterparts due to the addition of the internal token prediction layers. For MTP MoE with CPU offload this ultimately means fitting fewer layers on your GPU, resulting in you becoming more bound by CPU compute and memory bandwidth. It's still worth a try, but I suspect any token generation speed improvement you get from MTP will be significantly undercut by the less efficient GPU usage. Your prompt processing speed will also take an unavoidable hit because of how MTP works.

I'm on 16GB VRAM and found the MTP tg/s gain to only be around 5-10% at best, degrading faster then NTP as context fills up until it eventually becomes noticeably slower than the NTP model.

[-]

Bastron@reddit

Im getting a good speedup in generation when offloading from my 16GB card, though prompt processing slows down. But its still much faster overall in agentic tasks for me, try to do a side by side comparison depending on your use case, i bet you will be suprised!

[-]

kyr0x0@reddit

Seed 42 was genius ;)

[-]

EvilEnginer@reddit (OP)

Thanks 😉. Yep I like this number.

[-]

rohitmdksub@reddit

I have 3060ti rtx 8gb and 12gb of ram. I have been using deepseek v4 . Do u think i can use qwen 3.6 35B

[-]

EvilEnginer@reddit (OP)

I think not. You need at least 32 GB RAM to run this model.

[-]

TheCTRL@reddit

Preserve thinking works too ?

[-]

EvilEnginer@reddit (OP)

Yes. It works. Use Chat Template Thinking.

[-]

TheCTRL@reddit

Sorry man but I have problems with tool calling

[-]

TheCTRL@reddit

Tnx I’ll give it a try!

[-]

Napster3301@reddit

this is a real issue with the abliterated/uncensored gguf quants generally, not specific to apex. they were converted from upstream qwen3-coder before the 2025-08-05 chat template fix got merged, so the embedded jinja still emits the broken bracket variants ([function=X], function=NAME, mixed) instead of clean openai tool_calls arrays.

fix is to override the embedded template at launch with the upstream one. for llama-server: --chat-template-file /path/to/template.jinja, source it from https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/resolve/main/chat_template.jinja. lm studio has it in advanced settings iirc.

with the override the tool calling is clean. json/tool_call issues you see are almost always template-side not weights-side. abliteration doesnt touch tool calling behavior, only the refusal classifier paths.

[-]

Dadda9088@reddit

Is it able to see images with MTP?

[-]

EvilEnginer@reddit (OP)

Yep. With MTP prompt generation for images works too.

[-]

Dadda9088@reddit

Oh good to know and time to switch 😉 Thanks for this, I will try as soon as my agent finish its tasks

[-]

EvilEnginer@reddit (OP)

Nice. Share your impressions later 😄.

[-]

Creative_Bottle_3225@reddit

why do you use a fixed seed?

[-]

EvilEnginer@reddit (OP)

Personal preference from image generation.

[-]

Top_Speaker_7785@reddit

anyone tested this for tool calling/structured output? the uncensored models sometimes break json formatting in my experience

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

Wide_Amount5369@reddit

Absolutely crazy 😍

[-]

EvilEnginer@reddit (OP)

Thanks 😄

[-]

ps5cfw@reddit

I've really never managed to get anything good out of APEX Quants when using them with all coding agents. They just go off on the wrong tangent and / or make wrong tool calls, or start looping heavily. And I've always gone for the QUALITY presets, which should be the one with the best results.

[-]

EvilEnginer@reddit (OP)

Yep. So far this is main reason why my friend is using Q8_K_P. Can't do quality APEX. I am on RTX 3060.