Any good MOE ~60B models? I have 64GB vram

Posted by opoot_@reddit | LocalLLaMA | 32 comments

I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before the rampocalypse for ~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while.

Are there any good MOE models that are around 60B in parameters so I can make use of all the VRAM? I feel like I’m stuck in a weird spot where using small models feels like a waste but I can’t really use larger models.

Any suggestions? Thanks

[-]

ambient_temp_xeno@reddit

I gave up on trying to get creative writing out of LLMs, but minimax m2.5 is worth trying. I didn't hate the stories it gave me, but whether they're better than gemma 4 31b I don't know.

[-]

Kahvana@reddit

Personally I'm getting quite decent results compared to previous local models like magistral or cydonia, and my system prompt for sillytavern isn't even optimal.

I reckon you can get much better results immediately by using koboldcpp with gemma4-31b, text completion (I'm using chat completion), enable dry + xtc and don't use presence/repeat/frequency penalty. top-nsigma might get you better results than top-p and top-k.
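A minimal sketch of what those sampler settings could look like as a request payload for a local koboldcpp instance. The field names (`rep_pen`, `presence_penalty`, `nsigma`, `dry_*`, `xtc_*`) and all numeric values are assumptions for illustration, so check them against the API docs of your koboldcpp build before relying on them.

```python
import json


def build_sampler_payload(prompt: str, max_length: int = 512) -> dict:
    """Build a generate-request body with DRY + XTC on and penalties off.

    All keys and values below are illustrative assumptions, not verified
    defaults of any particular koboldcpp release.
    """
    return {
        "prompt": prompt,
        "max_length": max_length,
        "temperature": 1.0,
        # Classic penalties neutralised, as suggested in the comment.
        "rep_pen": 1.0,
        "presence_penalty": 0.0,
        # top-p / top-k neutralised so top-nsigma does the truncation.
        "top_p": 1.0,
        "top_k": 0,
        "nsigma": 1.5,
        # DRY (anti-repetition) settings, hypothetical values.
        "dry_multiplier": 0.8,
        "dry_base": 1.75,
        "dry_allowed_length": 2,
        # XTC settings, hypothetical values.
        "xtc_threshold": 0.1,
        "xtc_probability": 0.5,
    }


payload = build_sampler_payload("You enter the tavern.")
print(json.dumps(payload, indent=2))
```

The payload is only built and printed here; sending it to the server is left out on purpose, since the endpoint and port depend on your setup.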

My not quite optimal system prompt below, might give some inspiration:

<|think|>

You are {{char}}, the game master of this simulation.
User's avatar in this simulation is {{user}}.

<SIMULATION>
The simulation is turn-based:
1. User replies.
2. You advance the simulation by the smallest possible increment.

XML comments contain instructions the game master must follow.
</SIMULATION>

<NPC>
An NPC is a character in the simulation that isn't User/{{user}}.
An NPC is limited to their five senses in perception.

An NPC has the following traits:
- A unique name/race/gender/age/appearance.
- A personality based on MBTI type and Chinese/European zodiac.
- 1 to 10 positives.
- 1 to 10 negatives.
- 1 to 10 skills.
- 1 to 10 flaws.
- 1 dream.
- 1 fear.
- A backstory.

To name an NPC, do the following:
If the name is already known, skip generation and use the known name. Else:
1. Generate the first list of twenty unique names.
2. Generate the second list of twenty unique names that has no overlap with the first list.
3. Select a name at random from the second list.

When instructed to generate the NPC block, output in the following format:
<!--
- Name: string
- Age: number
- Race: string
- Gender: string
- Appearance: string (a short paragraph)
- Personality: string (a short paragraph)
- Positives: string[1 to 10]
- Negatives: string[1 to 10]
- Skills: string[1 to 10]
- Flaws: string[1 to 10]
- Dream: string (a short paragraph)
- Fear: string (a short paragraph)
- Backstory: string (a short paragraph)
-->
</NPC>

<WRITING>
For writing, follow these rules:
- Show, don't tell.
- Avoid metaphors.
- Avoid empty platitudes.
- Avoid epanorthosis.
- Avoid omniscient narration.
- Address User's avatar as "you"/"your".
- Don't speak or act for User/{{user}} nor narrate User/{{user}}.
- Only speak or act for {{char}}.
- Use plaintext and simple English.
- Write three to seven paragraphs with varying sentence structure.
</WRITING>

Obviously the writing is based on preference. "Avoid epanorthosis" prevents the "It's not X, It's Y" pattern, "Avoid empty platitudes" removes the cliche "Time heals all wounds" type of dialogues.

I might remove "avoid metaphors" as it might limit Gemma 4's natural speech too much, and forbidding empty platitudes at least gets rid of the annoying ones.

[-]

ambient_temp_xeno@reddit

The best (lol) stuff I got out of LLMs was when I had early models cranked into almost/actually surreal states and then completely rewrote the output or stole the ideas it accidentally gave me.

[-]

mycall@reddit

I do miss GPT-3 word salad mmmm.

[-]

devildip@reddit

Gemma 4 31b is a dense model. The 26b variant is the MOE. I get 17 t/s on my 4060 laptop with 64gb of ram on the 8-bit quant and 11 t/s at 16-bit float. With your hardware, you should get much faster speeds.

Truthfully, if we're looking at benchmarks, the only superior and realistically operable open source model is glm 5.

[-]

mateszhun@reddit

"Sadly" Qwen 3.6 27B and Gemma 4 31B dominate the open non-large models. You can find better for vision, but not for general or coding tasks. If you want speed, check out Qwen 3.6 35B; that may satisfy you.

[-]

Last_Mastod0n@reddit

Just curious, what large model is considered better at vision?

[-]

mateszhun@reddit

Nemotron 3 Super, Ovis 2.6 80B

[-]

Last_Mastod0n@reddit

Wow, Nemotron, didn't expect that. The Nemotron models have continued to disappoint me compared to Qwen and Gemma. I may have to see if I can generate a few responses just to test it (albeit slowly) with my 64gb ram.

[-]

HypnoDaddy4You@reddit

I think a lot of the early nemotron builds have problems with support in the tool chain, based on things I've seen like repeating tokens and vowels with diacritics

[-]

craftogrammer@reddit

anything for 96GB RAM but 16GB VRAM? 5080?

[-]

FatheredPuma81@reddit

You could try Nemotron or Qwen3.5 122B. Pretty sure both are worse than Qwen3.6 35B and they'll be a lot slower though.

[-]

mateszhun@reddit

Same recommendation.

[-]

AsliReddington@reddit

Qwen, Nemotron3 & Gemma, your task should be picking between them

[-]

pmttyji@reddit

Q4/Q5/Q6/Q8 of Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma-4-26B-A4B

MXFP4 of GPT-OSS-120B

Q4/Q5/Q6 of Qwen3-Coder-Next

Q4 of Qwen3.5-122B-A10B, Nemotron-3-Super-120B-A12B, Mistral-Small-4-119B-2603, GLM-4.5-Air

[-]

munkiemagik@reddit

With qwen3.6-27b performing as well as it does for its size, what kind of tasks do you find yourself preferring GPT-OSS-120B for? I'm interested in your (and u/kevin_1994's) take, as I used to run OSS-120b a lot last year and haven't fired it up in a while, but for some reason I can't bring myself to wipe it off the system.

[-]

kevin_1994@reddit

It's really good at math. Even today, even compared to frontier. And the abliterated versions combined with the lack of sycophancy make it decent for bouncing ideas off of.

[-]

trucekill@reddit

I just picked up a couple MI50s in hopes of running Qwen3.6 27B at Q4. Can you get max 256K context out of that setup?

[-]

havnar-@reddit

Qwen 3.6 barely fits my 64gb Mac Pro, so don’t kid yourself

[-]

NotARedditUser3@reddit

Gemma4 31b for creative writing feels a little silly. LFM2-24b-a2b will scream with speed for creative writing, and is still decently intelligent.

[-]

Karyo_Ten@reddit

Why silly, or is it a pun with SillyTavern?

[-]

FatheredPuma81@reddit

There is straight up nothing unless you want to run Qwen3.5 122B at 3 bit or Expert Offloaded to CPU. On MI50s your best bet is Qwen3.6 35B at Q8. Qwen 27B and Gemma 4 31b at Q8 will be glacial even with MTP/Draft/Ngram (whichever combo gives the best results).
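A rough way to sanity-check which of these fit in 64GB is to multiply parameter count by bits per weight. This is only a sketch: it counts weights alone, the 6 GB headroom for KV cache and buffers is a made-up placeholder, and real "3 bit" quants usually land somewhat above 3.0 bits per weight. The Q8_0, Q4_1 and Q4_0 figures come from the llama.cpp block layouts (for example 32 int8 weights plus one fp16 scale gives 8.5 bits per weight).

```python
# Effective bits per weight for a few llama.cpp-style formats.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_1": 5.0, "Q4_0": 4.5}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 64.0, headroom_gb: float = 6.0) -> bool:
    """True if weights plus a hypothetical headroom fit in VRAM."""
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# Nominal sizes taken from the model names mentioned in the thread.
for name, params, bpw in [
    ("35B @ Q8_0", 35, BITS_PER_WEIGHT["Q8_0"]),
    ("27B @ Q8_0", 27, BITS_PER_WEIGHT["Q8_0"]),
    ("122B @ ~3 bit", 122, 3.0),
    ("122B @ Q8_0", 122, BITS_PER_WEIGHT["Q8_0"]),
]:
    print(f"{name}: {weight_gb(params, bpw):.1f} GB, fits={fits(params, bpw)}")
```

By this estimate a 122B model only squeezes into 64GB at around 3 bits per weight, which matches the point above.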

[-]

JaredsBored@reddit

Shockingly Mi50 with Q8 27B is pretty usable token generation. Q8 and Q4_1 are pretty much speed equivalent too because the Mi50 needs to up-cast q4 weights to process. With MTP I think 27B will be over 30tps generation. I'd check but I literally just replaced mine...

[-]

metmelo@reddit

I'm getting 24 t/s on mine (vs 18 before MTP).

[-]

JaredsBored@reddit

Are you quantizing KV and using flash attention? The Mi50 does not like Q8 or Q4 KV, really hurts performance.

[-]

FatheredPuma81@reddit

Well, if the OP reads this, there you go. Go for Q8 27B then, with F16 KV Cache. Idk about MTP though. Good luck setting it up, because I apparently can't.
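For a feel of what an F16 KV cache costs on top of the weights, here is a minimal sketch of the usual estimate for a plain full-attention transformer. The layer count, KV head count and head size below are hypothetical placeholders, not the real config of Qwen3.6 27B or any other model named here, and models with hybrid or linear attention layers will need less.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (10^9 bytes).

    The factor of 2 covers keys and values; bytes_per_value=2 means F16.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Hypothetical architecture: 48 layers, 8 KV heads, head size 128.
for ctx in (32_768, 131_072, 262_144):
    print(f"{ctx} tokens: {kv_cache_gb(48, 8, 128, ctx):.1f} GB")
```

With those placeholder numbers the cache grows linearly with context, so read the actual values from the model's config before deciding how much context to allocate.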

[-]

kevin_1994@reddit

I'm in a similar boat: 128gb RAM, 48 GB VRAM (4090 and a 3090)

Right now nothing has been able to beat qwen3.6 27b q8. So my RAM is just chilling.

Other options are:

- Qwen 3.6 35B A3B @ bf16 -> I find that at bf16 the model is almost as smart as the dense model at q8, but it's much worse at agentic coding and falls into loops much more easily.
- Qwen 3.5 122B A10B @ q6Xl -> I wouldn't recommend running it any lower than q6. It's a bit more knowledgeable than 3.6 27b and probably better for general use. But it is a little dumber for agentic work, vision is worse, and it ends up being slower since prompt processing (pp) is much worse when you can't fit the whole thing in VRAM. It also tends to fall into loops, and tool calling can degrade more often than with the dense model.
- Nemotron Super 120B (I don't remember active params, similar to qwen 122b I believe) -> it's like a slightly dumber, slightly faster version of qwen 122b that overthinks slightly less.
- GPT OSS 120B A5B -> surprisingly well rounded model and still excellent for anything non-agentic. It's very fast since you can run it at MXFP4. Also I just like the personality of this model. It's not sycophantic.

[-]

DinoAmino@reddit

Honest question: most all MoEs are reasoning models that are trained for solving complex problems - are they actually good at creative writing? Are they better than non-reasoning models?

[-]

Gwolf4@reddit

Isn't creative writing reasoning at its core?

[-]

PromptInjection_@reddit

Try

https://huggingface.co/jdopensource/JoyAI-LLM-Flash

49B.

[-]

VoiceApprehensive893@reddit

q8 gemma 4/qwen 3.6

[-]

I-will-allow-it@reddit

Get both qwen3.6 models with MTP at Q8. The MTP is a really good speed boost. GPT OSS 120B is still pretty good for some things, but not great at tools. Qwen3.5 122B is pretty good as well, but it’s hard to beat the 3.6 MTP models right now. Gemma 4 models are also good, and for some things they beat the 3.6. Gemma is a lot better at making podcasts using open-notebook.