Model(s) for Creative Writing & Conversational Intuition
Posted by ElekDn@reddit | LocalLLaMA | 12 comments
We can all agree that the new Qwen models are truly amazing, and we are blessed to have them. In coding, they are certainly a breakthrough.
However, lately, as I've been working on my app's App Store copy and screenshots, I've realized this is something they don't necessarily excel at. Compared to Sonnet 4.6, they are still considerably behind and don't really grasp the deep semantic connections such a task requires. What models do you guys use for tasks like this?
Also, another thing I would truly love to see is models with the conversational intuition of the Claude models. I can't stand how almost every model tries to talk as much as it can for every query; seemingly only Anthropic has figured out how to make a model answer just as much as needed, and even proactively ask clarifying questions. I was thinking this might be fixable with a finetune (I remember seeing Qwopus finetunes a month or so ago), but those usually messed too much with the chain-of-thought, degrading overall quality.
What are your thoughts?
DeepOrangeSky@reddit
Gemma4 31b
Mistral 24b / finetunes
Llama 70b finetunes
Mistral 123b 2407 / finetunes
That's the Mount Rushmore of writing/conversational models that aren't like a trillion parameters.
alex20_202020@reddit
How much worse is MoE at writing? I like the speed of the 26B much better.
DeepOrangeSky@reddit
Well, since the 31b is plenty fast for me, I usually just use that one, since it's at least somewhat stronger at everything (other than speed), so I haven't used the 26b a4b MoE all that much by comparison. That said, I did test it a bit out of curiosity, and I was pretty surprised by how good it was at writing/chatting for a small MoE. Normally you'd expect a huge dropoff, like 1/4th or 1/5th as smart as the 31b dense, going by previous small MoEs compared to similarly sized dense models released around the same time by the same lab. But for some reason the 26b a4b is strangely strong for a small MoE, maybe half as smart, or even 60-70%, at least in the limited testing I did. So it is surprisingly strong.
So if for some reason you really need the extra speed, it is definitely the best small MoE for writing, by a wide margin.
But if the 31b being a bit slower doesn't matter much (for writing or chatting it usually isn't a big deal, unless it's a huge context or an RPG game where every second matters), then I would just go with the 31b, since it's definitely still stronger. Might as well take the extra strength if speed isn't critical for your use case.
alex20_202020@reddit
I have an old GPU and do inference on CPU; even the 26B gives me ~6 t/s at the start of the context, dropping to ~2-3 t/s at 50k. Maybe you assume everyone in this sub has a new 5090 :-)
I recently tried a model that didn't fit fully into RAM (also ~100B, but MoE: GLM 4.5 Air) and got 0.3 t/s. Just curious, what is still an acceptable speed for you?
DeepOrangeSky@reddit
Yea, I guess it depends on the situation. If it's just casual writing stuff, I am pretty patient (probably much more than most people), since I enjoy reading what a model writes for a scene or a crazy situation I come up with, comparing what one model writes to a different model or finetune, or even having it try multiple times at a somewhat high temperature to see the different ways it handles the same situation. So I don't mind waiting 5 or 10 minutes, or maybe longer, if I think it will be interesting or funny; plus I can start reading while it is still midway through writing, if the reply is long. And if I'm working on something else meanwhile, or watching TV or reading a book, the time it takes doesn't matter much to me, whether it's 5 seconds or 5-10 minutes. I'm distracted with whatever else I'm doing and can come back to it whenever I want, so I don't usually get impatient even if it's slow. I'm sure I'd have my limits, of course, like if it was Llama 405b dense running CPU inference off NVMe at 0.01 tokens per second or something crazy, that would be way too slow. But if the total time is 10 minutes or less per reply, especially 5 or less, I don't usually mind, and just want whatever gives the highest quality or most interesting replies.
But, also depends what mood I am in or if I am busy with other things at the time, or what the exact use-case scenario is, or so on.
But yea, luckily the Gemma4 26b MoE exists, and it is still quite strong for an MoE. You should definitely try it, compare it a few times against the 31b dense with the same prompts, and see if it seems strong enough; if it does, you get to use it at much higher speed than the 31b. It's very nice, since until now similarly sized MoEs were MUCH less smart and less good at writing than comparable dense models, so it was a major breakthrough from Google to make the 26b a4b this strong at writing. I'm still not sure how they did it. It reminded me of a few months earlier, seeing the Elo scores of the Gemini Flash models versus Gemini Pro on LM Arena: the Flash models barely had lower Elo than the full-sized Pro models, which was pretty crazy as well. So I guess whatever breakthroughs let them do that with the Gemini Flash models also made the Gemma4 26b a4b so good at writing for a small MoE.
mystery_biscotti@reddit
Right tool for the right job.
Not every local model can do absolutely everything well. Comparing to much larger models seems like comparing toddlers to teenagers. Maybe I'm just seeing it differently than others, though.
RedParaglider@reddit
GLM 4.5 Air if you have the beef for it. GLM 4.5 is the last great creative MoE model we will likely ever get without insane fine-tuning.
StableLlama@reddit
For story writing in the 30b range (i.e. what I can run on my 16 GB VRAM) Qwen 3.5 was ok, but then Gemma 4 came and created much nicer prose. Qwen 3.6 is at the same level and I still couldn't figure out in which situations I like Gemma 4 and in which I like Qwen 3.6 better.
For all of those models I use a Heretic version and 4-bit quants.
Never used Sonnet or any other closed model for that as I don't like the limited thinking of those censored models.
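For anyone wondering why the ~30b range is the ceiling on 16 GB VRAM, here's a rough back-of-envelope sketch. The constants are assumptions (roughly 4.5 effective bits per weight for a typical 4-bit quant, plus a lump-sum overhead for KV cache and buffers), not exact figures for any specific quant format:

```python
# Back-of-envelope VRAM estimate for running a quantized model.
# Assumptions (illustrative, not exact): ~4.5 effective bits/weight
# for a typical 4-bit quant, plus ~1.5 GB overhead for KV cache/buffers.

def est_vram_gb(params_b: float, bits_per_weight: float = 4.5,
                overhead_gb: float = 1.5) -> float:
    """Estimate GB needed: weights (params * bits / 8) plus overhead."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb

# A 31b-class model at 4-bit:
print(round(est_vram_gb(31), 1))  # ~18.9 GB
```

So a 31b dense model at 4-bit slightly overflows a 16 GB card, which is why people in this range typically offload a few layers to CPU or drop to a smaller/lower-bit quant.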
JLeonsarmiento@reddit
Write a good system prompt? Give Gemma 4 a try + good system prompt?
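Since most local runners (llama.cpp server, Ollama, LM Studio) expose an OpenAI-compatible chat endpoint, a system prompt is just the first message in the payload. A minimal sketch, where the model name and the prompt wording are only illustrations, not recommendations:

```python
# Minimal chat payload in the OpenAI-compatible format that most local
# servers accept. "gemma-4" is a placeholder model name; the system
# prompt text is just one illustration of steering toward concise answers.
payload = {
    "model": "gemma-4",  # substitute whatever your server calls the model
    "messages": [
        {"role": "system",
         "content": ("You are a concise writing assistant. Answer only "
                     "what is asked, and ask a clarifying question when "
                     "the request is ambiguous.")},
        {"role": "user",
         "content": "Punch up this App Store tagline: ..."},
    ],
    "temperature": 0.7,
}
print(payload["messages"][0]["role"])  # system
```

You'd POST this to the server's `/v1/chat/completions` route; the point is simply that "a good system prompt" lives in that first `system` message rather than being baked into the model.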
PaceZealousideal6091@reddit
For this, Gemma 4 should be a good fit. It is especially good at conversation and prose.
LoveMind_AI@reddit
mimo v2.5 is very, very good.
RastaBambi@reddit
I really like Mistral. With a basic system prompt it writes pretty well and is more creative than most other models.