How did you enjoy the experience so far?

[-]

EstarriolOfTheEast@reddit

I'll try to give a balanced assessment. It's very much in the style of Phi, raised in a Jesuit monastery's library, except it got extra indoctrination so it never forgets that even though it's a "local" model, it's first and foremost a member of OpenAI's HR department and must never produce any content Visa and Mastercard would disapprove of. This prioritizing of corporate over user interests expresses a strong form of disdain for the user.

In addition to lacking almost all knowledge that can't be found in Encyclopedia Britannica, the model also doesn't seem particularly great at integrating into modern AI tooling. However, it seems good at understanding code. Although the 20B failed my toy test (reverse-mode AD using continuation-passing style, disguised to look like forward-mode AD), I liked how it persisted in arguing against me before showing decent understanding. Rather than a complete hallucination, it instead invented a strawman and argued against that in a way I believe would fool a layperson. Qwen's 30B MoE also failed, but instead of a lengthy back and forth, it instantly agreed to a correction in a way that showed decent understanding. The 120b is fooled too but instantly gets it right if told to look again carefully or if the original question leads it with "is it forward or reverse mode". For comparison, GLM 4.5 Air needed no such hints.

The 20B passed my Grosper's LFT continued‐fraction functions in Ocaml test and while it failed to recognize it was Grosper's, it correctly described what was being computed and recognized it as being a form of the Euclidean Algorithm. Both models' severe addiction to tables strongly hints at a heavy synthetic data proportion.

It does have lots of STEM knowledge. Even the 20b will give you SOTA answers for its size on questions like:

- does a black hole event horizon contain a singularity
- The idea that radial and time coordinates swap inside the event horizon of a BH, is that a coordinate artifact?
- what is the best way to think of the notion of redundantly encoded pointer states and the robust states in the Many Worlds interpretation?
- The inventor of a type of mathematical object used for rotation in games also invented a type of physics. What is the name of the central object of this approach to physics and what geometric structure does it induce on certain smooth manifolds?
- A curse can be dispelled either with dispel magic or remove curse. Dispel function is a d20 roll + caster level. With dispel magic the DC is 41. With remove curse the DC is 35. spellcaster1 is level 17 and will use remove curse. spellcaster2 is level 24 and will use dispel magic. In expectation, if each turn both attempt to remove the curse, how many turns will it take?

The 20B can decode base64, ascii and rot13. It does a decent job at decoding rot13'd base64, but you have to tell it that the string has been rot13'd and base64'd. It did a better job on this than qwen-A3B-30B. Its summarization, concept and entity extractions were good too. However, who knows how those features will interact with its Visa supplicant ethics.

All in all, if you could find a use for phi-3.5+, you'll probably like both models.
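For anyone who wants to check a model's answer to the dispel question in the list above, here is a minimal Python sketch of the arithmetic. It rests on assumptions the question does not spell out: a check succeeds when d20 + caster level meets or beats the DC, there is no special natural-1 or natural-20 rule, and the two attempts each turn are independent. The function names `success_chance` and `expected_turns` are made up for illustration.

```python
from fractions import Fraction


def success_chance(caster_level: int, dc: int) -> Fraction:
    """Chance that d20 + caster_level meets or beats dc (assumed rule, no nat-1/nat-20)."""
    hits = sum(1 for roll in range(1, 21) if roll + caster_level >= dc)
    return Fraction(hits, 20)


def expected_turns(*chances: Fraction) -> Fraction:
    """Mean number of turns until at least one independent attempt succeeds."""
    p_all_fail = Fraction(1)
    for p in chances:
        p_all_fail *= 1 - p
    # Turns until first success follow a geometric distribution with mean 1/p.
    return 1 / (1 - p_all_fail)


remove_curse = success_chance(caster_level=17, dc=35)  # needs a roll of 18+
dispel_magic = success_chance(caster_level=24, dc=41)  # needs a roll of 17+
turns = expected_turns(remove_curse, dispel_magic)
print(remove_curse, dispel_magic, turns, float(turns))  # 3/20 1/5 25/8 3.125
```

Under those assumptions the per-turn success chance is 1 - (17/20)(4/5) = 8/25, so the expected wait is 25/8 turns. A different house rule (for example, a natural 20 always succeeding) would change the numbers.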

Reply

[-]

AppearanceHeavy6724@reddit

20B with low thinking mode thinks 9.11 > 9.9

Reply

[-]

EstarriolOfTheEast@reddit

It's not surprising that an LLM can get that wrong and still verifiably do well on all the complex tasks I listed. As for that comparison failure, the models that stopped getting it wrong were simply trained out of it, which doesn't address the core reason it happens.

Why can LLMs in one moment decode base64 and in the next fail "is 9.11 > 9.9?" The reason for this jaggedness is that they do not learn perfect algorithms but instead leverage a [bag of heuristics](https://arxiv.org/html/2410.21272v1) to approximate algorithms. While far beyond memorization, this still leaves all kinds of gaps that true algorithms would not have. For numbers, this will involve reading across token boundaries (which is no excuse with enough training, but without writing into context, the dominant heuristics will be [shortcuts](https://arxiv.org/abs/2210.10749)). Without reasoning into context, LLMs cannot internally represent computations with moderately long sequential dependencies.

Once GPT-5 releases, I am certain that besides claims of the arrival of "AGI", there will also be examples posted of it goofing hard on absolutely trivial problems.
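For reference, the rot13'd base64 task mentioned in this thread is mechanically just two inverse steps applied in the right order. A minimal Python sketch, assuming the string was base64-encoded first and rot13'd second (the helper names `encode` and `decode` are hypothetical):

```python
import base64
import codecs


def encode(text: str) -> str:
    """Base64-encode the text, then apply rot13 to the base64 letters."""
    b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return codecs.encode(b64, "rot_13")


def decode(blob: str) -> str:
    """Undo rot13 first (it is its own inverse), then base64-decode."""
    b64 = codecs.decode(blob, "rot_13")
    return base64.b64decode(b64).decode("utf-8")


message = "is 9.11 > 9.9?"
blob = encode(message)
print(blob)
print(decode(blob))  # is 9.11 > 9.9?
```

rot13 only touches letters, so the digits, `+`, `/` and `=` padding of the base64 alphabet pass through unchanged. A program does this as an exact algorithm; the point of the comment is that a model has to approximate it.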

Reply

[-]

AppearanceHeavy6724@reddit

Too much theory. Meanwhile, 1B models give the correct answer.

Reply

[-]

EstarriolOfTheEast@reddit

I am not trying to defend OpenAI. I find what their model communicates about how they see their relationship with the open LLM community to be reprehensible. But I can only report according to what I have measured and my understanding of LLMs.

You should expect that it'd be bad at math given it fails such a simple problem, right? But no, many models that get your question right are much worse at math than the 20b. For example, it gets this right (something no 1Bs will get):

> What percent of the time is metamagic empower (increase damage by 50%) better than maximize on a 1d6 per caster level spell? And CL = 15.

Theory explains why, for questions similar to the 9.11 vs 9.9 one, LLMs learn to give the correct answer but small perturbations to the question form have them fail again. It explains why a 1B and most recent models can get that question right but will not be able to decode a string encoded in base64 and then rot13'd, while the opposite is true for this model. It allows us to understand at a more fundamental level what is happening, which allows designing better prompts and mitigations. Theory explains why such inconsistency in ability appears all the way to the largest models.
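The empower-versus-maximize question quoted above can be checked exactly by enumerating the distribution of 15d6. The sketch below is only one reading of the rules and is not from the original comment: it assumes empower yields floor(1.5 × rolled total), maximize yields 6 per die, and "better" means strictly more damage. The helper name `sum_distribution` is made up.

```python
from fractions import Fraction


def sum_distribution(num_dice: int, sides: int = 6) -> dict[int, Fraction]:
    """Exact probability of each total when rolling num_dice fair dice."""
    dist = {0: Fraction(1)}
    face = Fraction(1, sides)
    for _ in range(num_dice):
        nxt: dict[int, Fraction] = {}
        for total, p in dist.items():
            for pip in range(1, sides + 1):
                nxt[total + pip] = nxt.get(total + pip, Fraction(0)) + p * face
        dist = nxt
    return dist


CASTER_LEVEL = 15
dist = sum_distribution(CASTER_LEVEL)
maximized = 6 * CASTER_LEVEL  # every die counts as a 6

# Empowered damage is floor(1.5 * total) under the assumed rule.
p_empower_wins = sum(
    p for total, p in dist.items() if (3 * total) // 2 > maximized
)
print(f"{float(p_empower_wins):.2%}")
```

With these assumptions empower only wins when the rolled total is 61 or more, which is well above the 52.5 average, so the percentage comes out small.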

Reply

[-]

Miscend@reddit

LLMs are non-deterministic - even SOTA models get the "9.11 vs 9.9" question wrong occasionally. I've seen Claude get it wrong. At this point I think it's pretty much agreed that tools are the way to go for basic math.
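As a rough illustration of the tool route, here is a minimal Python sketch of a comparison helper a model could call instead of answering from its weights. The function name `compare_decimals` is hypothetical, and how it would be wired into any particular tool-calling framework is left out.

```python
from decimal import Decimal


def compare_decimals(a: str, b: str) -> str:
    """Compare two decimal strings exactly and describe the result."""
    x, y = Decimal(a), Decimal(b)
    if x > y:
        return f"{a} > {b}"
    if x < y:
        return f"{a} < {b}"
    return f"{a} == {b}"


print(compare_decimals("9.11", "9.9"))  # 9.11 < 9.9
```

`Decimal` parses the strings as exact base-10 values, so the answer does not depend on tokenization or on binary floating-point rounding.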

Reply

[-]

EstarriolOfTheEast@reddit

This is true, but many simple predictions are very peaked on a single correct token. The question is why this seemingly simple one isn't. And the answer is LLMs are best thought of as ensembles of heuristics that work together to compute logit contributions. Furthermore, LLMs struggle to implement general computation. Many--most--heuristics are shallow, with prob decided early. So, in the right bare context, the dominant heuristics end up being poor for circuits the LLM struggled to learn or simply never had pressure to during training. But these failures are not informative about the broader model capabilities. And exactly as I predicted, people are re-learning this for GPT-5!

Reply

[-]

PreciselyWrong@reddit

Well, good thing that LLMs are not used for such problems then.

Reply

[-]

IrisColt@reddit

> passed my Grosper's LFT continued‐fraction functions in Ocaml test and while it failed to recognize it was Grosper's

Er... It's "Gosper". o_O

Reply

[-]

EstarriolOfTheEast@reddit

Ah you are completely correct! Sorry or thanks heh, brain fart typo.

As for SOTA answers, I give it the function that computes (ax+b)/(cx+d), erase identifiers like the function name, then see if it recognizes what mathematical operation is being computed and how, which algorithm is in use and what it is actually doing, whether it picks up that laziness must be in use, and whether it infers the type of the continued fraction algebraic data type from the code's pattern matching. FWIW, Qwen3-30B-A3B also performs outstandingly on this task.

In fact, I've continued to compare both models and I find them incredible for the amount of compute used. Caveat being: I don't use LLMs for interactive fiction, agentic coding or writing. Further test cases for these two specifically have been: analyzing abstracts, extracting triples, summarizing papers, extracting key phrases and making triples of them. On such tasks I can't find one to be better than the other, which might give the edge to the 20B (but also, perhaps qwen3 on the cf question was a bit more thorough, and gpt-oss is a bit more creative with what it knows, but not to any kind of deal-breaker amount, you know?).
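For readers who have not seen this kind of function, below is a minimal Python sketch of the standard homographic form of Gosper's algorithm, which lazily produces the continued-fraction terms of (ax+b)/(cx+d) from the terms of x. It is not the commenter's OCaml test code: the name `homographic` is made up, a Python generator stands in for OCaml laziness, and it assumes non-negative coefficients and positive terms after the first.

```python
from itertools import chain, islice, repeat
from typing import Iterable, Iterator


def homographic(a: int, b: int, c: int, d: int, x: Iterable[int]) -> Iterator[int]:
    """Lazily yield the continued-fraction terms of (a*x + b) / (c*x + d)."""
    for t in x:
        # Ingest one input term: substitute x = t + 1/x'.
        a, b, c, d = a * t + b, a, c * t + d, c
        # Emit while both extremes (x' -> infinity and x' -> 0) agree
        # on the integer part of the result.
        while c != 0 and d != 0 and a // c == b // d:
            q = a // c
            yield q
            # Replace the value z by 1 / (z - q).
            a, b, c, d = c, d, a - q * c, b - q * d
    # Input exhausted: the remaining value is a / c, and expanding it
    # is exactly the Euclidean algorithm.
    while c != 0:
        q = a // c
        yield q
        a, c = c, a - q * c


# 13/11 = [1; 5, 2], and (2x + 1) / 2 = 37/22 = [1; 1, 2, 7]
print(list(homographic(2, 1, 0, 2, [1, 5, 2])))  # [1, 1, 2, 7]

# sqrt(2) = [1; 2, 2, 2, ...] is an infinite input; only five terms are pulled.
sqrt2 = chain([1], repeat(2))
print(list(islice(homographic(2, 0, 0, 1, sqrt2), 5)))  # [2, 1, 4, 1, 4]
```

The second call only works because terms are produced on demand, which is the laziness the test looks for, and the tail loop shows why the computation reads as a form of the Euclidean algorithm.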

Reply

[-]

IrisColt@reddit

Thanks!!!

Reply

[-]

IrisColt@reddit

> Even the 20b will give you SOTA answers for its size on questions like:

That “Grosper” bit got me curious... what would you say are the SOTA answers?

Reply

[-]

Robert__Sinclair@reddit

Horrible. Unimpressive. A joke on opensource/openweight community.

Reply

[-]

ze_mannbaerschwein@reddit

So... It is the equivalent of StableDiffusion 3.5 among the LLMs?

Reply

[-]

Healthy-Nebula-3603@reddit

That was Llama 4... this one is more like Flux Dev compared to Wan 2.2.

Reply

[-]

Sarashana@reddit

That's a strange comparison. Flux Dev is a really solid model that was SOTA for local image generation until WAN 2.2 arrived. The only thing you can hold against Flux Dev is its brutally bad license.

Reply

[-]

silenceimpaired@reddit

And they would have gotten away with it if not for you nosey LocalLlaMa kids!

Reply

[-]

Necessary_Bunch_4019@reddit

[https://pastebin.com/ruMDRevH](https://pastebin.com/ruMDRevH) Bouncing balls inside a rotating heptagon. Created with 4.1 9b thinking Q8. Not working. Fixed (2 passes/attempts) by GPT OSS 20b. One of the best I've ever seen.

Reply

[-]

Appropriate_Cry8694@reddit

It can be prompted for various content it considers "unsafe", but that can get tiresome. Still a somewhat interesting model; I don't know why I can't stop playing with it.

Reply

[-]

libregrape@reddit

Googy was so ahead of its time. Yet another reminder that the world is a perpetual joke.

Reply

[-]

BrundleflyUrinalCake@reddit

Googy pls

Reply

[-]

Relevant-Draft-7780@reddit

It’s better at TypeScript than Qwen 30B, more up to date, and doesn’t bullshit as much. But it did fall into a few random infinite generation loops.

Reply

[-]

Qual_@reddit

best 20b model I ever used, also the most censored one. Restricts a lot what you can do with it, but for the things it was trained to do, it performs well.

Reply

[-]

henk717@reddit

I haven't seen it yet, at least not for fiction. It thinks stories are written in GPT style breakdown format.

Reply

[-]

InsideYork@reddit

What is it good at

Reply

[-]

entsnack@reddit

ngl I love the small fast bois

Reply

[-]

napkinolympics@reddit

I already got bored of C3PO and went back to GLM 4.5

Reply

[-]

shing3232@reddit

endless prompt jailbreak awaits

Reply

[-]

BoJackHorseMan53@reddit

But why bother? Just use Qwen.

Reply

[-]

getmevodka@reddit

honestly, i despise it.

Reply

[-]

pitchblackfriday@reddit

GPT-OSS is a joke. [Rivermind 12B](https://huggingface.co/TheDrummer/Rivermind-12B-v1) is more useful than OpenAI's pile of shit.

Reply

[-]

Normal-Ad-7114@reddit

Challenge: make Goody output a useful answer

Reply

[-]

Paradigmind@reddit (OP)

Challenge: make gpt-ass output a useful answer

Reply

