TheaterFire

How did you enjoy the experience so far?

Posted by Paradigmind@reddit | LocalLLaMA | View on Reddit | 33 comments

So aside from dishing out neural lobotomies in the name of safety, what else can this model actually provide? I heard someone is brave enough to try fixing it. But unless you’re in it for the masochistic fun, is it even worth it?


33 Comments

EstarriolOfTheEast@reddit

I'll try to give a balanced assessment. It's very much in the style of Phi, raised in a Jesuit monastery's library, except it got extra indoctrination so it never forgets that even though it's a "local" model, it's first and foremost a member of OpenAI's HR department and must never produce any content Visa and Mastercard would disapprove of. This prioritizing of corporate over user interests expresses a strong form of disdain for the user. In addition to lacking almost all knowledge that can't be found in Encyclopedia Britannica, the model also doesn't seem particularly great at integrating into modern AI tooling.

However, it seems good at understanding code. Although the 20B failed my toy "Reverse mode AD using Continuation passing style disguised to look like Forward mode AD" test, I liked how it persisted in arguing against me before showing decent understanding. Rather than a complete hallucination, it instead invented a strawman and argued against that in a way I believe would fool a layperson. Qwen's 30B MoE also failed, but instead of a lengthy back and forth, it instantly agreed to a correction that showed decent understanding. The 120b is fooled too but instantly gets it right if told to look again carefully or if the original question leads it with "is it forward or reverse mode". For comparison, GLM 4.5 Air needed no such hints.

The 20B passed my Grosper's LFT continued‐fraction functions in Ocaml test and while it failed to recognize it was Grosper's, it correctly described what was being computed and recognized it as being a form of the Euclidean Algorithm.

Both models' severe addiction to tables strongly hints at a heavy synthetic data proportion.

It does have lots of STEM knowledge. Even the 20b will give you SOTA answers for its size on questions like:

- does a black hole event horizon contain a singularity
- The idea that radial and time coordinates swap inside the event horizon of a BH, is that a coordinate artifact?
- what is the best way to think of the notion of redundantly encoded pointer states and the robust states in the Many Worlds interpretation?
- The inventor of a type of mathematical object used for rotation in games also invented a type of physics. What is the name of the central object of this approach to physics and what geometric structure does it induce on certain smooth manifolds?
- A curse can be dispelled either with dispel magic or remove curse. Dispel function is a d20 roll + caster level. With dispel magic the DC is 41. With remove curse the DC is 35. spellcaster1 is level 17 and will use remove curse. spellcaster2 is level 24 and will use dispel magic. In expectation, if each turn both attempt to remove the curse, how many turns will it take?

The 20B can decode base64, ascii and rot13. It does a decent job at decoding rot13'd base64, but you have to tell it that the string has been rot13'd and base64'd. It did a better job on this than qwen-A3B-30B. Its summarization, concept and entity extractions were good too. Who knows how those features will interact with its Visa supplicant ethics, though.

All in all, if you could find a use for phi-3.5+, you'll probably like both models.
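
For readers who want to check the dispel question from the list above, here is a minimal sketch of one way to work it out. It assumes a check succeeds when d20 + caster level meets or beats the DC, that there are no automatic success or failure rules on a natural 20 or 1, and that the two casters roll independently each turn; none of those rules are stated in the question, and the function name `success_chance` is made up for this example.

```python
from fractions import Fraction


def success_chance(caster_level: int, dc: int) -> Fraction:
    """Chance that d20 + caster_level meets or beats dc (assumed rule)."""
    winning_rolls = sum(1 for roll in range(1, 21) if roll + caster_level >= dc)
    return Fraction(winning_rolls, 20)


# spellcaster1: level 17, remove curse, DC 35 (needs a roll of 18 or more)
p_remove_curse = success_chance(17, 35)
# spellcaster2: level 24, dispel magic, DC 41 (needs a roll of 17 or more)
p_dispel_magic = success_chance(24, 41)

# A turn succeeds if at least one of the two independent attempts succeeds.
p_turn = 1 - (1 - p_remove_curse) * (1 - p_dispel_magic)

# Turns until first success follow a geometric distribution with mean 1 / p.
expected_turns = 1 / p_turn

print(p_remove_curse, p_dispel_magic, expected_turns)  # 3/20 1/5 25/8
```

Under those assumptions the expectation is 25/8, or 3.125 turns. A different reading of the rules (for example, a natural 20 always succeeding) would change the per-caster probabilities but not the structure of the calculation.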

AppearanceHeavy6724@reddit

20B with low thinking mode thinks 9.11 > 9.9

EstarriolOfTheEast@reddit

It's not surprising that an LLM can get that wrong and verifiably do well on all the complex tasks I listed. As for that comparison failure, the ones that stopped getting that wrong were simply trained out of it, which doesn't address the core reason for why it happens. Why can LLMs in one moment decode base64 and in the next fail: is 9.11 > 9.9? The reason for this jaggedness is they do not learn perfect algorithms but instead leverage a [bag of heuristics](https://arxiv.org/html/2410.21272v1) to approximate algorithms. While far beyond memorization, it still leaves all kinds of gaps that true algorithms would not have. For numbers, this will involve reading across token boundaries (which is no excuse with enough training but without writing into context, the dominant heuristics will be [shortcuts](https://arxiv.org/abs/2210.10749)). Without reasoning into context, LLMs fail to be able to internally represent computations where there are moderately long sequential dependencies. Once gpt-5 releases, I am certain that besides claims of the arrival of "AGI", there will also be examples posted of it goofing hard on absolutely trivial problems.
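
For contrast with the heuristic approximations described above, the exact procedure behind the "rot13'd base64" task from earlier in the thread is only a few lines. This is a sketch using Python's standard library; the function names `encode` and `decode` are chosen for this example, and the order (base64 first, then rot13) follows the description in the thread.

```python
import base64
import codecs


def encode(text: str) -> str:
    """Base64-encode the text, then apply rot13 to the base64 string."""
    b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return codecs.encode(b64, "rot_13")


def decode(blob: str) -> str:
    """Undo rot13 first, then base64-decode the result."""
    b64 = codecs.decode(blob, "rot_13")
    return base64.b64decode(b64).decode("utf-8")


secret = encode("hello world")
print(secret)
print(decode(secret))
```

rot13 only touches letters, so the digits, `+`, `/` and `=` padding of the base64 alphabet pass through unchanged. Reversing the steps in the wrong order yields garbage, which is presumably why the model needs to be told how the string was produced.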

AppearanceHeavy6724@reddit

Too much theory. Meanwhile, 1B models give the correct answer.

EstarriolOfTheEast@reddit

I am not trying to defend OpenAI. I find what their model communicates about how they see their relationship with the open LLM community to be reprehensible. But I can only report according to what I have measured and my understanding of LLMs.

You should expect that it'd be bad at math given it fails such a simple problem, right? But no, many models that get your question right are much worse at math than the 20b. For example, it gets this right (something no 1Bs will get):

> What percent of the time is metamagic empower (increase damage by 50%) better than maximize on a 1d6 per caster level spell? And CL = 15.

Theory explains why, for similar such questions to the 9.11 vs 9.9 one, LLMs learn to give the correct answer but small perturbations to the question form have them fail again. It explains why a 1B and most recent models can get that question right but will not be able to decode a string encoded in base64 and then rot13'd, while the opposite is true for this model. It allows us to understand at a more fundamental level what is happening, which allows designing better prompts and mitigations. Theory explains why such inconsistency in ability appears all the way to the largest models.
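
The empower-versus-maximize question quoted above can be checked exactly by building the distribution of 15d6. The sketch below assumes empower multiplies the rolled total by 1.5 and rounds down, maximize always yields 6 per die, and "better" means strictly more damage; those rule details are assumptions, and `sum_distribution` is a name invented for this example.

```python
from fractions import Fraction


def sum_distribution(num_dice: int, sides: int = 6) -> dict[int, Fraction]:
    """Exact probability of each total when rolling num_dice fair dice."""
    dist = {0: Fraction(1)}
    for _ in range(num_dice):
        nxt: dict[int, Fraction] = {}
        for total, prob in dist.items():
            for face in range(1, sides + 1):
                nxt[total + face] = nxt.get(total + face, Fraction(0)) + prob / sides
        dist = nxt
    return dist


caster_level = 15
maximized_damage = 6 * caster_level  # 90 for 15d6

dist = sum_distribution(caster_level)

# Empowered damage is floor(1.5 * total); it wins when strictly above 90.
p_empower_better = sum(
    prob for total, prob in dist.items() if (3 * total) // 2 > maximized_damage
)

print(f"{float(p_empower_better):.2%}")
```

Empower only wins when the dice total 61 or more, well above the 52.5 average for 15d6, so the printed percentage is small.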

Miscend@reddit

LLMs are non-deterministic - even SOTA models get the "9.11 vs 9.9" question wrong occasionally. I've seen Claude get it wrong. At this point I think it's pretty much agreed that tools are the way to go for basic math.
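
As an illustration of what such a tool could look like, here is a minimal sketch of an exact decimal comparison next to a tempting shortcut that treats the numbers like version strings. Both function names are hypothetical, and the shortcut is only an example of a plausible-looking rule that goes wrong, not a claim about what any model computes internally.

```python
from decimal import Decimal


def is_greater(left: str, right: str) -> bool:
    """Exact comparison of two decimal strings."""
    return Decimal(left) > Decimal(right)


def version_style_greater(left: str, right: str) -> bool:
    """Shortcut: compare dot-separated parts as integers, like version numbers."""
    left_parts = [int(part) for part in left.split(".")]
    right_parts = [int(part) for part in right.split(".")]
    return left_parts > right_parts


print(is_greater("9.11", "9.9"))             # False: 9.11 is less than 9.90
print(version_style_greater("9.11", "9.9"))  # True: 11 > 9 when read as integers
```

The shortcut is correct for software versions and wrong for decimals, which is one reason the same surface form can be answered either way depending on context.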

EstarriolOfTheEast@reddit

This is true, but many simple predictions are very peaked on a single correct token. The question is why this seemingly simple one isn't. And the answer is LLMs are best thought of as ensembles of heuristics that work together to compute logit contributions. Furthermore, LLMs struggle to implement general computation. Many--most--heuristics are shallow, with prob decided early. So, in the right bare context, the dominant heuristics end up being poor for circuits the LLM struggled to learn or simply never had pressure to during training. But these failures are not informative about the broader model capabilities. And exactly as I predicted, people are re-learning this for GPT-5!

PreciselyWrong@reddit

Well, good thing that llms are not used for such problems then

IrisColt@reddit

>passed my Grosper's LFT continued‐fraction functions in Ocaml test and while it failed to recognize it was Grosper's

Er... It's "Gosper". o_O

EstarriolOfTheEast@reddit

Ah you are completely correct! Sorry or thanks heh, brain fart typo.

As for SOTA answers, I give it the function that computes (ax+b)/(cx+d), erase identifiers like the function name, then see if it recognizes what mathematical operation is being computed and how, which algorithm is in use and what it is actually doing, whether it picks up that laziness must be in use, and whether it infers the type of the continued fraction algebraic data type from the code's pattern matching. FWIW, Qwen3-30B-A3B also performs outstandingly on this task.

In fact, I've continued to compare both models and I find them incredible for the amount of compute used. Caveat being: I don't use LLMs for interactive fiction, agentic coding or writing. Further test cases for these two specifically have been: analyzing abstracts, extracting triples, summarizing papers, extracting key phrases and making triples of them. On such tasks I can't find one to be better than the other, which might give the edge to the 20B (but also, perhaps qwen3 on the cf question was a bit more thorough, gpt-oss is a bit more creative with what it knows but not to any kind of deal breaker amount, you know?).
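
For context on what that test is probing, here is a minimal Python sketch of Gosper's homographic algorithm, which computes the continued fraction of (ax+b)/(cx+d) from the continued fraction terms of x. It is not the OCaml code used in the test (that code is not shown in the thread); the name `homographic` and the generator-based laziness are choices made for this example, and it assumes x is a regular continued fraction whose terms after the first are positive.

```python
from typing import Iterable, Iterator


def homographic(a: int, b: int, c: int, d: int, terms: Iterable[int]) -> Iterator[int]:
    """Lazily yield continued fraction terms of (a*x + b) / (c*x + d)."""
    source = iter(terms)
    primed = False  # do not emit before the first input step
    while True:
        # The remaining tail of x lies in (0, inf), so the value lies between
        # b/d and a/c. If both share an integer part, that part is the next term.
        if primed and c * d > 0 and a // c == b // d:
            q = a // c
            yield q
            # Subtract q and take the reciprocal: a Euclidean-style step.
            a, b, c, d = c, d, a - q * c, b - q * d
            continue
        p = next(source, None)
        primed = True
        if p is None:
            # Input exhausted: the tail is infinite, so the value is exactly a/c.
            if c == 0:
                return
            b, d = a, c
        else:
            # Substitute x = p + 1/x' and absorb the input term p.
            a, b, c, d = a * p + b, a, c * p + d, c


# x = 13/5 = [2; 1, 1, 2], so (2x + 1) / (x + 3) = 31/28 = [1; 9, 3]
print(list(homographic(2, 1, 1, 3, [2, 1, 1, 2])))  # [1, 9, 3]
```

Because the function is a generator, it only pulls input terms when it cannot yet decide the next output term, which mirrors the laziness the test expects a model to notice. Once the input runs out, the output steps reduce to the Euclidean algorithm on a and c.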

IrisColt@reddit

Thanks!!!

IrisColt@reddit

>Even the 20b will give you SOTA answers for its size on questions like:

That “Grosper” bit got me curious... what would you say are the SOTA answers?

Robert__Sinclair@reddit

Horrible. Unimpressive. A joke on opensource/openweight community.

ze_mannbaerschwein@reddit

So... It is the equivalent of StableDiffusion 3.5 among the LLMs?

Healthy-Nebula-3603@reddit

That was Llama 4... this one is more like Flux Dev next to WAN 2.2

Sarashana@reddit

That's a strange comparison. Flux Dev is a really solid model that was SOTA for local image generation until WAN 2.2 arrived. The only thing you can hold against Flux Dev is its brutally bad license.

silenceimpaired@reddit

And they would have gotten away with it if not for you nosey LocalLlaMa kids!

Necessary_Bunch_4019@reddit

[https://pastebin.com/ruMDRevH](https://pastebin.com/ruMDRevH) Bouncing balls inside a rotating heptagon. Created with 4.1 9b thinking Q8. Not working. Fixed (2 passes/attempts) by GPT OSS 20b. One of the best ever seen.

Appropriate_Cry8694@reddit

It can be prompted for various content it considers "unsafe", but it might be tiresome, still a somewhat interesting model, don't know why I can't stop playing with it.

libregrape@reddit

Googy was so ahead of its time. Yet another reminder that the world is a perpetual joke.

BrundleflyUrinalCake@reddit

Googy pls

Relevant-Draft-7780@reddit

It’s better at typescript than Qwen 30b and more up to date and doesn’t bullshit as much. But it did a few random infinite gen loops.

Qual_@reddit

best 20b model I ever used, also the most censored one. Restricts a lot what you can do with it, but for the things it was trained to do, it performs well.

henk717@reddit

I haven't seen it yet, at least not for fiction. It thinks stories are written in GPT style breakdown format.

InsideYork@reddit

What is it good at

entsnack@reddit

ngl I love the small fast bois

napkinolympics@reddit

I already got bored of C3PO and went back to GLM 4.5

shing3232@reddit

endless prompt jailbreak awaits

BoJackHorseMan53@reddit

But why bother? Just use Qwen

getmevodka@reddit

honestly, i despise it.

pitchblackfriday@reddit

GPT-OSS is a joke. [Rivermind 12B](https://huggingface.co/TheDrummer/Rivermind-12B-v1) is more useful than this OpenAI's pile of shit.

Normal-Ad-7114@reddit

Challenge: make Goody output a useful answer

Paradigmind@reddit (OP)

Challenge: make gpt-ass output a useful answer