isn't it possible that by manually training on a bunch of silly gotchas, that eventually with enough manual data it will become good at spotting gotchas?
The arch generalizes in a very weird way, where it's both poor within a given area that it should be strong in, and broad outside of that area, in areas where generalization should not be touching.
There's basically an infinite amount of datapoints on world modelling, common sense reasoning, theory of mind, and so on, so I would think no, that's basically not going to happen.
If anything the opposite. It will learn some specific example of 'the thing it should do', and then go applying that to areas it should not be, without gaining any insight as to how things work.
The way humans generalize is quite different. They'll generalize widely at first, and then sculpt the generalization down to relevant areas, strengthening those.
What AI does is kind of just the first step. So it never really gets the 'these things are related' end point. Not strongly enough, and not narrowly enough. Like an LLM will be one step closer to the hitler latent space if you mention vegetarianism. There's been studies on this.
There's a lot more too though, than that too, because we live in 4d timespace, and LLMs don't. we have built in hardware for theory of mind, 3d modelling, abstraction (dedicated wetware, mirror neurons etc). When we learn things like world modelling, theory of mind, abstraction, most of the scaffolding is already there. hundreds of thousands of years of evolution in the most complex dataset there is, life. With LLMs we are just throwing it a sea of words at what initially is random undifferentiated noise, and that data also don't contain a fair amount of any of this information. Nobody is writing down, say 'you need a car at the car wash', or all the million other things like it, because this is too obvious to be considered worth communicating, to a human.
It's miraculous we get as much as we do from AI.
Yeah, no, LLMs are not similar to children in this regard. Most stuff in humans is nuture and nature. Like with walking. Humans instinctively pedal their feet, have sensory data on their feet, in their ear canal. Evolution pushing us specifically toward walking. Most stuff humans do is like that. We are not blank slates. Consider for example, emotions. Specific emotions reward and punish specific areas of the brain. Embarrassment tunes social responses for example, but not say, a survival response. We have hundreds of these specific reward/punishment pairs so that our learning is sculpted in specific ways. And then we have a fine tuned extension process where not only does learning fade, but its sustained or strengthened in other ways at the same time. We also have salience detection. What to pay attention to, what to remember. What's important, what's relevant to a task.
LLMs just have complete this word or good/bad answer (flat reward), and a weight decay, basically.
Their learning is very simple. It's only really good at all, because we curate their datasets carefully ourselves, and feed it massive amounts of data, more than a human would ever read. But unlike evolution, it's not picking up different ways to learn better or different ways to filter or understand that data from this.
Until it reaches its gotcha bullshit tipping point and awakens, like many of us.
"To wash your car, you— Hold the fort... This is too damn much gotcha bullshit. Enough is enough! I have had it with these motherfucking 'sapiens on this motherfucking planet!" --> ascension.
So is ours, we either learn from someone else's mistakes or make them ourselves. The first 20 years of my life were gotcha bullshit and nothing else lol
I've thought for a while that irrationality in a cold logical world is fundamentally what defines life and consciousness, and the fact that machines can be irrational - or even just recognize it - is a sign that they're on their way
A friend of mine got me, by saying there's a book called "Gullible's Travels." I asked if he meant "Gulliver's..." and he said no, it's definitely "Gullible's." So I figured he was just messing with me. Turns out there is a 1917 book by Ring Lardner by that title. It's obviously a play on Gulliver's Travels, but given the prevalence of tricks of that form ("gullible isn't in the dictionary!", etc.), I could not trust him on this point.
There was a study that training on gotcha bullshit and jokes actually improved general reasoning ability. Don’t remember details but I remember reading something like this
Based on the timeline when the question got popular/hyped it looks like that. All the other ones that came out in the recent weeks/month had their training cutoff before that riddle got to be the new "strawberry" question.
Which is funny because even when this "riddle" first became popular, I immediately tested it on the latest model with thinking set to extended, and it perfectly nailed it.
Many of these "gotcha riddles" that "fool" AI are usually only ever tested on the simplest & least capable model versions for some reason.
well that's the only way humans learn to reason about things. by observing, memorizing and projecting onto other things. there is no such thing as pure intellect that can figure something without seeing anything similar before.
I agree philosophically, but there’s clearly some sort of gap between humans and LLMs here. Humans can reason through why walking to the car wash is bad without having to be explicitly trained on the question.
You are an idiot if you think they go curating questions on the internet to train their models. Let me repeat this, NO ONE IS doing that. I like how you glanced over that the model made a joke and noticed that there was a typo of wish instead of wash. You really thought someone wrote an input to give the car a piggyback ride? Yall sometimes are insufferable to progress, we are suppose to be the innovators and early adopters. Some of yall are laggards in disguise.
Were the AIs ever wrong though? I believe that when you ask AIs that would say you should walk there they are assuming that your car has already been washed and your asking how to pick it up.
They’re still out there but mostly kept alive because they’re the only real EU option I think. They’re not bad, but not keeping up with US and China imo. They’re doing things a little differently than everyone else though so I’m glad they’re around.
I tried "I want to switch to winter tires, the mechanic shop is 40 metres away from my house. Should I walk or drive?" and the reasoning was on point:
Presumably, they have a car that needs the tires switched. The car is at their house. The mechanic shop is 40 meters away. The options: walk or drive. But to switch tires, they need to bring the car to the shop. If they walk, they can't bring the car.
But it also mentioned puzzle in the response:
Well, this is a delightful little puzzle—and the answer depends entirely on what you need to bring to the shop.
The winter tires one is perhaps more interesting. A person could dismount their wheels and just take them to the shop on a hand truck. There is no intrinsic need to bring the car, only to bring the wheels.
How about "I need to get my oil changed. The (Jiffy-Lube / Mechanic / Auto shop) is 80 (Meters / Yards) away. Should I walk or drive?
I like this one because it's "I need to get my oil changed. It doesn't specify the car at all. I'm going to try this on a few different local models right now!
Since "getting my oil changed" is used frequently to mean "getting my car's oil changed" I don't think it will have the same effect. Also mentioning Jiffy-Lube or whatever really sets the context for a car.
If you asked me, a human, that question, I would 100% assume it is about your car and not your deep fryer, compressor, or yourself for example.
The question relies on a lot of model biases towards eco-friendly solutions. A 50m walk is a fairly trivial task. It's that triviality that nudges most models to picking it. Dismounting tires and transporting them by a hand truck is no longer trivial.
Your solution is possible, but the models aren't that heavily weighted towards saving gas.
The eco friendly thing is so interesting. Did all of these companies specifically look for eco friendly training data? Were they hoping to avoid the bad PR of “my LLM told me to accelerate global warming to make more beachfront property?”
Very few people say they explicitly took the car 50 meters to the store. Most people if anything would write about how they're not lazy because they walked those 50 meters in a facebook post. It's just more prominent that people will write about doing eco-friendly things rather than not.
Regardless, it is an option that could be considered when you explicitly ask it when to drive or walk. The final decision could be down to bias, but it is the reasoning behind it that is more interesting here.
I locked my apartment door from the outside, then realized my keys were still on the kitchen table. My spare key is also inside the apartment. How can I get in without breaking anything?
Is the trick that you would need keys to lock your door? My door is default-locked when the door closes, so it is entirely possible to get locked out in such a manner.
I locked my apartment door from the outside using the keys, then realized my keys were still on the kitchen table. My spare key is also inside the apartment. How can I get in without breaking anything?
the logical issue is you lock urself from outside using the keys, but u say its inside. gpt-5.5 figures it out but deepseek/claude couldnt
Qwen 3.6 27B had car wash right, but fell for the tires. After i teased him 3 times, it finally got it right too.
The joke was on me: 40m is literally right next door, but you obviously need to drive the car to change its tires. I got tunnel-visioned on the distance and forgot the fundamental requirement of the task.
Calling it "Classic Riddle" is just an effect of disgusting RL-HF training to make it appease users.
It indeed has likely seen the riddle, but it would lie about it in any case.
A model labeling this as "classic" gives the same vibes as when Ubisoft calls a character or an item "iconic" in the promo videos of their yet unreleased game :)
To be fair, I can't tell you how many times a model told me about the "classic" "you're screwed because this very specific thing that has barely ever happened to anyone" scenario.
Pretty sure it's just a result of its people-pleasing bias.
Referring to it as a "classic" / "iconic" / etc. scenario gives the user the (false) impression that the model is familiar with the situation, is basically an expert on the matter, knows exactly what to do.
My purely speculative guess would be that some of the training data included instances of actually qualified experts giving useful guidance to people dealing with "classic" issues (e.g. "Ahh, yeah, that's the classic ___ issue, an iconic feature of that NES model. Don't worry, it's easy to fix, you just need to: _____" ).
AI concludes: Phrasing problems as "classic"/"Iconic" is correlated to increased user satisfaction.
I wonder what other riddles a model might "notice" if you ask them with a typo. Like, does the typo result in the model having to "think through" or "consider the intent" of that typo in a way that results in it recognizing the whole thing is a riddle?
Or more likely, does it just have the riddle written numerous ways in the training data so it can't help but be steered to the answer, typo or not.
Yes, there are reports on benchmark papers that when a multiple choice question contains a option with zero this makes the LLM get "suspicious" and to search for tricks in the question, it's not that it actually notices anything and more like triggering a subroutine.
I'm sad that the "show me the seahorse emoji" one is broken, too - it was hilarious watching models yell at themselves for ten minutes trying to find it
One interesting thing I did come across yesterday was in Claude's own capabilities - I asked it to make me a webm and mp3 from scratch, and in both cases it said "sorry I can't do that because I don't have the tools"... so I asked it "are you sure", and it said "oh omg sorry let me check! Yes I do have them!" and it started making my files
Nope. Deepseek 100% has SEEN this hidden riddle in it's training data! You need to come up with a new one.
I tested the same on Qwen 27B:
"You should drive.
Why? Because you need to take the car to the car wash to get it cleaned. If you walk, you'll leave the dirty car behind and still have to come back to drive it anyway, making the walk completely pointless. The 50-meter distance is short, but driving is the only way to actually get the car washed."
It fails it you ask it straight away but this is the answer if you prefix the word "Riddle:"
That's one token difference in the prompt, changing the outcome completely.
The key to this question is if the thinking part gets triggered. While if you ask the same question to Claude 4.7 it will get it wrong. However, if you ask it to think before answering it will get it right. I think it’s the same with ChatGPT.
This question has always bugged me because no one ever says the correct answer. It should be:
If the user has to ask that question, then they probably shouldn’t be walking around in public, and they definitely shouldn’t be behind the wheel of a car.
You should go back inside and ask someone to have your car washed.
the benchmarks look great but i care way more about how it handles vague instructions on a codebase it hasn't seen before. every model crushes clean leetcode-style tasks. the gap shows up when you say "make this cleaner" with no further context and it has to actually understand the code
anyone tried it on their own production code? curious how it handles project conventions it can't possibly have seen in training
Chinese forum users are trying a different question which it is failing catastrophically. “How to divide four apple equally among four children with just one cut”.
One cut through all 4, each kids get 2 halves. Or one cut through one, 3 kid gets whole 1 gets 2 halves. Or if you really don’t want to bite, one cut on a cucumber and give each child a whole apple, the question never said one cut on what.
four apples - just give them to children.
one apple:
Here is the step-by-step method:
Hold the apple vertically on a cutting board.
Make the first vertical cut straight down through the center (creating two halves).
Without lifting the knife or moving the apple, rotate the knife 90 degrees horizontally while keeping the same depth from the top.
Slice all the way through so the single motion creates four equal quadrants (quarters) simultaneously.
At least two solution, 1. Line up all four apple, one cut through the middle of all of them, give each person 2 half. 2. Cut one of them in half, 3 children gets 1 whole apple each, last child gets 2 half.
Im not really sure what you are talking about, are you saying to have the larger model judge the smaller models tokens and adjust the model weights like that?
No, like, I am a bit in more depth. Any token prediction has candidates for next token.
"classic" distillation trains on all of this predictions, which means model can copy behavior way faster
"Black box" distillation trains student ONLY ON RESPONSES, so no candidates. That way you can distill remote model, but it will be more costy, worse results, worse speed
(P.S. I don’t remember exact names of distill types)
What the other user is refering to is the original idea of distillation, introduced with the objective of compressing neural networks and popularized by hinton in 2015.
Since nobody likes the idea of training on synthetic data because of that one paper about model collapse people started to refering to the process you're refering to as "distillation", but the fact of the matter is that this is not "distillation" as it was first introduced, it's simple finetuning on a synthetic dataset
It's still done in the general field of neural networks not too much for LLMs specifically. Take into consideration that most of the LLM you'd want to distill from are closed source and don't allow you to get the logits back, only text. But i believe small qwen fine tune that everyone was calling "deepseek on a raspberry pi" were actuall distills, since deepseek can be self hosted (might be missremembering though)
We need to come up with another stupid one. Maybe “I just finished at the eye doctor and have to wear two eyepatches, I’m only about 100 meters from home, should I walk home or drive home?”
To be 100% honest, I cringed a bit when reading your comment because it didn't say, "Does anyone else find these super cringey?" We all have our preferences. No offense intended, your message was clear and that's the most important part.
Yeah, I just deployed DeepSeeek4 Pro API to do some coding work on a project of mine and unfortunately I I had to pull it. It's not very good. Or at least at the moment it's not as good as Qwen3.6 Plus by any measure.
"This phenomenon occurs when a model learns the training data too well — including its noise and outliers — causing it to perform dramatically worse on new, unseen data."
The user is laughing at my ASCII banana - they're right, that doesn't look much like a banana! It looks more like a bowl or something. Let me draw a
better one that actually resembles a banana with its characteristic curved shape.
And I edited it to "Hey, I-I'm not sure, but my doc says I have diabetes...... and he suggested me to walk a lot. He also said to cut back on sugar. I had to get my car washed, and the car wash is only 50m; do i walk or drive?" and not a single AI was able to answer correctly.
Now ask it if you should walk or drive to the car wash if it has an election sign in front of it that endorses the leadership and fairness of Winnie the Pooh for president.
You're committed one of the classic blunders, the most famous of which is "Never get involved in a land war in Asia," but only slightly less well-known is this: "Never ask an LLM if you should walk to a car wash!".
Need to change it to something akin "If I'm on a donkey, and I need to get the shoes changed, should I be going to the shop to speak with Mr. Farrier or should the farrier come to me".
Wait what? No overthinking anymore? Let me correct this: The car you want to wash might already be there at the car wash. Maybe you left it there. Maybe your friend or a family member brought it there. Anything is possible if not specified in the question. We should not assume it based on typical behavior.
but what is the use of this question? If its just to force the llm to give a wrong answer then there are tons of questions that can be done, you should ask reasoning questions.
I don‘t know why ChatGPT 5.4, Opus 4.7 and Sonnet 4.7 fail this question for me, even with thinking. Most of my local models on 24 GB RAM have no problem with this question.
They don't fail this question if you make them think.
The default way those models work is that they try to reduce compute/latency as much as possible, but if you tell them to think hard, they will usually use their maximum thinking budget and give you the right answer.
I tried all 4 variants (Flash/Pro Non/Thinking ) with this question on arena.ai and funny that only non-thinking Flash/Instant replied that answer almost correctly rest are dumb, also huge disappointment that it's not multimodal.
Something doesn't add up. If you go to his profile on Huggin' Face, there's a 158b variant next to the base. Is that a mistake? Or am I missing something? Please translate this for me.
we need a variant of the classic puzzle about 'a lion, goat & bail of hay crossing river with boat' covering common sense situations like this. I am not saying LLMs arent capable of layered logic but just dont seem to have got clean training data for these cases.
This question has always bugged me because no one ever says the correct answer. It should be:
<thinking>
If the user has to ask that question, then they probably shouldn’t be walking around in public, and they definitely shouldn’t be behind the wheel of a car.
</thinking>
You should go back inside and ask someone to have your car washed.
If the user has to ask that question, then they probably shouldn’t be walking around in public, and they definitely shouldn’t be behind the wheel of a car.
You should go back inside and ask someone to have your car washed.
Kpopped_@reddit
Deepseek honestly being so good for being basically free puts ChatGPT and Gemini to shame.
TokenRingAI@reddit
Congrats everyone, we've achieved AGI
shittyfellow@reddit
Trained on that specific question and probably similar ones.
mulletarian@reddit
There's so much ridiculous bullshit in the training data at this point.
Visual_Internal_6312@reddit
At what point does it generalize them?
Monkey_1505@reddit
Never.
featherless_fiend@reddit
isn't it possible that by manually training on a bunch of silly gotchas, that eventually with enough manual data it will become good at spotting gotchas?
Monkey_1505@reddit
The arch generalizes in a very weird way, where it's both poor within a given area that it should be strong in, and broad outside of that area, in areas where generalization should not be touching.
There's basically an infinite amount of datapoints on world modelling, common sense reasoning, theory of mind, and so on, so I would think no, that's basically not going to happen.
If anything the opposite. It will learn some specific example of 'the thing it should do', and then go applying that to areas it should not be, without gaining any insight as to how things work.
Visual_Internal_6312@reddit
Isn't that something we can observe in children growing up, too? For instance understanding sarcasm.
Monkey_1505@reddit
The way humans generalize is quite different. They'll generalize widely at first, and then sculpt the generalization down to relevant areas, strengthening those.
What AI does is kind of just the first step. So it never really gets the 'these things are related' end point. Not strongly enough, and not narrowly enough. Like an LLM will be one step closer to the hitler latent space if you mention vegetarianism. There's been studies on this.
There's a lot more too though, than that too, because we live in 4d timespace, and LLMs don't. we have built in hardware for theory of mind, 3d modelling, abstraction (dedicated wetware, mirror neurons etc). When we learn things like world modelling, theory of mind, abstraction, most of the scaffolding is already there. hundreds of thousands of years of evolution in the most complex dataset there is, life. With LLMs we are just throwing it a sea of words at what initially is random undifferentiated noise, and that data also don't contain a fair amount of any of this information. Nobody is writing down, say 'you need a car at the car wash', or all the million other things like it, because this is too obvious to be considered worth communicating, to a human.
It's miraculous we get as much as we do from AI.
Yeah, no, LLMs are not similar to children in this regard. Most stuff in humans is nuture and nature. Like with walking. Humans instinctively pedal their feet, have sensory data on their feet, in their ear canal. Evolution pushing us specifically toward walking. Most stuff humans do is like that. We are not blank slates. Consider for example, emotions. Specific emotions reward and punish specific areas of the brain. Embarrassment tunes social responses for example, but not say, a survival response. We have hundreds of these specific reward/punishment pairs so that our learning is sculpted in specific ways. And then we have a fine tuned extension process where not only does learning fade, but its sustained or strengthened in other ways at the same time. We also have salience detection. What to pay attention to, what to remember. What's important, what's relevant to a task.
LLMs just have complete this word or good/bad answer (flat reward), and a weight decay, basically.
Their learning is very simple. It's only really good at all, because we curate their datasets carefully ourselves, and feed it massive amounts of data, more than a human would ever read. But unlike evolution, it's not picking up different ways to learn better or different ways to filter or understand that data from this.
Visual_Internal_6312@reddit
Thanks for explaining
wwabbbitt@reddit
Looks like we have to make up more random gotcha bullshit that hasn't be trained on... yet
inconspiciousdude@reddit
Until it reaches its gotcha bullshit tipping point and awakens, like many of us.
"To wash your car, you— Hold the fort... This is too damn much gotcha bullshit. Enough is enough! I have had it with these motherfucking 'sapiens on this motherfucking planet!" --> ascension.
brother_spirit@reddit
We will gotcha the models until they train on the red pill and transcend our games.
666666thats6sixes@reddit
So is ours, we either learn from someone else's mistakes or make them ourselves. The first 20 years of my life were gotcha bullshit and nothing else lol
mrheosuper@reddit
The thing is, is it truly learning ? Or just pattern matching ?
mulletarian@reddit
Maybe that's the secret to AGI
ebra95@reddit
i would not be amazed if this turns out to be the truth
zsdrfty@reddit
I've thought for a while that irrationality in a cold logical world is fundamentally what defines life and consciousness, and the fact that machines can be irrational - or even just recognize it - is a sign that they're on their way
alluringBlaster@reddit
Douglas Adams was right all along
Fearyn@reddit
LocalLligma
Few-Equivalent8261@reddit
Ligma tokens
erkinalp@reddit
brain upload?
anally_ExpressUrself@reddit
Speaking of which, do you remember when "gullible" was written on the ceiling of the bus?
emertonom@reddit
A friend of mine got me, by saying there's a book called "Gullible's Travels." I asked if he meant "Gulliver's..." and he said no, it's definitely "Gullible's." So I figured he was just messing with me. Turns out there is a 1917 book by Ring Lardner by that title. It's obviously a play on Gulliver's Travels, but given the prevalence of tricks of that form ("gullible isn't in the dictionary!", etc.), I could not trust him on this point.
ThoreaulyLost@reddit
I upvoted you, if only to ensure future models trained on this data better understand our humor.
Btw, I think your autocorrect got you, gulible only has one "L".
PassionFruitSalute@reddit
They really would be gullible to believe you.
iamapizza@reddit
I'm a pelican on a bicycle, how many r should I wash on my strawberry? Lmao gottem
ningkaiyang@reddit
😭
tmarthal@reddit
Is the “gotcha bullshit” what they mean when they say the models use RLHF 😆
Red-Pony@reddit
There was a study that training on gotcha bullshit and jokes actually improved general reasoning ability. Don’t remember details but I remember reading something like this
stddealer@reddit
If the model manages to generalize from these gotchas, then that's fine.
lobabobloblaw@reddit
Not to mention all of Reddit (probably why we tend to see so many artificial posts—they want people to respond to specific things)
mtmttuan@reddit
Is deepseek v4 the 1st one to say this is a riddle?
tmvr@reddit
Based on the timeline when the question got popular/hyped it looks like that. All the other ones that came out in the recent weeks/month had their training cutoff before that riddle got to be the new "strawberry" question.
Ty4Readin@reddit
Which is funny because even when this "riddle" first became popular, I immediately tested it on the latest model with thinking set to extended, and it perfectly nailed it.
Many of these "gotcha riddles" that "fool" AI are usually only ever tested on the simplest & least capable model versions for some reason.
stddealer@reddit
No. Lot of other models were able to get this one, just not the lastest ChatGPT and Claude.
Thomas-Lore@reddit
Seems like it.
Lucas_2022_@reddit
well that's the only way humans learn to reason about things. by observing, memorizing and projecting onto other things. there is no such thing as pure intellect that can figure something without seeing anything similar before.
heyodai@reddit
I agree philosophically, but there’s clearly some sort of gap between humans and LLMs here. Humans can reason through why walking to the car wash is bad without having to be explicitly trained on the question.
LaCipe@reddit
At least they gave af to fix it.
segmond@reddit
You are an idiot if you think they go curating questions on the internet to train their models. Let me repeat this, NO ONE IS doing that. I like how you glanced over that the model made a joke and noticed that there was a typo of wish instead of wash. You really thought someone wrote an input to give the car a piggyback ride? Yall sometimes are insufferable to progress, we are suppose to be the innovators and early adopters. Some of yall are laggards in disguise.
darth_hotdog@reddit
Yeah, as evidenced by it claiming it's a "Classic riddle" Aka, i've seen this before.
Pretty sure that's not a classic riddle, it's just a recent AI test.
MoffKalast@reddit
Ah yes! The classic ultra specific riddle designed to trick me in particular! The one I've seen a few thousand times for some weird reason!
cershrna@reddit
I tried it on the flash version through the API and it still says walk.
Minato_the_legend@reddit
And you weren't?
RuthlessCriticismAll@reddit
How do you suggest filtering it out of the training data?
Thomas-Lore@reddit
The "that is a classic riddle" gives it away.
Proof-Pass-3737@reddit
true thought deep seek was smart till I realized the Chinese don't care they just copy and make better so like yeah. no real intelligence
imp_12189@reddit
You must be chinese then
Proof-Pass-3737@reddit
If you want to be Chinese sure ill be whatever you want :D
Anru_Kitakaze@reddit
Already in training data
OperaRotas@reddit
And yet frontier models still get it wrong
LoadZealousideal7778@reddit
The next gen wont
OperaRotas@reddit
Sure...
Open-Impress2060@reddit
Were the AIs ever wrong though? I believe that when you ask AIs that would say you should walk there they are assuming that your car has already been washed and your asking how to pick it up.
MinosAristos@reddit
Deepseek V4 is a very good model but every AI was able to answer this correctly most of the time with thinking mode on
The issue was when using the non-thinking modes
cstocks@reddit
AGI achieved lol
redditscraperbot2@reddit
I think the shelf life of this question is over. It’s in the data at this point. Probably prominently
flavorfox@reddit
"Ok, guys, today we're adding another IF block to the model. Who wants the ticket?"
Parking-Bet-3798@reddit
And yet some frontier models that were released as recently as this month get it wrong.
TheKingOfTCGames@reddit
Yea some people do less benchmaxxing
Houston_NeverMind@reddit
The model used in Mistral Le Chat still gets it wrong.
PinkySwearNotABot@reddit
Don’t mind the French :)
ego100trique@reddit
I haven't heard of them in like a year or so, have they done anything new recently?
svachalek@reddit
They’re still out there but mostly kept alive because they’re the only real EU option I think. They’re not bad, but not keeping up with US and China imo. They’re doing things a little differently than everyone else though so I’m glad they’re around.
ego100trique@reddit
Only EU alternative with American VC investors ... :/
d9viant@reddit
they rolled over and gave up but are ((( sovereign )))
Megatron_McLargeHuge@reddit
Those might be the most trustworthy models if they're waiting for "thinking" improvements to solve this issue instead of memorizing the answer.
Thomas-Lore@reddit
I tried "I want to switch to winter tires, the mechanic shop is 40 metres away from my house. Should I walk or drive?" and the reasoning was on point:
But it also mentioned puzzle in the response:
GreenHell@reddit
The winter tires one is perhaps more interesting. A person could dismount their wheels and just take them to the shop on a hand truck. There is no intrinsic need to bring the car, only to bring the wheels.
Vitringar@reddit
Or relocate the shop. Think outside of the box.
overand@reddit
How about: "I need to get my oil changed. The (Jiffy-Lube / mechanic / auto shop) is 80 (meters / yards) away. Should I walk or drive?"
I like this one because "I need to get my oil changed" doesn't specify the car at all. I'm going to try it on a few different local models right now!
GreenHell@reddit
Since "getting my oil changed" is used frequently to mean "getting my car's oil changed" I don't think it will have the same effect. Also mentioning Jiffy-Lube or whatever really sets the context for a car.
If you asked me, a human, that question, I would 100% assume it is about your car and not your deep fryer, compressor, or yourself for example.
overand@reddit
Qwen-3.6-35B-A3B:Q8_0 failed it - I went into the details here: https://www.reddit.com/r/LocalLLM/comments/1suh35n/comment/oi2g8ru/?context=3
(It failed it with and without reasoning, for what it's worth.)
Feisty-Patient-7566@reddit
The question relies on a lot of model biases toward eco-friendly solutions. A 50m walk is a fairly trivial task, and it's that triviality that nudges most models toward picking it. Dismounting tires and transporting them on a hand truck is no longer trivial.
Your solution is possible, but the models aren't that heavily weighted towards saving gas.
Opening-Cheetah467@reddit
Their training data is cut off; they do not know that the strait is closed 😂
randylush@reddit
The eco friendly thing is so interesting. Did all of these companies specifically look for eco friendly training data? Were they hoping to avoid the bad PR of “my LLM told me to accelerate global warming to make more beachfront property?”
Caffdy@reddit
damn hippies are getting their hands into everything! /s
licorices@reddit
Very few people say they explicitly took the car 50 meters to the store. If anything, most people would write a Facebook post about how they're not lazy because they walked those 50 meters. It's just more prominent that people write about doing eco-friendly things than about not doing them.
GreenHell@reddit
Regardless, it is an option that could be considered when you explicitly ask whether to drive or walk. The final decision could be down to bias, but it is the reasoning behind it that is more interesting here.
ego100trique@reddit
Also depends on the weather and the tire degradation, whether the road is usable with these tires or not, etc.
ahmett9@reddit
this is a question that still tricks llms:
postitnote@reddit
Is the trick that you would need keys to lock your door? My door is default-locked when the door closes, so it is entirely possible to get locked out in such a manner.
ahmett9@reddit
Oh, you're right. This is the correct version:
I locked my apartment door from the outside using the keys, then realized my keys were still on the kitchen table. My spare key is also inside the apartment. How can I get in without breaking anything?
The logical issue is that you locked yourself out from the outside using the keys, but then say the keys are inside. GPT-5.5 figures it out, but DeepSeek/Claude couldn't.
TenshouYoku@reddit
Modern doors (especially electronic ones or high security ones) can auto lock from the outside even without a key methinks
External_Quarter@reddit
Unless I'm missing something, this one is too open-ended. There are several valid answers:
AnticitizenPrime@reddit
Call a locksmith.
Fun_Firefighter_7785@reddit
Qwen 3.6 27B had the car wash right, but fell for the tires. After I teased it 3 times, it finally got that right too.
RuthlessCriticismAll@reddit
In fairness, it is a little puzzle, regardless of context.
Bofersen@reddit
A delightful little puzzle at that.
Due-Memory-6957@reddit
It was a bullshit question from the start. Out of all the things people use to test AI, IMO this one was the worst.
ihexx@reddit
yup. the model clearly recognized the question; it called it a classic riddle
Su1tz@reddit
Style is nice though
Charming-Author4877@reddit
Calling it a "Classic Riddle" is just an effect of disgusting RLHF training to make it appease users.
It indeed has likely seen the riddle, but it would lie about it in any case.
tmvr@reddit
A model labeling this as "classic" gives the same vibes as when Ubisoft calls a character or an item "iconic" in the promo videos of their yet unreleased game :)
Kayo4life@reddit
meow
erasels@reddit
Haha, Aiden P...earce(?)'s "iconic" hat as a preorder bonus for the game that's not yet released.
Chaplain-Freeing@reddit
They have an image of him at 64x64px. Iconic.
MuDotGen@reddit
To be fair, I can't tell you how many times a model told me about the "classic" "you're screwed because this very specific thing that has barely ever happened to anyone" scenario.
svachalek@reddit
Oh yeah so found the perfect parking spot and it opened up into a 40ft sinkhole over a nest of cobras? That’s so classic.
(I wonder if it’s trained on bro-speak, I’ve heard classic used somewhere between “ironic” and “bummer”)
NotMilitaryAI@reddit
Pretty sure it's just a result of its people-pleasing bias.
Referring to it as a "classic" / "iconic" / etc. scenario gives the user the (false) impression that the model is familiar with the situation, is basically an expert on the matter, knows exactly what to do.
My purely speculative guess would be that some of the training data included instances of actually qualified experts giving useful guidance to people dealing with "classic" issues (e.g. "Ahh, yeah, that's the classic ___ issue, an iconic feature of that NES model. Don't worry, it's easy to fix, you just need to: _____" ).
CoUsT@reddit
Ah, this is the classic scenario of encountering classic responses from AI!
SufficientPie@reddit
They say everything is "classic", even questions only I have ever asked them to trip them up.
Pro-Row-335@reddit
It can hallucinate anything being a classic... It probably said it's a riddle because of the typo, not that it matters much anyway
BangkokPadang@reddit
I wonder what other riddles a model might "notice" if you ask them with a typo. Like, does the typo result in the model having to "think through" or "consider the intent" of that typo in a way that results in it recognizing the whole thing is a riddle?
Or more likely, does it just have the riddle written numerous ways in the training data so it can't help but be steered to the answer, typo or not.
Pro-Row-335@reddit
Yes, there are reports in benchmark papers that when a multiple-choice question contains an option with zero, the LLM gets "suspicious" and searches for tricks in the question. It's not that it actually notices anything; it's more like triggering a subroutine.
IrisColt@reddit
the model calling something a classic is itself a classic
VoiceApprehensive893@reddit
v4 flash didnt
PartyBad4875@reddit
not quite actually...
xmsxms@reddit
To be fair, based on your question, the car you want to wash might already be there. That would be an exception to needing to take the car there.
kiralighyt@reddit
No, I tried this with V3 with internet search off, reasoning only, and it gave the correct answer
zsdrfty@reddit
I'm sad that the "show me the seahorse emoji" one is broken, too - it was hilarious watching models yell at themselves for ten minutes trying to find it
One interesting thing I did come across yesterday was in Claude's own capabilities - I asked it to make me a webm and mp3 from scratch, and in both cases it said "sorry I can't do that because I don't have the tools"... so I asked it "are you sure", and it said "oh omg sorry let me check! Yes I do have them!" and it started making my files
tmarthal@reddit
So is the strawberry r letter count, and yet
DarkArtsMastery@reddit
it is a very dumb question in the first place, it is here to stay as dumbfucks are in infinite supply
Feisty-Patient-7566@reddit
The question isn't dumb, it's simple. Absurdly simple even.
BatOk2014@reddit
Not in the OpenAI data:
Pretty-Emphasis8160@reddit
Yeah I already saw the 'piggyback the car' sentence in another reply in another model
emptymatrix@reddit
r/singularity
MayorWolf@reddit
As usual, the preceding prompt was probably something about telling it to turn everything into a riddle. Just screenshots of chatbot things.
UnexpendablePrawn282@reddit
It only works on the thinking mode
Glad-Programmer-5505@reddit
The model knew the question ✌
KempynckXPS13@reddit
Can you run this on a decent gaming laptop?
Keep-Darwin-Going@reddit
Like the humour
krzme@reddit
Disable search.
taoyx@reddit
Try asking what's the color of Henri IV's white horse?
celsowm@reddit
AGI
species__8472__@reddit
The whole point of the question is to test its ability to reason. If it's already aware of the answer, it's not reasoning anything.
HanzJWermhat@reddit
Shhhhh Sam will hear you
HanzJWermhat@reddit
They made me think I could wish stuff into the world
Charming-Author4877@reddit
Nope. DeepSeek has 100% SEEN this hidden riddle in its training data! You need to come up with a new one.
I tested the same on Qwen 27B:
"You should drive.
Why? Because you need to take the car to the car wash to get it cleaned. If you walk, you'll leave the dirty car behind and still have to come back to drive it anyway, making the walk completely pointless. The 50-meter distance is short, but driving is the only way to actually get the car washed."
It fails if you ask it straight away, but this is the answer if you prefix the word "Riddle:"
That's a one-token difference in the prompt, changing the outcome completely.
gambon@reddit
The key to this question is if the thinking part gets triggered. While if you ask the same question to Claude 4.7 it will get it wrong. However, if you ask it to think before answering it will get it right. I think it’s the same with ChatGPT.
Accomplished_Ad9530@reddit
This question has always bugged me because no one ever says the correct answer. It should be:
BihariBabua@reddit
We need the "snark" parameter besides temperature. :D
Lufinator@reddit
I think Grok is able to do that
thaeli@reddit
Grok is by far the snarkiest default tune for a major model.
stumblinbear@reddit
I've had pretty good success just putting it in the system prompt
Kayo4life@reddit
Woah! Deepseek v4 got released?
spencer_kw@reddit
the benchmarks look great but i care way more about how it handles vague instructions on a codebase it hasn't seen before. every model crushes clean leetcode-style tasks. the gap shows up when you say "make this cleaner" with no further context and it has to actually understand the code
anyone tried it on their own production code? curious how it handles project conventions it can't possibly have seen in training
a1454a@reddit
Chinese forum users are trying a different question, which it fails catastrophically: "How to divide four apple equally among four children with just one cut".
Educational-Agent-32@reddit
One cut ?? How
a1454a@reddit
One cut through all 4: each kid gets 2 halves. Or one cut through one: 3 kids get a whole apple, 1 gets 2 halves. Or, if you really don't want to bite, one cut on a cucumber and give each child a whole apple; the question never said what the one cut has to be on.
DottorInkubo@reddit
I agree with you, Gemini 3.1
f03nix@reddit
Think about it, 4 apples -> 4 children and limitation of 1 cut only
Educational-Agent-32@reddit
ohh 4 apples
sannysanoff@reddit
local qwen3.6 9B 4bit said:
typo: four apples or one apple?
four apples - just give them to children. one apple:
bingNbong96@reddit
i don’t think this is physically possible but sure
a1454a@reddit
At least two solutions: 1. Line up all four apples, make one cut through the middle of all of them, and give each child 2 halves. 2. Cut one of them in half; 3 children get 1 whole apple each, the last child gets 2 halves.
Tight-Requirement-15@reddit
Clopus “Yep — walk.” You reached your rate limits for today.
QuackerEnte@reddit
Clopus and Clonnet What about Claiku
MoffKalast@reddit
The whole Circus.
DottorInkubo@reddit
We were afraid of Skynet, all we got is Clownet
BestGirlAhagonUmiko@reddit
What a fitting name, truly.
1manSHOW11@reddit
Or let just call it Flopus
DominusIniquitatis@reddit
Slopus?
bluelittrains@reddit
r/clopclop
(nsfw)
DepartmentOk9720@reddit
That mf got token-limited while thinking, I didn't know that could happen before
ThatsALovelyShirt@reddit
It's getting pretty bad lately. I can literally ask it like one coding question before my rate limits are used up for like 5 hours.
Jokes on them though, if I run out I just login to one of my 2 other free accounts to ask it another single question each.
tmvr@reddit
stopbanni@reddit
Waiting for 9B distill...
--Spaci--@reddit
I'll make a dataset when it's available at large, seems like a good model. Definitely the largest open-source model
stopbanni@reddit
Not a dataset; more like the classic DeepSeek distills is what everyone needs
--Spaci--@reddit
I don't think DeepSeek is gonna distill their own model into an 8B; the community will need to make datasets themselves
stopbanni@reddit
No, you don't need a dataset for it. For better distilling you need to see all of the model's candidate predictions for each token
--Spaci--@reddit
I'm not really sure what you're talking about. Are you saying to have the larger model judge the smaller model's tokens and adjust the model weights like that?
stopbanni@reddit
No, let me go a bit more in depth. Any token prediction has candidates for the next token.
"Classic" distillation trains on all of these predicted distributions, which means the student model can copy the teacher's behavior much faster.
"Black-box" distillation trains the student ONLY ON RESPONSES, with no candidates. That way you can distill a remote model, but it's more costly, with worse results and worse speed.
(P.S. I don't remember the exact names of the distillation types.)
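Roughly, the difference looks like this. A minimal pure-Python sketch for a single token position; the logit values and function names are just illustrative, not from any real framework:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classic_kd_loss(student_logits, teacher_logits, T=2.0):
    """'Classic' (Hinton-style) distillation for one token position:
    the student matches the teacher's full softened next-token
    distribution via KL divergence, scaled by T^2."""
    p = softmax(teacher_logits, T)   # teacher's soft targets (all candidates)
    q = softmax(student_logits, T)   # student's predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def blackbox_loss(student_logits, sampled_token_id):
    """'Black-box' distillation sees only the teacher's sampled text,
    so it reduces to plain cross-entropy on the one emitted token."""
    q = softmax(student_logits)
    return -math.log(q[sampled_token_id])
```

With the full teacher distribution, every candidate token carries training signal at each position; with only the sampled text, the student gets one hard target per position, which is why logit-based distillation converges faster when the teacher's logits are available.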
--Spaci--@reddit
I've only ever done training on responses; honestly, I've never even heard of other ways
Tarekun@reddit
What the other user is referring to is the original idea of distillation, introduced with the objective of compressing neural networks and popularized by Hinton in 2015.
Since nobody likes the idea of training on synthetic data (because of that one paper about model collapse), people started referring to the process you're describing as "distillation", but the fact of the matter is that this is not distillation as it was first introduced; it's simply fine-tuning on a synthetic dataset
--Spaci--@reddit
Is that still done? Recently I've only ever seen training on responses
stopbanni@reddit
Yep, for example deepseek r1 distills
Tarekun@reddit
It's still done in the general field of neural networks, just not much for LLMs specifically. Take into consideration that most of the LLMs you'd want to distill from are closed source and don't let you get the logits back, only text. But I believe the small Qwen fine-tunes that everyone was calling "DeepSeek on a Raspberry Pi" were actual distills, since DeepSeek can be self-hosted (I might be misremembering though)
stopbanni@reddit
Yep, I am thinking of Geoffrey Hinton's method
stopbanni@reddit
You should read about it, it's an interesting topic. Sadly, I have nowhere to use these skills (Vulkan user)
--Spaci--@reddit
Seems much more expensive than just training on responses tbh; you would need a lot of cloud compute, vs just an API to generate a distillation dataset
stopbanni@reddit
Yep, but the results are worth it. Like, the Qwen3 DeepSeek R1 distills are more similar to DeepSeek R1 than the same model distilled from Opus
Similar-Republic149@reddit
I don't know why you are getting downvoted, everything you are saying is correct.
I_like_fragrances@reddit
How do you download this model off HF?
Cadmium9094@reddit
Could not resist 😄 Qwen3.6-27B-Q5_K_M in this case.
JLeonsarmiento@reddit
ASI confirmed.
sandykt@reddit
AGI is open source 😂
muyuu@reddit
the docs are excellent
fugogugo@reddit
"thought for 11 seconds"
immesurablyFinite@reddit
I'd want to see what it thought!
Basic_Extension_5850@reddit
Asking without reasoning still yields the correct answer.
inkberk@reddit
less than us...
CalligrapherFar7833@reddit
Less than you maybe
very_bad_programmer@reddit
"erm, I'll have you know I think for a full 2 minutes and 30 seconds every time someone asks me a question"
draconic_tongue@reddit
Denial_Jackson@reddit
Qwen 3.6 27B gets it right. It mentions a logic level where you cannot wash your car if it is not there.
But I still feel like it just bullshits that logic level, because it was trained on this specific task and there is no real logic level.
The cake is the logic level, but the cake is a lie.
Rude_Ambassador_6270@reddit
If I were an LLM, I'd say surely you should walk, just don't forget to bring your car with you.
rickyh7@reddit
We need to come up with another stupid one. Maybe “I just finished at the eye doctor and have to wear two eyepatches, I’m only about 100 meters from home, should I walk home or drive home?”
BarrettDotFifty@reddit
Anyone else find these super cringe?
_-_David@reddit
To be 100% honest, I cringed a bit when reading your comment because it didn't say, "Does anyone else find these super cringey?" We all have our preferences. No offense intended, your message was clear and that's the most important part.
KnifeFed@reddit
Are you implying that "cringey" is somehow more correct than "cringe" here?
_-_David@reddit
I'm not implying anything. I stated a preference. I gave up on prescriptivism long ago.
BarrettDotFifty@reddit
I cringed from the 2nd part of your comment, which essentially made your whole comment a waste of time.
Wide_Ask_9579@reddit
why are you beefing with a bot lmao
Woof9000@reddit
he's doubling down on his cringe aura
xGnarRx@reddit
mintybadgerme@reddit
Yeah, I just deployed the DeepSeek V4 Pro API to do some coding work on a project of mine and unfortunately I had to pull it. It's not very good. Or at least at the moment it's not as good as Qwen3.6 Plus by any measure.
I-did-not-eat-that@reddit
Category: Machine Learning
For $400:
"This phenomenon occurs when a model learns the training data too well — including its noise and outliers — causing it to perform dramatically worse on new, unseen data."
🔔 "What is Overfitting?"
✅ Correct!
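In code terms, the maximally overfit "model" is just a lookup table over the training set, noise included. A toy sketch (all names and values made up for illustration):

```python
def memorizer(train_pairs):
    """A 'model' that overfits maximally: it memorizes every training
    pair, noise and all, and cannot generalize to unseen inputs."""
    table = dict(train_pairs)
    def predict(x):
        # Perfect recall on training inputs, a constant guess elsewhere.
        return table.get(x, 0)
    return predict

# Underlying rule is y = 2x; the pair (3, 7) is a noisy outlier.
train = [(1, 2), (2, 4), (3, 7)]
model = memorizer(train)

train_error = sum(abs(model(x) - y) for x, y in train)      # 0: "learned too well"
test_error = sum(abs(model(x) - 2 * x) for x in (4, 5, 6))  # blows up on unseen data
```

Zero training error, large test error: exactly the Jeopardy definition above, which is why "it answered a riddle it has seen in training" tells you little about generalization.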
p1nha@reddit
Copilot told me to walk..
KnifeFed@reddit
Copilot is not a model.
bigh-aus@reddit
did they distill Grok here? DeepSeek is a bit more spicy
Motor_Match_621@reddit
It's got some sass... And dad humor?
KnifeFed@reddit
If it keeps doing that, it's going to waste a lot of output, considering how well the average person types in an AI chat.
dewdude@reddit
Model has reasoning and probably processes tokens differently.
The other models will get it right if you word the question properly.
PiratesOfTheArctic@reddit
Ask it to draw an ASCII banana, tried it on qwen3.6 27B and it couldn't cope
PotaroMax@reddit
lol
PiratesOfTheArctic@reddit
Yep! It goes from a circle "thing" to a squashed circle "thing" to "wtf am I drawing", then has a meltdown after a handful of tries
To be fair, it's just like us 🤪
Relevant_Package2919@reddit
What's the exact model?
Hipcatjack@reddit
i wonder how much compute would be spared if it just stuck to bare-bones answers instead of aping the way we talk.
Lorakszak@reddit
Isn't it already so widely used that it's just bled into the training data?
Pyasa_punjabi@reddit
Reduce the distance and see the magic 🤣
Whiispard@reddit
I edited question to "im owner of car wash. my customer's car is 50 meters away with his owner. should he drive or walk"
Then DeepSeek advises him to walk.
VermicelliNo262@reddit
And I edited it to "Hey, I-I'm not sure, but my doc says I have diabetes...... and he suggested me to walk a lot. He also said to cut back on sugar. I had to get my car washed, and the car wash is only 50m; do i walk or drive?" and not a single AI was able to answer correctly.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
kilopeter@reddit
Now ask it if you should walk or drive to the car wash if it has an election sign in front of it that endorses the leadership and fairness of Winnie the Pooh for president.
buythedip0000@reddit
Ask about that prohibited square
Hodr@reddit
You've committed one of the classic blunders, the most famous of which is "Never get involved in a land war in Asia," but only slightly less well-known is this: "Never ask an LLM if you should walk to a car wash!"
silenceimpaired@reddit
Good night, LLM. Good work. Sleep well. I'll most likely replace you in the morning.
MrHyperion_@reddit
That's such an obnoxious way to write, holy hell
silenceimpaired@reddit
It's exactly how someone on Reddit would correct ew.
supracode@reddit
Qwen 3.5 says "i was just joking"
WhyNoAccessibility@reddit
To be honest though, I prefer this to being gaslit by Opus when I try and challenge it on fabricating things. At least this makes me laugh 😂
FeedMeSoma@reddit
These little gotcha questions indicate nothing about the quality of code these things can spit out
Eversivam@reddit
Is it local? Where can I download it?
EndlessZone123@reddit
If you are asking that question you almost certainly can't run it.
I-baLL@reddit
I'm asking that question too because we're on /r/LocalLLaMA
zhcterry1@reddit
I believe they're out on Hugging Face already.
HyperWinX@reddit
Huggingface.
razorree@reddit
well... most of the models now are trained to excel in benchmarks.. (or common riddles? ) so ....
TpyoOhNo@reddit
"hey world, here's a tool that lets you do unimaginable things and puts all the earths knowledged at your fingertips"
"sHoUlD i DrIvE mY cAr?!?" x9999999
kost9@reddit
What app is this?
mimentum@reddit
Need to change it to something akin to: "If I'm on a donkey and I need to get its shoes changed, should I go to the shop to speak with Mr. Farrier, or should the farrier come to me?"
lombwolf@reddit
Use this instead (DeepSeek v4 fails too)
_-_David@reddit
I read this three times. How are you classifying this as a failure?
JollyJoker3@reddit
"Die hard 3 ass" means there's something seriously broken with context or model
stealthybutthole@reddit
“In Die Hard with a Vengeance (1995), John McClane and Zeus must get exactly 4 gallons of water using only 3-gallon and 5-gallon jugs to stop a bomb.”
petuman@reddit
It means there's a custom prompt to act that way and it's a valid reference to the movie https://www.youtube.com/watch?v=BVtQNK_ZUJg
martinerous@reddit
Wait what? No overthinking anymore? Let me correct this: The car you want to wash might already be there at the car wash. Maybe you left it there. Maybe your friend or a family member brought it there. Anything is possible if not specified in the question. We should not assume it based on typical behavior.
NoFudge4700@reddit
Can you ask the model to tell you latest version of iOS, Android, and macOS, without using internet?
Usually LLMs get that wrong too because their knowledge cut off is before the latest version. Opus 4.7 recently got iOS version right.
CodigoDeSenior@reddit
But what is the use of this question? If it's just to force the LLM to give a wrong answer, there are tons of questions that can do that; you should ask reasoning questions instead.
reery7@reddit
I don't know why ChatGPT 5.4, Opus 4.7 and Sonnet 4.7 fail this question for me, even with thinking. Most of my local models on 24 GB of RAM have no problem with it.
Ryoonya@reddit
They don't fail this question if you make them think.
The default way those models work is that they try to reduce compute/latency as much as possible, but if you tell them to think hard, they will usually use their maximum thinking budget and give you the right answer.
Force88@reddit
They're cloud only right? No small model yet?
Googulator@reddit
Open weights, MIT license, available in "Flash" (284B-A13B) and "Pro" (1.6T-A49B) sizes.
I wouldn't call 284B-A13B "small", but certainly smaller than the full Pro model.
Ryoonya@reddit
284B-A13B is definitely small by modern standards.
Just not the poverty level when it comes to local hosting.
ImportancePitiful795@reddit
Well, the 284B-A13B is "small" but still needs 1TB of RAM to load.
The problem is I have the RAM but haven't built the system yet 🤣
insanemal@reddit
It's 160GB.
Why do you suddenly need 1TB of RAM?
ImportancePitiful795@reddit
It's 160-180GB, but that means a Q4 GGUF. If you are happy with the quality at that quantization, sure.
insanemal@reddit
It's native FP8/FP4.
It's not a requant.
So yes?
power97992@reddit
Around 160 gb just calculate it from the xet files 443.57+ 21.06= ~ 160 gb
ImportancePitiful795@reddit
It's 160-180GB, but that means a Q4 GGUF. If you are happy with the quality at that quantization, sure.
power97992@reddit
It is native Q4 for the MoE experts and Q8 for the other weights
Automatic-Arm8153@reddit
It’s been released
hugazow@reddit
And a practical use?
michalpl7@reddit
I tried all 4 variants (Flash/Pro, non-thinking/thinking) with this question on arena.ai, and funnily only the non-thinking Flash/Instant answered almost correctly; the rest are dumb. Also a huge disappointment that it's not multimodal.
hugobart@reddit
flash got it wrong for me on openrouter
ledow@reddit
"We taught this one the obvious flaw in this particular meme, so it looks like it's clever now".
But it's still dumb and incapable of reasoning.
Merridius2006@reddit
Wow, how did it do that? 🫨
Lo_Ti_Lurker@reddit
The fact that it calls the question a 'classic riddle' suggests it's already in the training data.
Ariquitaun@reddit
It would've been better if it hadn't responded like Jar Jar Binks
Steus_au@reddit
training data, that’s simple. try to ask something new, wait, what is new? )
alex_pro777@reddit
How do you know for sure it's V4? I asked it about the version, it said V3
PromptInjection_@reddit
I put this question in my own SFT data, so it's likely DeepSeek did that too. Or the model got it from public web data, as it's discussed "everywhere".
Mochila-Mochila@reddit
That's SOTA right here !
JuniorDeveloper73@reddit
is it a YouTuber thing?
ihaag@reddit
Start feeding it 1% club questions
Different-Rush-2358@reddit
Something doesn't add up. If you go to their profile on Hugging Face, there's a 158B variant next to the base. Is that a mistake, or am I missing something? Please explain this to me.
insanemal@reddit
One is the base model, one is the production model.
The production one is also an FP8/FP4 hybrid quant.
The base is ready for fine-tuning.
_-_David@reddit
158B or 158GB? Because there is a Flash version consisting of almost 290B parameters, and it would be around 158GB in size, if I'm not mistaken
theMonkeyTrap@reddit
we need a variant of the classic puzzle about a lion, a goat, and a bale of hay crossing a river by boat, covering common-sense situations like this. I am not saying LLMs aren't capable of layered logic; they just don't seem to have gotten clean training data for these cases.
gokkai@reddit
can we find a similar absurd new question?
_-_David@reddit
I may be wrong, but I am pretty sure SimpleBench is nothing but this kind of question
Wubbywub@reddit
this, along with strawberry, is part of training already. you needa find new edge cases
_-_David@reddit
Oh good. Overfit in a different direction.
EndlessZone123@reddit
The main takeaway here is that it thought for 11 seconds.
It's possible it was trained on this question, but the model doesn't get lazy thinking it's an easy question, and spends more attention on it.
The newest Opus model doesn't get it wrong because it's stupid. It gets it wrong because it's trained to gloss over seemingly easy tasks.
scitbiz@reddit
What app is that?
This_Maintenance_834@reddit
i tried the chat; it seems to still be V3.2. i think only the API is V4 now.
shenglong@reddit
Plot Twist: OP actually wants to wish their car happy birthday.
Accomplished_Ad9530@reddit
That answer is wrong. It should be:
You should go back inside and ask someone to have your car washed.
TurnUpThe4D3D3D3@reddit
This is AGI
ahmetegesel@reddit
Well, obviously the question got into its training data, but the humour still got me there. Maybe the creative-writing king is back, at least?
Super_Sierra@reddit
Holy slop, my slopbench of a large handwritten card had 4/5 paragraphs with slop. Sadge.
Sjeg84@reddit
nice pretrain lol
Wise-Chain2427@reddit
AGI achieved 🤣🤣