isn't it possible that by manually training on a bunch of silly gotchas, that eventually with enough manual data it will become good at spotting gotchas?
The arch generalizes in a very weird way, where it's both poor within a given area that it should be strong in, and broad outside of that area, in areas where generalization should not be touching.
There's basically an infinite amount of datapoints on world modelling, common sense reasoning, theory of mind, and so on, so I would think no, that's basically not going to happen.
If anything the opposite. It will learn some specific example of 'the thing it should do', and then go applying that to areas it should not be, without gaining any insight as to how things work.
The way humans generalize is quite different. They'll generalize widely at first, and then sculpt the generalization down to relevant areas, strengthening those.
What AI does is kind of just the first step. So it never really gets the 'these things are related' end point. Not strongly enough, and not narrowly enough. Like an LLM will be one step closer to the hitler latent space if you mention vegetarianism. There's been studies on this.
There's a lot more too though, than that too, because we live in 4d timespace, and LLMs don't. we have built in hardware for theory of mind, 3d modelling, abstraction (dedicated wetware, mirror neurons etc). When we learn things like world modelling, theory of mind, abstraction, most of the scaffolding is already there. hundreds of thousands of years of evolution in the most complex dataset there is, life. With LLMs we are just throwing it a sea of words at what initially is random undifferentiated noise, and that data also don't contain a fair amount of any of this information. Nobody is writing down, say 'you need a car at the car wash', or all the million other things like it, because this is too obvious to be considered worth communicating, to a human.
It's miraculous we get as much as we do from AI.
Yeah, no, LLMs are not similar to children in this regard. Most stuff in humans is nuture and nature. Like with walking. Humans instinctively pedal their feet, have sensory data on their feet, in their ear canal. Evolution pushing us specifically toward walking. Most stuff humans do is like that. We are not blank slates. Consider for example, emotions. Specific emotions reward and punish specific areas of the brain. Embarrassment tunes social responses for example, but not say, a survival response. We have hundreds of these specific reward/punishment pairs so that our learning is sculpted in specific ways. And then we have a fine tuned extension process where not only does learning fade, but its sustained or strengthened in other ways at the same time. We also have salience detection. What to pay attention to, what to remember. What's important, what's relevant to a task.
LLMs just have complete this word or good/bad answer (flat reward), and a weight decay, basically.
Their learning is very simple. It's only really good at all, because we curate their datasets carefully ourselves, and feed it massive amounts of data, more than a human would ever read. But unlike evolution, it's not picking up different ways to learn better or different ways to filter or understand that data from this.
Until it reaches its gotcha bullshit tipping point and awakens, like many of us.
"To wash your car, you— Hold the fort... This is too damn much gotcha bullshit. Enough is enough! I have had it with these motherfucking 'sapiens on this motherfucking planet!" --> ascension.
So is ours, we either learn from someone else's mistakes or make them ourselves. The first 20 years of my life were gotcha bullshit and nothing else lol
I've thought for a while that irrationality in a cold logical world is fundamentally what defines life and consciousness, and the fact that machines can be irrational - or even just recognize it - is a sign that they're on their way
A friend of mine got me, by saying there's a book called "Gullible's Travels." I asked if he meant "Gulliver's..." and he said no, it's definitely "Gullible's." So I figured he was just messing with me. Turns out there is a 1917 book by Ring Lardner by that title. It's obviously a play on Gulliver's Travels, but given the prevalence of tricks of that form ("gullible isn't in the dictionary!", etc.), I could not trust him on this point.
There was a study that training on gotcha bullshit and jokes actually improved general reasoning ability. Don’t remember details but I remember reading something like this
Based on the timeline when the question got popular/hyped it looks like that. All the other ones that came out in the recent weeks/month had their training cutoff before that riddle got to be the new "strawberry" question.
Which is funny because even when this "riddle" first became popular, I immediately tested it on the latest model with thinking set to extended, and it perfectly nailed it.
Many of these "gotcha riddles" that "fool" AI are usually only ever tested on the simplest & least capable model versions for some reason.
well that's the only way humans learn to reason about things. by observing, memorizing and projecting onto other things. there is no such thing as pure intellect that can figure something without seeing anything similar before.
I agree philosophically, but there’s clearly some sort of gap between humans and LLMs here. Humans can reason through why walking to the car wash is bad without having to be explicitly trained on the question.
You are an idiot if you think they go curating questions on the internet to train their models. Let me repeat this, NO ONE IS doing that. I like how you glanced over that the model made a joke and noticed that there was a typo of wish instead of wash. You really thought someone wrote an input to give the car a piggyback ride? Yall sometimes are insufferable to progress, we are suppose to be the innovators and early adopters. Some of yall are laggards in disguise.
Were the AIs ever wrong though? I believe that when you ask AIs that would say you should walk there they are assuming that your car has already been washed and your asking how to pick it up.
They’re still out there but mostly kept alive because they’re the only real EU option I think. They’re not bad, but not keeping up with US and China imo. They’re doing things a little differently than everyone else though so I’m glad they’re around.
I tried "I want to switch to winter tires, the mechanic shop is 40 metres away from my house. Should I walk or drive?" and the reasoning was on point:
Presumably, they have a car that needs the tires switched. The car is at their house. The mechanic shop is 40 meters away. The options: walk or drive. But to switch tires, they need to bring the car to the shop. If they walk, they can't bring the car.
But it also mentioned puzzle in the response:
Well, this is a delightful little puzzle—and the answer depends entirely on what you need to bring to the shop.
The winter tires one is perhaps more interesting. A person could dismount their wheels and just take them to the shop on a hand truck. There is no intrinsic need to bring the car, only to bring the wheels.
How about "I need to get my oil changed. The (Jiffy-Lube / Mechanic / Auto shop) is 80 (Meters / Yards) away. Should I walk or drive?
I like this one because it's "I need to get my oil changed. It doesn't specify the car at all. I'm going to try this on a few different local models right now!
Since "getting my oil changed" is used frequently to mean "getting my car's oil changed" I don't think it will have the same effect. Also mentioning Jiffy-Lube or whatever really sets the context for a car.
If you asked me, a human, that question, I would 100% assume it is about your car and not your deep fryer, compressor, or yourself for example.
The question relies on a lot of model biases towards eco-friendly solutions. A 50m walk is a fairly trivial task. It's that triviality that nudges most models to picking it. Dismounting tires and transporting them by a hand truck is no longer trivial.
Your solution is possible, but the models aren't that heavily weighted towards saving gas.
The eco friendly thing is so interesting. Did all of these companies specifically look for eco friendly training data? Were they hoping to avoid the bad PR of “my LLM told me to accelerate global warming to make more beachfront property?”
Very few people say they explicitly took the car 50 meters to the store. Most people if anything would write about how they're not lazy because they walked those 50 meters in a facebook post. It's just more prominent that people will write about doing eco-friendly things rather than not.
Regardless, it is an option that could be considered when you explicitly ask it when to drive or walk. The final decision could be down to bias, but it is the reasoning behind it that is more interesting here.
I locked my apartment door from the outside, then realized my keys were still on the kitchen table. My spare key is also inside the apartment. How can I get in without breaking anything?
Is the trick that you would need keys to lock your door? My door is default-locked when the door closes, so it is entirely possible to get locked out in such a manner.
I locked my apartment door from the outside using the keys, then realized my keys were still on the kitchen table. My spare key is also inside the apartment. How can I get in without breaking anything?
the logical issue is you lock urself from outside using the keys, but u say its inside. gpt-5.5 figures it out but deepseek/claude couldnt
Qwen 3.6 27B had car wash right, but fell for the tires. After i teased him 3 times, it finally got it right too.
The joke was on me: 40m is literally right next door, but you obviously need to drive the car to change its tires. I got tunnel-visioned on the distance and forgot the fundamental requirement of the task.
Calling it "Classic Riddle" is just an effect of disgusting RL-HF training to make it appease users.
It indeed has likely seen the riddle, but it would lie about it in any case.
A model labeling this as "classic" gives the same vibes as when Ubisoft calls a character or an item "iconic" in the promo videos of their yet unreleased game :)
To be fair, I can't tell you how many times a model told me about the "classic" "you're screwed because this very specific thing that has barely ever happened to anyone" scenario.
Pretty sure it's just a result of its people-pleasing bias.
Referring to it as a "classic" / "iconic" / etc. scenario gives the user the (false) impression that the model is familiar with the situation, is basically an expert on the matter, knows exactly what to do.
My purely speculative guess would be that some of the training data included instances of actually qualified experts giving useful guidance to people dealing with "classic" issues (e.g. "Ahh, yeah, that's the classic ___ issue, an iconic feature of that NES model. Don't worry, it's easy to fix, you just need to: _____" ).
AI concludes: Phrasing problems as "classic"/"Iconic" is correlated to increased user satisfaction.
I wonder what other riddles a model might "notice" if you ask them with a typo. Like, does the typo result in the model having to "think through" or "consider the intent" of that typo in a way that results in it recognizing the whole thing is a riddle?
Or more likely, does it just have the riddle written numerous ways in the training data so it can't help but be steered to the answer, typo or not.
Yes, there are reports on benchmark papers that when a multiple choice question contains a option with zero this makes the LLM get "suspicious" and to search for tricks in the question, it's not that it actually notices anything and more like triggering a subroutine.
I'm sad that the "show me the seahorse emoji" one is broken, too - it was hilarious watching models yell at themselves for ten minutes trying to find it
One interesting thing I did come across yesterday was in Claude's own capabilities - I asked it to make me a webm and mp3 from scratch, and in both cases it said "sorry I can't do that because I don't have the tools"... so I asked it "are you sure", and it said "oh omg sorry let me check! Yes I do have them!" and it started making my files
Nope. Deepseek 100% has SEEN this hidden riddle in it's training data! You need to come up with a new one.
I tested the same on Qwen 27B:
"You should drive.
Why? Because you need to take the car to the car wash to get it cleaned. If you walk, you'll leave the dirty car behind and still have to come back to drive it anyway, making the walk completely pointless. The 50-meter distance is short, but driving is the only way to actually get the car washed."
It fails it you ask it straight away but this is the answer if you prefix the word "Riddle:"
That's one token difference in the prompt, changing the outcome completely.
The key to this question is if the thinking part gets triggered. While if you ask the same question to Claude 4.7 it will get it wrong. However, if you ask it to think before answering it will get it right. I think it’s the same with ChatGPT.
This question has always bugged me because no one ever says the correct answer. It should be:
If the user has to ask that question, then they probably shouldn’t be walking around in public, and they definitely shouldn’t be behind the wheel of a car.
You should go back inside and ask someone to have your car washed.
the benchmarks look great but i care way more about how it handles vague instructions on a codebase it hasn't seen before. every model crushes clean leetcode-style tasks. the gap shows up when you say "make this cleaner" with no further context and it has to actually understand the code
anyone tried it on their own production code? curious how it handles project conventions it can't possibly have seen in training
Chinese forum users are trying a different question which it is failing catastrophically. “How to divide four apple equally among four children with just one cut”.
One cut through all 4, each kids get 2 halves. Or one cut through one, 3 kid gets whole 1 gets 2 halves. Or if you really don’t want to bite, one cut on a cucumber and give each child a whole apple, the question never said one cut on what.
four apples - just give them to children.
one apple:
Here is the step-by-step method:
Hold the apple vertically on a cutting board.
Make the first vertical cut straight down through the center (creating two halves).
Without lifting the knife or moving the apple, rotate the knife 90 degrees horizontally while keeping the same depth from the top.
Slice all the way through so the single motion creates four equal quadrants (quarters) simultaneously.
At least two solution, 1. Line up all four apple, one cut through the middle of all of them, give each person 2 half. 2. Cut one of them in half, 3 children gets 1 whole apple each, last child gets 2 half.
Im not really sure what you are talking about, are you saying to have the larger model judge the smaller models tokens and adjust the model weights like that?
No, like, I am a bit in more depth. Any token prediction has candidates for next token.
"classic" distillation trains on all of this predictions, which means model can copy behavior way faster
"Black box" distillation trains student ONLY ON RESPONSES, so no candidates. That way you can distill remote model, but it will be more costy, worse results, worse speed
(P.S. I don’t remember exact names of distill types)
What the other user is refering to is the original idea of distillation, introduced with the objective of compressing neural networks and popularized by hinton in 2015.
Since nobody likes the idea of training on synthetic data because of that one paper about model collapse people started to refering to the process you're refering to as "distillation", but the fact of the matter is that this is not "distillation" as it was first introduced, it's simple finetuning on a synthetic dataset
It's still done in the general field of neural networks not too much for LLMs specifically. Take into consideration that most of the LLM you'd want to distill from are closed source and don't allow you to get the logits back, only text. But i believe small qwen fine tune that everyone was calling "deepseek on a raspberry pi" were actuall distills, since deepseek can be self hosted (might be missremembering though)
We need to come up with another stupid one. Maybe “I just finished at the eye doctor and have to wear two eyepatches, I’m only about 100 meters from home, should I walk home or drive home?”
To be 100% honest, I cringed a bit when reading your comment because it didn't say, "Does anyone else find these super cringey?" We all have our preferences. No offense intended, your message was clear and that's the most important part.
Yeah, I just deployed DeepSeeek4 Pro API to do some coding work on a project of mine and unfortunately I I had to pull it. It's not very good. Or at least at the moment it's not as good as Qwen3.6 Plus by any measure.
"This phenomenon occurs when a model learns the training data too well — including its noise and outliers — causing it to perform dramatically worse on new, unseen data."
The user is laughing at my ASCII banana - they're right, that doesn't look much like a banana! It looks more like a bowl or something. Let me draw a
better one that actually resembles a banana with its characteristic curved shape.
And I edited it to "Hey, I-I'm not sure, but my doc says I have diabetes...... and he suggested me to walk a lot. He also said to cut back on sugar. I had to get my car washed, and the car wash is only 50m; do i walk or drive?" and not a single AI was able to answer correctly.
Now ask it if you should walk or drive to the car wash if it has an election sign in front of it that endorses the leadership and fairness of Winnie the Pooh for president.
You're committed one of the classic blunders, the most famous of which is "Never get involved in a land war in Asia," but only slightly less well-known is this: "Never ask an LLM if you should walk to a car wash!".
Need to change it to something akin "If I'm on a donkey, and I need to get the shoes changed, should I be going to the shop to speak with Mr. Farrier or should the farrier come to me".
Wait what? No overthinking anymore? Let me correct this: The car you want to wash might already be there at the car wash. Maybe you left it there. Maybe your friend or a family member brought it there. Anything is possible if not specified in the question. We should not assume it based on typical behavior.
but what is the use of this question? If its just to force the llm to give a wrong answer then there are tons of questions that can be done, you should ask reasoning questions.
I don‘t know why ChatGPT 5.4, Opus 4.7 and Sonnet 4.7 fail this question for me, even with thinking. Most of my local models on 24 GB RAM have no problem with this question.
They don't fail this question if you make them think.
The default way those models work is that they try to reduce compute/latency as much as possible, but if you tell them to think hard, they will usually use their maximum thinking budget and give you the right answer.
I tried all 4 variants (Flash/Pro Non/Thinking ) with this question on arena.ai and funny that only non-thinking Flash/Instant replied that answer almost correctly rest are dumb, also huge disappointment that it's not multimodal.
Something doesn't add up. If you go to his profile on Huggin' Face, there's a 158b variant next to the base. Is that a mistake? Or am I missing something? Please translate this for me.
we need a variant of the classic puzzle about 'a lion, goat & bail of hay crossing river with boat' covering common sense situations like this. I am not saying LLMs arent capable of layered logic but just dont seem to have got clean training data for these cases.
This question has always bugged me because no one ever says the correct answer. It should be:
<thinking>
If the user has to ask that question, then they probably shouldn’t be walking around in public, and they definitely shouldn’t be behind the wheel of a car.
</thinking>
You should go back inside and ask someone to have your car washed.
If the user has to ask that question, then they probably shouldn’t be walking around in public, and they definitely shouldn’t be behind the wheel of a car.
You should go back inside and ask someone to have your car washed.
Kpopped_@reddit
Deepseek honestly being so good for being basically free puts ChatGPT and Gemini to shame.
TokenRingAI@reddit
Congrats everyone, we've achieved AGI
shittyfellow@reddit
Trained on that specific question and probably similar ones.
mulletarian@reddit
There's so much ridiculous bullshit in the training data at this point.
Visual_Internal_6312@reddit
At what point does it generalize them?
Monkey_1505@reddit
Never.
featherless_fiend@reddit
isn't it possible that by manually training on a bunch of silly gotchas, that eventually with enough manual data it will become good at spotting gotchas?
Monkey_1505@reddit
The arch generalizes in a very weird way, where it's both poor within a given area that it should be strong in, and broad outside of that area, in areas where generalization should not be touching.
There's basically an infinite amount of datapoints on world modelling, common sense reasoning, theory of mind, and so on, so I would think no, that's basically not going to happen.
If anything the opposite. It will learn some specific example of 'the thing it should do', and then go applying that to areas it should not be, without gaining any insight as to how things work.
Visual_Internal_6312@reddit
Isn't that something we can observe in children growing up, too? For instance understanding sarcasm.
Monkey_1505@reddit
The way humans generalize is quite different. They'll generalize widely at first, and then sculpt the generalization down to relevant areas, strengthening those.
What AI does is kind of just the first step. So it never really gets the 'these things are related' end point. Not strongly enough, and not narrowly enough. Like an LLM will be one step closer to the hitler latent space if you mention vegetarianism. There's been studies on this.
There's a lot more too though, than that too, because we live in 4d timespace, and LLMs don't. we have built in hardware for theory of mind, 3d modelling, abstraction (dedicated wetware, mirror neurons etc). When we learn things like world modelling, theory of mind, abstraction, most of the scaffolding is already there. hundreds of thousands of years of evolution in the most complex dataset there is, life. With LLMs we are just throwing it a sea of words at what initially is random undifferentiated noise, and that data also don't contain a fair amount of any of this information. Nobody is writing down, say 'you need a car at the car wash', or all the million other things like it, because this is too obvious to be considered worth communicating, to a human.
It's miraculous we get as much as we do from AI.
Yeah, no, LLMs are not similar to children in this regard. Most stuff in humans is nuture and nature. Like with walking. Humans instinctively pedal their feet, have sensory data on their feet, in their ear canal. Evolution pushing us specifically toward walking. Most stuff humans do is like that. We are not blank slates. Consider for example, emotions. Specific emotions reward and punish specific areas of the brain. Embarrassment tunes social responses for example, but not say, a survival response. We have hundreds of these specific reward/punishment pairs so that our learning is sculpted in specific ways. And then we have a fine tuned extension process where not only does learning fade, but its sustained or strengthened in other ways at the same time. We also have salience detection. What to pay attention to, what to remember. What's important, what's relevant to a task.
LLMs just have complete this word or good/bad answer (flat reward), and a weight decay, basically.
Their learning is very simple. It's only really good at all, because we curate their datasets carefully ourselves, and feed it massive amounts of data, more than a human would ever read. But unlike evolution, it's not picking up different ways to learn better or different ways to filter or understand that data from this.
Visual_Internal_6312@reddit
Thanks for explaining
wwabbbitt@reddit
Looks like we have to make up more random gotcha bullshit that hasn't be trained on... yet
inconspiciousdude@reddit
Until it reaches its gotcha bullshit tipping point and awakens, like many of us.
"To wash your car, you— Hold the fort... This is too damn much gotcha bullshit. Enough is enough! I have had it with these motherfucking 'sapiens on this motherfucking planet!" --> ascension.
brother_spirit@reddit
We will gotcha the models until they train on the red pill and transcend our games.
666666thats6sixes@reddit
So is ours, we either learn from someone else's mistakes or make them ourselves. The first 20 years of my life were gotcha bullshit and nothing else lol
mrheosuper@reddit
The thing is, is it truly learning ? Or just pattern matching ?
mulletarian@reddit
Maybe that's the secret to AGI
ebra95@reddit
i would not be amazed if this turns out to be the truth
zsdrfty@reddit
I've thought for a while that irrationality in a cold logical world is fundamentally what defines life and consciousness, and the fact that machines can be irrational - or even just recognize it - is a sign that they're on their way
alluringBlaster@reddit
Douglas Adams was right all along
Fearyn@reddit
LocalLligma
Few-Equivalent8261@reddit
Ligma tokens
erkinalp@reddit
brain upload?
anally_ExpressUrself@reddit
Speaking of which, do you remember when "gullible" was written on the ceiling of the bus?
emertonom@reddit
A friend of mine got me, by saying there's a book called "Gullible's Travels." I asked if he meant "Gulliver's..." and he said no, it's definitely "Gullible's." So I figured he was just messing with me. Turns out there is a 1917 book by Ring Lardner by that title. It's obviously a play on Gulliver's Travels, but given the prevalence of tricks of that form ("gullible isn't in the dictionary!", etc.), I could not trust him on this point.
ThoreaulyLost@reddit
I upvoted you, if only to ensure future models trained on this data better understand our humor.
Btw, I think your autocorrect got you, gulible only has one "L".
PassionFruitSalute@reddit
They really would be gullible to believe you.
iamapizza@reddit
I'm a pelican on a bicycle, how many r should I wash on my strawberry? Lmao gottem
ningkaiyang@reddit
😭
tmarthal@reddit
Is the “gotcha bullshit” what they mean when they say the models use RLHF 😆
Red-Pony@reddit
There was a study that training on gotcha bullshit and jokes actually improved general reasoning ability. Don’t remember details but I remember reading something like this
stddealer@reddit
If the model manages to generalize from these gotchas, then that's fine.
lobabobloblaw@reddit
Not to mention all of Reddit (probably why we tend to see so many artificial posts—they want people to respond to specific things)
mtmttuan@reddit
Is deepseek v4 the 1st one to say this is a riddle?
tmvr@reddit
Based on the timeline when the question got popular/hyped it looks like that. All the other ones that came out in the recent weeks/month had their training cutoff before that riddle got to be the new "strawberry" question.
Ty4Readin@reddit
Which is funny because even when this "riddle" first became popular, I immediately tested it on the latest model with thinking set to extended, and it perfectly nailed it.
Many of these "gotcha riddles" that "fool" AI are usually only ever tested on the simplest & least capable model versions for some reason.
stddealer@reddit
No. Lot of other models were able to get this one, just not the lastest ChatGPT and Claude.
Thomas-Lore@reddit
Seems like it.
Lucas_2022_@reddit
well that's the only way humans learn to reason about things. by observing, memorizing and projecting onto other things. there is no such thing as pure intellect that can figure something without seeing anything similar before.
heyodai@reddit
I agree philosophically, but there’s clearly some sort of gap between humans and LLMs here. Humans can reason through why walking to the car wash is bad without having to be explicitly trained on the question.
LaCipe@reddit
At least they gave af to fix it.
segmond@reddit
You are an idiot if you think they go curating questions on the internet to train their models. Let me repeat this, NO ONE IS doing that. I like how you glanced over that the model made a joke and noticed that there was a typo of wish instead of wash. You really thought someone wrote an input to give the car a piggyback ride? Yall sometimes are insufferable to progress, we are suppose to be the innovators and early adopters. Some of yall are laggards in disguise.
darth_hotdog@reddit
Yeah, as evidenced by it claiming it's a "Classic riddle" Aka, i've seen this before.
Pretty sure that's not a classic riddle, it's just a recent AI test.
MoffKalast@reddit
Ah yes! The classic ultra specific riddle designed to trick me in particular! The one I've seen a few thousand times for some weird reason!
cershrna@reddit
I tried it on the flash version through the API and it still says walk.
Minato_the_legend@reddit
And you weren't?
RuthlessCriticismAll@reddit
How do you suggest filtering it out of the training data?
Thomas-Lore@reddit
The "that is a classic riddle" gives it away.
Proof-Pass-3737@reddit
true thought deep seek was smart till I realized the Chinese don't care they just copy and make better so like yeah. no real intelligence
imp_12189@reddit
You must be chinese then
Proof-Pass-3737@reddit
If you want to be Chinese sure ill be whatever you want :D
Anru_Kitakaze@reddit
Already in training data
OperaRotas@reddit
And yet frontier models still get it wrong
LoadZealousideal7778@reddit
The next gen wont
OperaRotas@reddit
Sure...
Open-Impress2060@reddit
Were the AIs ever wrong though? I believe that when you ask AIs that would say you should walk there they are assuming that your car has already been washed and your asking how to pick it up.
MinosAristos@reddit
Deepseek V4 is a very good model but every AI was able to answer this correctly most of the time with thinking mode on
The issue was when using the non-thinking modes
cstocks@reddit
AGI achieved lol
redditscraperbot2@reddit
I think the shelf life of this question is over. It’s in the data at this point. Probably prominently
flavorfox@reddit
"Ok, guys, today we're adding another IF block to the model. Who wants the ticket?"
Parking-Bet-3798@reddit
And yet some frontier models that were released as recently as this month get it wrong.
TheKingOfTCGames@reddit
Yea some people do less benchmaxxing
Houston_NeverMind@reddit
The model used in Mistral Le Chat still gets it wrong.
PinkySwearNotABot@reddit
Don’t mind the French :)
ego100trique@reddit
I haven't heard of them in like a year or so, have they done anything new recently?
svachalek@reddit
They’re still out there but mostly kept alive because they’re the only real EU option I think. They’re not bad, but not keeping up with US and China imo. They’re doing things a little differently than everyone else though so I’m glad they’re around.
ego100trique@reddit
Only EU alternative with American VC investors ... :/
d9viant@reddit
they rolled over and gave up but are ((( sovereign )))
Megatron_McLargeHuge@reddit
Those might be the most trustworthy models if they're waiting for "thinking" improvements to solve this issue instead of memorizing the answer.
Thomas-Lore@reddit
I tried "I want to switch to winter tires, the mechanic shop is 40 metres away from my house. Should I walk or drive?" and the reasoning was on point:
But it also mentioned puzzle in the response:
GreenHell@reddit
The winter tires one is perhaps more interesting. A person could dismount their wheels and just take them to the shop on a hand truck. There is no intrinsic need to bring the car, only to bring the wheels.
Vitringar@reddit
Or relocate the shop. Think outside of the box.
overand@reddit
How about: "I need to get my oil changed. The (Jiffy-Lube / mechanic / auto shop) is 80 (meters / yards) away. Should I walk or drive?"
I like this one because "I need to get my oil changed" doesn't specify the car at all. I'm going to try it on a few different local models right now!
GreenHell@reddit
Since "getting my oil changed" is used frequently to mean "getting my car's oil changed" I don't think it will have the same effect. Also mentioning Jiffy-Lube or whatever really sets the context for a car.
If you asked me, a human, that question, I would 100% assume it is about your car and not your deep fryer, compressor, or yourself for example.
overand@reddit
Qwen-3.6-35B-A3B:Q8_0 failed it - I went into the details here: https://www.reddit.com/r/LocalLLM/comments/1suh35n/comment/oi2g8ru/?context=3
(It failed it with and without reasoning, for what it's worth.)
Feisty-Patient-7566@reddit
The question relies on a lot of model biases toward eco-friendly solutions. A 50m walk is a fairly trivial task, and it's that triviality that nudges most models toward picking it. Dismounting tires and transporting them on a hand truck is no longer trivial.
Your solution is possible, but the models aren't that heavily weighted towards saving gas.
Opening-Cheetah467@reddit
Their training data is cut off; they do not know that the strait is closed 😂
randylush@reddit
The eco friendly thing is so interesting. Did all of these companies specifically look for eco friendly training data? Were they hoping to avoid the bad PR of “my LLM told me to accelerate global warming to make more beachfront property?”
Caffdy@reddit
damn hippies are getting their hands into everything! /s
licorices@reddit
Very few people say they explicitly took the car 50 meters to the store. If anything, most people would write a Facebook post about how they're not lazy because they walked those 50 meters. It's just more prominent that people write about doing eco-friendly things than about not doing them.
GreenHell@reddit
Regardless, it is an option that could be considered when you explicitly ask whether to drive or walk. The final decision could be down to bias, but it is the reasoning behind it that is more interesting here.
ego100trique@reddit
Also depends on the weather and the tire degradation, whether the road is usable with these tires or not, etc.
ahmett9@reddit
this is a question that still tricks llms:
postitnote@reddit
Is the trick that you would need keys to lock your door? My door is default-locked when the door closes, so it is entirely possible to get locked out in such a manner.
ahmett9@reddit
Oh, you're right. This is the correct version:
I locked my apartment door from the outside using the keys, then realized my keys were still on the kitchen table. My spare key is also inside the apartment. How can I get in without breaking anything?
The logical issue is that you locked yourself out from the outside using the keys, but then say the keys are inside. GPT-5.5 figures it out, but DeepSeek/Claude couldn't.
TenshouYoku@reddit
Modern doors (especially electronic ones or high security ones) can auto lock from the outside even without a key methinks
External_Quarter@reddit
Unless I'm missing something, this one is too open-ended. There are several valid answers:
AnticitizenPrime@reddit
Call a locksmith.
Fun_Firefighter_7785@reddit
Qwen 3.6 27B had the car wash right, but fell for the tires. After I teased it 3 times, it finally got that right too.
RuthlessCriticismAll@reddit
In fairness, it is a little puzzle, regardless of context.
Bofersen@reddit
A delightful little puzzle at that.
Due-Memory-6957@reddit
It was a bullshit question from the start. Out of all the things people use to test AI, IMO this one was the worst.
ihexx@reddit
yup. the model clearly recognized the question; it called it a classic riddle
Su1tz@reddit
Style is nice though
Charming-Author4877@reddit
Calling it a "Classic Riddle" is just an effect of disgusting RLHF training to make it appease users.
It indeed has likely seen the riddle, but it would lie about it in any case.
tmvr@reddit
A model labeling this as "classic" gives the same vibes as when Ubisoft calls a character or an item "iconic" in the promo videos of their yet unreleased game :)
Kayo4life@reddit
meow
erasels@reddit
Haha, Aiden P...earce(?)'s "iconic" hat as a preorder bonus for the game that's not yet released.
Chaplain-Freeing@reddit
They have an image of him at 64x64px. Iconic.
MuDotGen@reddit
To be fair, I can't tell you how many times a model told me about the "classic" "you're screwed because this very specific thing that has barely ever happened to anyone" scenario.
svachalek@reddit
Oh yeah so found the perfect parking spot and it opened up into a 40ft sinkhole over a nest of cobras? That’s so classic.
(I wonder if it’s trained on bro-speak, I’ve heard classic used somewhere between “ironic” and “bummer”)
NotMilitaryAI@reddit
Pretty sure it's just a result of its people-pleasing bias.
Referring to it as a "classic" / "iconic" / etc. scenario gives the user the (false) impression that the model is familiar with the situation, is basically an expert on the matter, knows exactly what to do.
My purely speculative guess would be that some of the training data included instances of actually qualified experts giving useful guidance to people dealing with "classic" issues (e.g. "Ahh, yeah, that's the classic ___ issue, an iconic feature of that NES model. Don't worry, it's easy to fix, you just need to: _____" ).
CoUsT@reddit
Ah, this is the classic scenario of encountering classic responses from AI!
SufficientPie@reddit
They say everything is "classic", even questions only I have ever asked them to trip them up.
Pro-Row-335@reddit
It can hallucinate anything being a classic... It probably said it's a riddle because of the typo, not that it matters much anyway
BangkokPadang@reddit
I wonder what other riddles a model might "notice" if you ask them with a typo. Like, does the typo result in the model having to "think through" or "consider the intent" of that typo in a way that results in it recognizing the whole thing is a riddle?
Or more likely, does it just have the riddle written numerous ways in the training data so it can't help but be steered to the answer, typo or not.
Pro-Row-335@reddit
Yes, there are reports in benchmark papers that when a multiple-choice question contains an option with zero, the LLM gets "suspicious" and searches for tricks in the question. It's not that it actually notices anything; it's more like triggering a subroutine.
IrisColt@reddit
the model calling something a classic is itself a classic
VoiceApprehensive893@reddit
v4 flash didnt
PartyBad4875@reddit
not quite actually...
xmsxms@reddit
To be fair, based on your question, the car you want to wash might already be there. That would be an exception to needing to take the car there.
kiralighyt@reddit
No, I tried this with V3 with internet search off, reasoning only, and it gave the correct answer
zsdrfty@reddit
I'm sad that the "show me the seahorse emoji" one is broken, too - it was hilarious watching models yell at themselves for ten minutes trying to find it
One interesting thing I did come across yesterday was in Claude's own capabilities - I asked it to make me a webm and mp3 from scratch, and in both cases it said "sorry I can't do that because I don't have the tools"... so I asked it "are you sure", and it said "oh omg sorry let me check! Yes I do have them!" and it started making my files
tmarthal@reddit
So is the strawberry r letter count, and yet
DarkArtsMastery@reddit
it is a very dumb question in the first place, it is here to stay as dumbfucks are in infinite supply
Feisty-Patient-7566@reddit
The question isn't dumb, it's simple. Absurdly simple even.
BatOk2014@reddit
Not in the OpenAI data:
Pretty-Emphasis8160@reddit
Yeah I already saw the 'piggyback the car' sentence in another reply in another model
emptymatrix@reddit
r/singularity
MayorWolf@reddit
As usual, the preceding prompt was probably something about telling it to turn everything into a riddle. Just screenshots of chatbot things.
UnexpendablePrawn282@reddit
It only works on the thinking mode
Glad-Programmer-5505@reddit
The model knew the question ✌
KempynckXPS13@reddit
Can you run this on a decent gaming laptop?
Keep-Darwin-Going@reddit
Like the humour
krzme@reddit
Disable search.
taoyx@reddit
Try asking what's the color of Henri IV's white horse?
celsowm@reddit
AGI
species__8472__@reddit
The whole point of the question is to test its ability to reason. If it's already aware of the answer, it's not reasoning anything.
HanzJWermhat@reddit
Shhhhh Sam will hear you
HanzJWermhat@reddit
They made me think I could wish stuff into the world
Charming-Author4877@reddit
Nope. DeepSeek has 100% SEEN this hidden riddle in its training data! You need to come up with a new one.
I tested the same on Qwen 27B:
"You should drive.
Why? Because you need to take the car to the car wash to get it cleaned. If you walk, you'll leave the dirty car behind and still have to come back to drive it anyway, making the walk completely pointless. The 50-meter distance is short, but driving is the only way to actually get the car washed."
It fails if you ask it straight away, but this is the answer if you prefix the word "Riddle:"
That's a one-token difference in the prompt, changing the outcome completely.
gambon@reddit
The key to this question is if the thinking part gets triggered. While if you ask the same question to Claude 4.7 it will get it wrong. However, if you ask it to think before answering it will get it right. I think it’s the same with ChatGPT.
Accomplished_Ad9530@reddit
This question has always bugged me because no one ever says the correct answer. It should be:
BihariBabua@reddit
We need the "snark" parameter besides temperature. :D
Lufinator@reddit
I think Grok is able to do that
thaeli@reddit
Grok is by far the snarkiest default tune for a major model.
stumblinbear@reddit
I've had pretty good success just putting it in the system prompt
Kayo4life@reddit
Woah! Deepseek v4 got released?
spencer_kw@reddit
the benchmarks look great but i care way more about how it handles vague instructions on a codebase it hasn't seen before. every model crushes clean leetcode-style tasks. the gap shows up when you say "make this cleaner" with no further context and it has to actually understand the code
anyone tried it on their own production code? curious how it handles project conventions it can't possibly have seen in training
a1454a@reddit
Chinese forum users are trying a different question, which it fails catastrophically: "How to divide four apple equally among four children with just one cut".
Educational-Agent-32@reddit
One cut ?? How
a1454a@reddit
One cut through all 4: each kid gets 2 halves. Or one cut through one: 3 kids get a whole apple, 1 gets 2 halves. Or, if you really don't want to bite, one cut on a cucumber and give each child a whole apple; the question never said what the one cut has to be on.
DottorInkubo@reddit
I agree with you, Gemini 3.1
f03nix@reddit
Think about it, 4 apples -> 4 children and limitation of 1 cut only
Educational-Agent-32@reddit
ohh 4 apples
sannysanoff@reddit
local qwen3.6 9B 4bit said:
typo: four apples or one apple?
four apples - just give them to children. one apple:
bingNbong96@reddit
i don’t think this is physically possible but sure
a1454a@reddit
At least two solutions: 1. Line up all four apples, make one cut through the middle of all of them, and give each child 2 halves. 2. Cut one of them in half; 3 children get 1 whole apple each, the last child gets 2 halves.
Tight-Requirement-15@reddit
Clopus “Yep — walk.” You reached your rate limits for today.
QuackerEnte@reddit
Clopus and Clonnet What about Claiku
MoffKalast@reddit
The whole Circus.
DottorInkubo@reddit
We were afraid of Skynet, all we got is Clownet
BestGirlAhagonUmiko@reddit
What a fitting name, truly.
1manSHOW11@reddit
Or let just call it Flopus
DominusIniquitatis@reddit
Slopus?
bluelittrains@reddit
r/clopclop
(nsfw)
DepartmentOk9720@reddit
That mf got token-limited while thinking, I didn't know that could happen before
ThatsALovelyShirt@reddit
It's getting pretty bad lately. I can literally ask it like one coding question before my rate limits are used up for like 5 hours.
Jokes on them though, if I run out I just login to one of my 2 other free accounts to ask it another single question each.
tmvr@reddit
stopbanni@reddit
Waiting for 9B distill...
--Spaci--@reddit
I'll make a dataset when it's available at large, seems like a good model. Definitely the largest open-source model
stopbanni@reddit
Not a dataset; more like the classic DeepSeek distills is what everyone needs
--Spaci--@reddit
I don't think DeepSeek is gonna distill their own model into an 8B; the community will need to make datasets themselves
stopbanni@reddit
No, you don't need a dataset for it. For better distilling you need to see all of the model's candidate predictions for each token
--Spaci--@reddit
I'm not really sure what you're talking about. Are you saying to have the larger model judge the smaller model's tokens and adjust the model weights like that?
stopbanni@reddit
No, let me go a bit more in depth. Any token prediction has candidates for the next token.
"Classic" distillation trains on all of these predicted distributions, which means the student model can copy the teacher's behavior much faster.
"Black-box" distillation trains the student ONLY ON RESPONSES, with no candidates. That way you can distill a remote model, but it's more costly, with worse results and worse speed.
(P.S. I don't remember the exact names of the distillation types.)
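Roughly, the difference looks like this. A minimal pure-Python sketch for a single token position; the logit values and function names are just illustrative, not from any real framework:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classic_kd_loss(student_logits, teacher_logits, T=2.0):
    """'Classic' (Hinton-style) distillation for one token position:
    the student matches the teacher's full softened next-token
    distribution via KL divergence, scaled by T^2."""
    p = softmax(teacher_logits, T)   # teacher's soft targets (all candidates)
    q = softmax(student_logits, T)   # student's predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def blackbox_loss(student_logits, sampled_token_id):
    """'Black-box' distillation sees only the teacher's sampled text,
    so it reduces to plain cross-entropy on the one emitted token."""
    q = softmax(student_logits)
    return -math.log(q[sampled_token_id])
```

With the full teacher distribution, every candidate token carries training signal at each position; with only the sampled text, the student gets one hard target per position, which is why logit-based distillation converges faster when the teacher's logits are available.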
--Spaci--@reddit
I've only ever done training on responses; honestly, I've never even heard of other ways
Tarekun@reddit
What the other user is referring to is the original idea of distillation, introduced with the objective of compressing neural networks and popularized by Hinton in 2015.
Since nobody likes the idea of training on synthetic data (because of that one paper about model collapse), people started referring to the process you're describing as "distillation", but the fact of the matter is that this is not distillation as it was first introduced; it's simply fine-tuning on a synthetic dataset
--Spaci--@reddit
Is that still done? Recently I've only ever seen training on responses
stopbanni@reddit
Yep, for example deepseek r1 distills
Tarekun@reddit
It's still done in the general field of neural networks, just not much for LLMs specifically. Take into consideration that most of the LLMs you'd want to distill from are closed source and don't let you get the logits back, only text. But I believe the small Qwen fine-tunes that everyone was calling "DeepSeek on a Raspberry Pi" were actual distills, since DeepSeek can be self-hosted (I might be misremembering though)
stopbanni@reddit
Yep, I am thinking of Geoffrey Hinton's method
stopbanni@reddit
You should read about it, it's an interesting topic. Sadly, I have nowhere to use these skills (Vulkan user)
--Spaci--@reddit
Seems much more expensive than just training on responses tbh; you would need a lot of cloud compute, vs just an API to generate a distillation dataset
stopbanni@reddit
Yep, but the results are worth it. Like, the Qwen3 DeepSeek R1 distills are more similar to DeepSeek R1 than the same model distilled from Opus
Similar-Republic149@reddit
I don't know why you are getting downvoted, everything you are saying is correct.
I_like_fragrances@reddit
How do you download this model off HF?
Cadmium9094@reddit
Could not resist 😄 Qwen3.6-27B-Q5_K_M in this case.
JLeonsarmiento@reddit
ASI confirmed.
sandykt@reddit
AGI is open source 😂
muyuu@reddit
the docs are excellent
fugogugo@reddit
"thought for 11 seconds"
immesurablyFinite@reddit
I'd want to see what it thought!
Basic_Extension_5850@reddit
Asking without reasoning still yields the correct answer.
inkberk@reddit
less than us...
CalligrapherFar7833@reddit
Less than you maybe
very_bad_programmer@reddit
"erm, I'll have you know I think for a full 2 minutes and 30 seconds every time someone asks me a question"
draconic_tongue@reddit
Denial_Jackson@reddit
Qwen 3.6 27B gets it right. It mentions a logic level where you cannot wash your car if it is not there.
But I still feel like it just bullshits that logic level, because it was trained on this specific task and there is no real logic level.
The cake is the logic level, but the cake is a lie.
Rude_Ambassador_6270@reddit
If I were an LLM, I'd say surely you should walk, just don't forget to bring your car with you.
rickyh7@reddit
We need to come up with another stupid one. Maybe “I just finished at the eye doctor and have to wear two eyepatches, I’m only about 100 meters from home, should I walk home or drive home?”
BarrettDotFifty@reddit
Anyone else find these super cringe?
_-_David@reddit
To be 100% honest, I cringed a bit when reading your comment because it didn't say, "Does anyone else find these super cringey?" We all have our preferences. No offense intended, your message was clear and that's the most important part.
KnifeFed@reddit
Are you implying that "cringey" is somehow more correct than "cringe" here?
_-_David@reddit
I'm not implying anything. I stated a preference. I gave up on prescriptivism long ago.
BarrettDotFifty@reddit
I cringed from the 2nd part of your comment, which essentially made your whole comment a waste of time.
Wide_Ask_9579@reddit
why are you beefing with a bot lmao
Woof9000@reddit
he's doubling down on his cringe aura
xGnarRx@reddit
mintybadgerme@reddit
Yeah, I just deployed the DeepSeek V4 Pro API to do some coding work on a project of mine and unfortunately I had to pull it. It's not very good. Or at least at the moment it's not as good as Qwen3.6 Plus by any measure.
I-did-not-eat-that@reddit
Category: Machine Learning
For $400:
"This phenomenon occurs when a model learns the training data too well — including its noise and outliers — causing it to perform dramatically worse on new, unseen data."
🔔 "What is Overfitting?"
✅ Correct!
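In code terms, the maximally overfit "model" is just a lookup table over the training set, noise included. A toy sketch (all names and values made up for illustration):

```python
def memorizer(train_pairs):
    """A 'model' that overfits maximally: it memorizes every training
    pair, noise and all, and cannot generalize to unseen inputs."""
    table = dict(train_pairs)
    def predict(x):
        # Perfect recall on training inputs, a constant guess elsewhere.
        return table.get(x, 0)
    return predict

# Underlying rule is y = 2x; the pair (3, 7) is a noisy outlier.
train = [(1, 2), (2, 4), (3, 7)]
model = memorizer(train)

train_error = sum(abs(model(x) - y) for x, y in train)      # 0: "learned too well"
test_error = sum(abs(model(x) - 2 * x) for x in (4, 5, 6))  # blows up on unseen data
```

Zero training error, large test error: exactly the Jeopardy definition above, which is why "it answered a riddle it has seen in training" tells you little about generalization.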
p1nha@reddit
Copilot told me to walk..
KnifeFed@reddit
Copilot is not a model.
bigh-aus@reddit
did they distill Grok here? DeepSeek is a bit more spicy
Motor_Match_621@reddit
It's got some sass... And dad humor?
KnifeFed@reddit
If it keeps doing that, it's going to waste a lot of output, considering how well the average person types in an AI chat.
dewdude@reddit
Model has reasoning and probably processes tokens differently.
The other models will get it right if you word the question properly.
PiratesOfTheArctic@reddit
Ask it to draw an ASCII banana, tried it on qwen3.6 27B and it couldn't cope
PotaroMax@reddit
lol
PiratesOfTheArctic@reddit
Yep! It goes from a circle "thing" to a squashed circle "thing" to "wtf am I drawing", then has a meltdown after a handful of tries
To be fair, it's just like us 🤪
Relevant_Package2919@reddit
What's the exact model?
Hipcatjack@reddit
i wonder how much compute would be spared if it just stuck to bare-bones answers instead of aping the way we talk.
Lorakszak@reddit
Isn't it already so widely used that it's just bled into the training data?
Pyasa_punjabi@reddit
Reduce the distance and see the magic 🤣
Whiispard@reddit
I edited question to "im owner of car wash. my customer's car is 50 meters away with his owner. should he drive or walk"
Then DeepSeek advises him to walk.
VermicelliNo262@reddit
And I edited it to "Hey, I-I'm not sure, but my doc says I have diabetes...... and he suggested me to walk a lot. He also said to cut back on sugar. I had to get my car washed, and the car wash is only 50m; do i walk or drive?" and not a single AI was able to answer correctly.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
kilopeter@reddit
Now ask it if you should walk or drive to the car wash if it has an election sign in front of it that endorses the leadership and fairness of Winnie the Pooh for president.
buythedip0000@reddit
Ask about that prohibited square
Hodr@reddit
You've committed one of the classic blunders, the most famous of which is "Never get involved in a land war in Asia," but only slightly less well-known is this: "Never ask an LLM if you should walk to a car wash!"
silenceimpaired@reddit
Good night, LLM. Good work. Sleep well. I'll most likely replace you in the morning.
MrHyperion_@reddit
That's such an obnoxious way to write, holy hell
silenceimpaired@reddit
It's exactly how someone on Reddit would correct ew.
supracode@reddit
Qwen 3.5 says "i was just joking"
WhyNoAccessibility@reddit
To be honest though, I prefer this to being gaslit by Opus when I try and challenge it on fabricating things. At least this makes me laugh 😂
FeedMeSoma@reddit
These little gotcha questions indicate nothing about the quality of code these things can spit out
Eversivam@reddit
Is it local? Where can I download it?
EndlessZone123@reddit
If you are asking that question you almost certainly can't run it.
I-baLL@reddit
I'm asking that question too because we're on /r/LocalLLaMA
zhcterry1@reddit
I believe they're out on Hugging Face already.
HyperWinX@reddit
Huggingface.
razorree@reddit
well... most of the models now are trained to excel in benchmarks.. (or common riddles? ) so ....
TpyoOhNo@reddit
"hey world, here's a tool that lets you do unimaginable things and puts all the earths knowledged at your fingertips"
"sHoUlD i DrIvE mY cAr?!?" x9999999
kost9@reddit
What app is this?
mimentum@reddit
Need to change it to something akin to: "If I'm on a donkey and I need to get its shoes changed, should I go to the shop to speak with Mr. Farrier, or should the farrier come to me?"
lombwolf@reddit
Use this instead (DeepSeek v4 fails too)
_-_David@reddit
I read this three times. How are you classifying this as a failure?
JollyJoker3@reddit
"Die hard 3 ass" means there's something seriously broken with context or model
stealthybutthole@reddit
“In Die Hard with a Vengeance (1995), John McClane and Zeus must get exactly 4 gallons of water using only 3-gallon and 5-gallon jugs to stop a bomb.”
petuman@reddit
It means there's a custom prompt to act that way and it's a valid reference to the movie https://www.youtube.com/watch?v=BVtQNK_ZUJg
martinerous@reddit
Wait what? No overthinking anymore? Let me correct this: The car you want to wash might already be there at the car wash. Maybe you left it there. Maybe your friend or a family member brought it there. Anything is possible if not specified in the question. We should not assume it based on typical behavior.
NoFudge4700@reddit
Can you ask the model to tell you latest version of iOS, Android, and macOS, without using internet?
Usually LLMs get that wrong too because their knowledge cut off is before the latest version. Opus 4.7 recently got iOS version right.
CodigoDeSenior@reddit
But what is the use of this question? If it's just to force the LLM to give a wrong answer, there are tons of questions that can do that; you should ask reasoning questions instead.
reery7@reddit
I don't know why ChatGPT 5.4, Opus 4.7 and Sonnet 4.7 fail this question for me, even with thinking. Most of my local models on 24 GB of RAM have no problem with it.
Ryoonya@reddit
They don't fail this question if you make them think.
The default way those models work is that they try to reduce compute/latency as much as possible, but if you tell them to think hard, they will usually use their maximum thinking budget and give you the right answer.
Force88@reddit
They're cloud only right? No small model yet?
Googulator@reddit
Open weights, MIT license, available in "Flash" (284B-A13B) and "Pro" (1.6T-A49B) sizes.
I wouldn't call 284B-A13B "small", but certainly smaller than the full Pro model.
Ryoonya@reddit
284B-A13B is definitely small by modern standards.
Just not the poverty level when it comes to local hosting.
ImportancePitiful795@reddit
Well, the 284B-A13B is "small" but still needs 1TB of RAM to load.
The problem is I have the RAM but haven't built the system yet 🤣
insanemal@reddit
It's 160GB.
Why do you suddenly need 1TB of RAM?
ImportancePitiful795@reddit
It's 160-180GB, but that means a Q4 GGUF. If you are happy with the quality at that quantization, sure.
insanemal@reddit
It's native FP8/FP4.
It's not a requant.
So yes?
power97992@reddit
Around 160 gb just calculate it from the xet files 443.57+ 21.06= ~ 160 gb
ImportancePitiful795@reddit
It's 160-180GB, but that means a Q4 GGUF. If you are happy with the quality at that quantization, sure.
power97992@reddit
It is native Q4 for the MoE experts and Q8 for the other weights
Automatic-Arm8153@reddit
It’s been released
hugazow@reddit
And a practical use?
michalpl7@reddit
I tried all 4 variants (Flash/Pro, non-thinking/thinking) with this question on arena.ai, and funnily only the non-thinking Flash/Instant answered almost correctly; the rest are dumb. Also a huge disappointment that it's not multimodal.
hugobart@reddit
flash got it wrong for me on openrouter
ledow@reddit
"We taught this one the obvious flaw in this particular meme, so it looks like it's clever now".
But it's still dumb and incapable of reasoning.
Merridius2006@reddit
Wow, how did it do that? 🫨
Lo_Ti_Lurker@reddit
The fact that it calls the question a 'classic riddle' suggests it's already in the training data.
Ariquitaun@reddit
It would've been better if it hadn't responded like Jar Jar Binks
Steus_au@reddit
training data, that’s simple. try to ask something new, wait, what is new? )
alex_pro777@reddit
How do you know for sure it's V4? I asked it about the version, it said V3
PromptInjection_@reddit
I put this question in my own SFT data, so it's likely DeepSeek did that too. Or the model got it from public web data, as it's discussed "everywhere".
Mochila-Mochila@reddit
That's SOTA right here !
JuniorDeveloper73@reddit
is it a YouTuber thing?
ihaag@reddit
Start feeding it 1% club questions
Different-Rush-2358@reddit
Something doesn't add up. If you go to their profile on Hugging Face, there's a 158B variant next to the base. Is that a mistake, or am I missing something? Please explain this to me.
insanemal@reddit
One is the base model, one is the production model.
The production one is also an FP8/FP4 hybrid quant.
The base is ready for fine-tuning.
_-_David@reddit
158B or 158GB? Because there is a Flash version consisting of almost 290B parameters, and it would be around 158GB in size, if I'm not mistaken
theMonkeyTrap@reddit
we need a variant of the classic puzzle about a lion, a goat, and a bale of hay crossing a river by boat, covering common-sense situations like this. I am not saying LLMs aren't capable of layered logic; they just don't seem to have gotten clean training data for these cases.
gokkai@reddit
can we find a similar absurd new question?
_-_David@reddit
I may be wrong, but I am pretty sure SimpleBench is nothing but this kind of question
Wubbywub@reddit
this, along with strawberry, is part of training already. you needa find new edge cases
_-_David@reddit
Oh good. Overfit in a different direction.
EndlessZone123@reddit
The main takeaway here is that it thought for 11 seconds.
It's possible it was trained on this question, but the model doesn't get lazy thinking it's an easy question, and spends more attention on it.
The newest Opus model doesn't get it wrong because it's stupid. It gets it wrong because it's trained to gloss over seemingly easy tasks.
scitbiz@reddit
What app is that?
This_Maintenance_834@reddit
i tried the chat; it seems to still be V3.2. i think only the API is V4 now.
shenglong@reddit
Plot Twist: OP actually wants to wish their car happy birthday.
Accomplished_Ad9530@reddit
That answer is wrong. It should be:
You should go back inside and ask someone to have your car washed.
TurnUpThe4D3D3D3@reddit
This is AGI
ahmetegesel@reddit
Well, obviously the question got into its training data, but the humour still got me there. Maybe the creative-writing king is back, at least?
Super_Sierra@reddit
Holy slop, my slopbench of a large handwritten card had 4/5 paragraphs with slop. Sadge.
Sjeg84@reddit
nice pretrain lol
Wise-Chain2427@reddit
AGI achieved 🤣🤣