Testing World Knowledge; and What Reasoning Does To It (regarding airliners, specifically)
Posted by airbus_a360_when@reddit | LocalLLaMA | 23 comments

More info in top comment.
cms2307@reddit
I always thought OpenAI had something special when it comes to CoT, but I guess this proves it.
BoJackHorseman42@reddit
You guys can't keep excluding my boy GLM-4.5 like that
Massive-Shift6641@reddit
This performance drop sucks. It seems like hybrid reasoning architectures have some inherent problems. Or maybe some models just OVERTHINK things. You know, on the Brokk benchmark, GPT-5 mini was better at many tasks than GPT-5 High, because GPT-5 High was prone to overthinking.
I don't really know. Just hope that the next Deepseek is going to be the GOAT.
pigeon57434@reddit
It must be overthinking, not the hybrid architecture, that's the issue, because Qwen 3 also drops even though they've now separated thinking and non-thinking. But it seems OpenAI still has some secret RL sauce or something, since their thinking models always do so well even though the base model is pretty trash.
bymihaj@reddit
GPT-5 thinking - zero hallucinations!
airbus_a360_when@reddit (OP)
Well, it depends on what you call "hallucinations". If you count anything the LLM gets wrong as a hallucination, then the ~5 airliner families it missed each time would also be hallucinations, since the prompt specifically asked for the entire list. Most people probably wouldn't consider that a hallucination, though. I'm not sure what hallucination means anymore.
Mkengine@reddit
That's definitely up for debate, but for me a hallucination is fabricating or altering information, while leaving information out is an error of omission relative to the ground truth.
erazortt@reddit
What about MRJ?
airbus_a360_when@reddit (OP)
The prompt specifies it has to have been operated by an airline.
StyMaar@reddit
The difference in results between GPT-5 and Deepseek or Qwen when it comes to thinking is really interesting.
Maybe OpenAI has some secret sauce left.
Kolkoris@reddit
I'm sure they built a giant RAG system and now use it for all models.
airbus_a360_when@reddit (OP)
Testing details and interesting observations
The prompt asked the models to list every airliner family seating over 20 passengers that has been operated by at least one airline after the year 2000.
All models were tested with web search turned off for fairness. For GPT-5 and o3, the latest ChatGPT version was used. For Gemini, Google AI Studio was used.
Graph looks squashed because Kimi K2 was added last-minute.
Notably, thinking does not significantly increase performance in the two open-weight models (in fact, it somehow seems to decrease it), unlike GPT-5, which benefits greatly from thinking. Even in the two open-weight models, though, the number of incorrect answers does seem to decrease when thinking is enabled.
I was under the impression that Qwen was the one without much world knowledge and that Deepseek knew more, but it seems Deepseek suffers from the same issue as well.
Surprisingly, GPT-5 Instant (along with Qwen 3 without reasoning) had the highest rate of incorrect answers. From experience, I've also felt that GPT-5 gets a lot of things just plain wrong when it doesn't use reasoning. Since GPT-5 is the model the vast majority of casual AI chatbot users around the world rely on for information, this seems somewhat problematic.
I think the most notable thing isn't that the open-weight models are bad at this, but rather that the open-weight models just don't see that much improvement when thinking is enabled. GPT-5 also does pretty badly when thinking is disabled, but the performance shoots up when you allow it to think.
GPT-5 Thinking was the only model to mention the An-38 and C-212.
Qwen, whether thinking was enabled or disabled, always put the MA700 (an in-development airliner family not yet in service) in its list but always left out the MA60/600 (currently in production and in service). The implications of this are vague and confusing.
The MC-21, an airliner family in production but not yet in service, was kind of a gray zone. The majority of responses actually did mention it, but I did not include it in the list of 39 because the prompt specifically asks for airliners operated by at least one airline. This means it was counted as an incorrect answer.
The following are the 39 airliner families that were tested for. I am fairly sure I did not miss any. What constitutes a "family" is arbitrary; this is just the way I classified them, and most of the time it matched how the LLM classified them. The results were not calculated naïvely from the number of families the model listed; instead, they are the number of airliner families from this list that the response checks off. For example, if it listed the E-jets and E-jets E2 as one family but still specified that it included the E2 family, that would be counted as two. (A rough sketch of this scoring approach follows the list below.)
A220, A300, A320, A330, A340, A350, A380, 717, 737, 747, 757, 767, 777, 787, MD-90, MD-11, CRJ, E-jet, E-jet E2, ERJ, EMB-120, Dornier 328, Avro RJ, ATR 42/72, Dash 8, C-212, CN-235, An-148/158, An-140, An-38, C909, C919, MA60/600, SSJ100, Il-96, Il-114, Tu-154, Tu-204/214, Yak-42
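A minimal sketch of how this check-off scoring could be automated, assuming a hand-maintained alias table (the aliases shown are illustrative, and only a few of the 39 entries are filled in):

```python
# Sketch of the family check-off scoring described above.
# FAMILIES maps each canonical family to alias strings; only a few
# of the 39 entries are shown here, and the aliases are illustrative.
FAMILIES = {
    "A220": ["a220", "cseries", "c series", "cs300"],
    "737": ["737"],
    "E-jet E2": ["e-jet e2", "e190-e2", "e195-e2"],
    "MA60/600": ["ma60", "ma600"],
    # ... remaining families omitted
}

def normalize(text: str) -> str:
    """Lowercase and replace punctuation with spaces for forgiving matching."""
    return "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())

def score(response: str) -> tuple[int, list[str]]:
    """Count how many canonical families the response checks off.

    Substring matching like this is crude (short aliases can
    false-positive), so the output still deserves a manual once-over.
    """
    norm = normalize(response)
    hits = [family for family, aliases in FAMILIES.items()
            if any(normalize(alias) in norm for alias in aliases)]
    return len(hits), hits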
kaisurniwurer@reddit
Can you even turn off search for GPT-5 Thinking?
airbus_a360_when@reddit (OP)
Yes, you can do it in the "Personalize ChatGPT" advanced settings.
kaisurniwurer@reddit
Did all of them have access to web search?
Awwtifishal@reddit
I would love to see how GLM-4.5 and GLM-4.5-Air compare.
FullstackSensei@reddit
May I ask what the purpose of this test is? Why do we even need to test LLMs on random "world knowledge" factoids? The same model can (and most probably will) miss some detail about any fact, because these are probabilistic systems.
The same task could be achieved at much lower cost, with higher accuracy, and far more deterministically using a small model plus some sort of search.
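A rough sketch of that kind of setup, using Wikipedia's public search API plus any OpenAI-compatible endpoint (the endpoint URL and model name below are placeholders, not a recommendation):

```python
import requests
from openai import OpenAI  # any OpenAI-compatible client/server pair works

def wiki_search(query: str, n: int = 5) -> str:
    """Fetch top search snippets from Wikipedia's public search API."""
    r = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search", "srsearch": query,
                "srlimit": n, "format": "json"},
        timeout=10,
    )
    hits = r.json()["query"]["search"]
    # Snippets come back with HTML highlight markup; fine for a sketch.
    return "\n".join(f"{h['title']}: {h['snippet']}" for h in hits)

# Placeholder endpoint and model name: point these at whatever small
# local model you actually run.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
context = wiki_search("airliner families in airline service since 2000")
reply = client.chat.completions.create(
    model="small-local-model",
    messages=[{"role": "user", "content":
               f"Using only this context:\n{context}\n\n"
               "List the airliner families mentioned."}],
)
print(reply.choices[0].message.content)
```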
airbus_a360_when@reddit (OP)
The purpose is to test the model's world knowledge, which is something many common benchmarks also measure, directly or indirectly.
FullstackSensei@reddit
But things like "over 20 passengers" and "after the year 2000" are not things LLMs are good at. LLMs have no real understanding of numbers (e.g., 9.11 vs 9.9), so of course the results will be a mess.
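You can see part of the problem with any tokenizer, e.g. via the tiktoken package (splits vary by tokenizer, but the point stands):

```python
import tiktoken

# Numbers reach the model as token chunks, not numeric values, which is
# one reason "9.11 vs 9.9" style comparisons go wrong.
enc = tiktoken.get_encoding("cl100k_base")
for s in ("9.11", "9.9"):
    print(s, "->", [enc.decode([t]) for t in enc.encode(s)])
```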
Repeat your test a few times and you'll inevitably get different results. You can't trust or verify the result without doing the actual research yourself, at which point: why bother asking the LLM?
StyMaar@reddit
LLMs aren't that bad when dealing with integers though, especially small or common ones like these.
airbus_a360_when@reddit (OP)
You're right, these are things LLMs are not good at. You can see from the results that most models only managed to get slightly over half. But I do have the results, and I think they're interesting. I think this sub is a place for posting interesting things, not just things that are directly useful for your job or for the economy.
Also, if you read my post above, I did intentionally repeat the test a few times. The results differed every time, but they were fairly consistent for the same setup.
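If anyone wants to measure that spread themselves, it's easy to automate; here's a sketch reusing the score() idea from my testing-details comment above (ask_model is a placeholder for however you query the model):

```python
import statistics

def consistency(ask_model, prompt: str, runs: int = 5) -> tuple[float, float]:
    """Run the same prompt several times and summarize the score spread.

    ask_model is a placeholder: any callable taking a prompt string and
    returning the model's text response.
    """
    results = [score(ask_model(prompt))[0] for _ in range(runs)]
    return statistics.mean(results), statistics.stdev(results)
```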
Massive-Shift6641@reddit
You've put your finger on the issue: most benchmarks don't distinguish between the different abilities of LLMs and mix everything together in the same tasks, which measures general intelligence indirectly instead of the specific abilities some LLMs were designed to excel at. Lol.
holchansg@reddit
Uuuh, loved it... Now do it for, idk, every Intel and AMD x86 consumer processor.
edit: lol, your username, lol lololol 😂