AI-written Code Banned from Codeforces: What's Changing?
Posted by ImpressiveContest283@reddit | programming | View on Reddit | 123 comments
zazzersmel@reddit
the data they collect to sell to AI companies is much less valuable if it's AI generated
great_waldini@reddit
I’m not sure how true that is anymore when these models are better than most humans at most things. High quality and clean synthetic data may soon (or now) be worth more than anything human generated.
realityChemist@reddit
This is a slight aside, but that's quite a low bar. Keep in mind that most people are shockingly bad at most things. A uniformly randomly selected individual most likely could not write hello world without guidance, and probably couldn't give the formulas for more than olivine and one or two feldspars (and quartz, of course). Hell, you'd have a decent shot that whoever you selected couldn't explain how to cook dried beans.
I think it's much more relevant to compare capabilities against subject matter experts, or at least against humans who have some familiarity with the subject at hand. In those comparisons LLMs come up short against many humans. I encourage everyone to interrogate ChatGPT or any of its relatives on a subject you know well. It's likely to get basic definition questions right, but ask it about anything you've learned from experience or that's not easy to google and see how it does.
To bring it back around to the original point: you cannot then train a new model on the faulty output of this model, lest the progeny LLMs inherit a lot of misunderstandings about the world and strange behavioral quirks. There's actually a catchy name for what happens if you do this: Model Autophagy Disorder (MAD).
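For intuition, here's a toy sketch of that failure mode: a single Gaussian standing in for a language model, "trained" each generation only on the previous generation's samples. This is purely illustrative (all numbers invented), not anyone's actual training pipeline.

```python
import numpy as np

# Toy model-collapse loop: fit a model (here, just a Gaussian) to a corpus,
# sample the next corpus entirely from the fitted model, and repeat with no
# fresh human data mixed back in.
rng = np.random.default_rng(0)
corpus = rng.normal(loc=0.0, scale=1.0, size=20)   # generation 0: "human" data

for gen in range(1, 201):
    mu, sigma = corpus.mean(), corpus.std()        # "train" on the current corpus
    corpus = rng.normal(mu, sigma, size=20)        # next corpus: fully synthetic
    if gen % 50 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f}  sigma={sigma:.3f}")

# sigma decays toward zero across generations: every finite sample
# underrepresents the tails, the refit model never gets them back, and
# diversity ratchets away. That shrinkage is the toy analogue of MAD.
```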
great_waldini@reddit
This actually is not an “aside”, this is precisely the point I was alluding to.
I agree that subject matter experts are still better 10/10 times. However, the amount of suitable data of this variety is nowhere near enough to train an LLM. So for all practical purposes, training is still largely dependent on corpuses of human generated text, which, as you implied above, is mostly garbage.
Garbage in will always equal garbage out.
Yes. And if it were a question of human internet corpus vs GPT 3.5 generated synthetic data, I would expect the human generated text to produce a working LLM while the relatively homogenous 3.5 corpus would cause collapse (or otherwise dogshit output due to MAD).
However, my comment was about the outputs of SOTA models like o1, not that you’d want to exclusively train on that either.
My comment was only ever meant to say that I suspect high quality synthetic data (like o1 generated) is probably worth more on a per token basis than the vast oceans of human generated corpus junk.
realityChemist@reddit
That's a fair point. I don't have the subject matter expertise to know whether o1 data is actually qualitatively better in a way that would let models trained on it avoid MAD, but I'll give you the benefit of the doubt and assume that you do.
My expectation would be that you still need humans in the loop as a reference back to reality to avoid long-term issues (which it sounds like you think too). But it may well be that carefully curated human inputs, augmented by state of the art models, can do better than the huge bolus of whatever (mostly junk) human data we currently use.
great_waldini@reddit
I don't have immense subject matter expertise either, to be clear - probably no deeper than yours. I'm not a salaried engineer working on SOTA, just a programmer who's taken an interest in NNs since roughly the convolutional / ResNet paradigm.
That said I don't think we have to make any questionable assumptions to look at the o1 output and make a confident guess that it's extremely high quality training data for other LLMs relative to other training data available.
Not only does it perform at PhD level on many scientific analysis / reasoning tasks (or so OpenAI claims - that has yet to harden into consensus), but it literally explains the Chain-of-Thought process it used to arrive at the conclusions it gives.
To the extent that "reasoning" (or even actual reasoning) can emerge as a property of digital NNs, I don't think it's unreasonable to expect that o1 outputs (consisting of high quality answers AND their supporting CoT) are going to be very valuable training data for eliciting similar emergent reasoning in future models. Especially compared to lower quality (/less accurate) responses from 4o or humans, which do *not* generally include thorough CoT supporting the content of the text.
Yeah absolutely I agree. I'm in no way suggesting that synthetic data is completely replacing existing training corpuses - broad variety is still essential and the human corpuses remain extremely relevant. Especially because the end LLMs need to be able to infer the "real question" from someone's off-the-cuff, poorly-worded, poorly-formatted query, the model also needs to be fluent in "shitty text" too.
So I think you said it right in that synthetic data is for augmenting, not replacing. Right now, o1 is essentially just a lang-chain (an oversimplification obviously, but in principle it's multiple queries connected in a structured way). That's expensive - it's a lot of compute to run 10 inference passes for a single query.
But because you're recovering and storing the CoT behind each procedural answer, and can give that all as plain text in training the next LLM after o1, I would expect the reasoning property to improve in the sense that the next model may require fewer inference passes to reach the same quality conclusion. If that makes sense. Like maybe each step can be less atomic, and more like combining what is currently multiple steps.
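As a rough sketch of that structure (every name below is an invented placeholder, not OpenAI's actual API or architecture):

```python
# Hypothetical multi-pass inference in the spirit described above: several
# structured model calls per question, with the intermediate chain-of-thought
# retained so it can later be reused as training text.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real model call")

def answer_with_cot(question: str, n_steps: int = 4) -> tuple[str, list[str]]:
    chain: list[str] = []
    for _ in range(n_steps):
        so_far = "\n".join(chain)
        # Each pass sees the question plus all reasoning produced so far.
        chain.append(query_llm(
            f"Question: {question}\nReasoning so far:\n{so_far}\nNext step:"))
    reasoning = "\n".join(chain)
    answer = query_llm(
        f"Question: {question}\nReasoning:\n{reasoning}\nFinal answer:")
    return answer, chain   # the answer AND its CoT both become training text
```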
And finally, of course it's possible (likely, even) that seeing an immense number of CoT examples leading to high quality answers augments whatever emergent reasoning is already taking place in single-pass inference on models like 4o.
Anyways, sorry for the long-ass text, got carried away. TLDR: yeah, I don't think we're disagreeing lol.
Tail_Nom@reddit
Logic-incest compounds errors. It's like everyone forgets what happens when you cyclically plug generated output back into the input. Blind leading the blind, audio feedback loop, xerox of a xerox, et cetera.
Besides, "better" doesn't mean what you think it means. AI generated content is better at seeming effective, and producing output that seems right. The smooth-brain idea is that you can make something that seems right enough that it must be considered correct, which falls apart when you try to do something new, complex, or creative.
AI can generate boilerplate junk, the kind of thing we already know how to do but just don't want to do manually because it's tedious. And it is a very impressive method for doing that. When we stop jerking each other off over our retro-future fantasies and corpo-dystopian fever dreams, we might actually be able to build useful tools with it.
great_waldini@reddit
The vast majority of human generated text is subject to this same regurgitation phenomenon. The share of organically sourced text that was generated by a human thinking critically and rigorously is vanishingly small.
Still though, excessive homogeneity in any data source will of course lead to model collapse.
When I said “better” I was referring to Claude, 4o, and now especially o1. Yes models can and do bullshit, and so do humans all the time.
Now, if you pose a truly novel coding problem to both o1 and the wisdom of the crowd that is StackExchange, I expect that the humans would do better in ultimately arriving at a valid solution. And a highly competent programmer (as one example use case) is still unquestionably better and more reliable than even o1 - of this I have little doubt.
However, the scope of the conversation is training data. How often does a highly competent human programmer lay out the exact thought process they used to arrive at their elegant solution? Rarely if ever. That makes it an inconvenient data source, because the supply is heavily constrained. LLMs rely on large quantities of examples, and you want to maximize quality across those examples.
So if we're comparing a Stack Exchange corpus plus a Reddit programming question corpus plus GitHub issue discussions against a giant corpus of coding problems solved by o1, including the CoT steps... all I'm saying is that it's not clear to me the human generated text is better training data at this point.
A year ago, I’d say the human text was obviously higher quality (again, as a training data source).
josluivivgar@reddit
no, like there's an actual issue with feeding AI data to AI, it's a known problem that quickly makes LLMs basically break
https://www.scientificamerican.com/article/ai-generated-data-can-poison-future-ai-models/
there are other articles but I just grabbed the first one that wasn't paywalled.
point is, AI starts hallucinating harder and spewing garbage when fed AI data
great_waldini@reddit
Yes, I understand the problems with using synthetic data that is too homogenous, in addition to the ever present garbage-in-garbage-out dynamic.
I’m not suggesting any serious organization is training strictly on synthetic data either.
I'm merely suggesting that on a per token or per example basis, outputs of 4o or especially o1 are likely more useful training data than human generated answers to the same question, which tend to be less structured and less complete (i.e., they assume prerequisite knowledge or conceptual familiarity, etc.)
Specialist_Brain841@reddit
model collapse
Codex_Dev@reddit
For coding it still has a long way to go. Many times it will claim something works when the code it provides has obvious errors that break it.
Right now it can get you about 70% of the way there when creating code, but you have to treat all of it with suspicion. It still saves a lot of time and boosts productivity, so I don't think it's going anywhere.
great_waldini@reddit
I’m not saying LLMs are now better 0-shot coders than a competent human being. I was only referring to training data.
csmithku2013@reddit
Also, this won't stop AI generated code from being used. As far as I'm aware, all AI detection tools have been shown to be snake oil thus far.
anengineerandacat@reddit
TBH not sure how they could detect it... AI generated code, once reviewed and polished up, looks like regular ole code, and if it's simple work (i.e. scaffolding) it's pretty much indistinguishable from someone else's.
It's not leaving behind any fingerprints or anything; it's simply a guessing game. The best you can do is check for plagiarism, but in the programming sphere, where languages have naming conventions and style guides, that's going to be a pretty big challenge.
Imagine code to like... scaffold a Spring Boot app... it's all going to look the same by design.
_BreakingGood_@reddit
Luckily this is kind of a self-solving problem.
Detecting AI generated output is only relevant for the next, like, 3-5 years, after which point there will be no reason for a human to write code at all anymore, because AI code will be created faster, more accurately, and more cheaply.
This is only a problem during the "software engineers haven't been fully replaced and we're still trying to train new ones" phase. Same for the college issue of "AI wrote my college essay." Soon enough, we'll wonder why college even exists when AI can do it better and faster 100% of the time.
anengineerandacat@reddit
This is an exaggeration, and highly unlikely especially within the next "3-5 years".
Software Development isn't "just" programming... the bulk of my career has been essentially requirements gathering and understanding the actual problem, while looking at the technology available to me and creating a solution around it.
All Software Developers / Engineers / Coders / Programmers / etc. are essentially individuals who use software to create solutions; that is in essence our "work".
Will AI be utilized? Most likely, today it's basically only useful for scaffolding and quickly grabbing documentation / tooling to do some light automation.
Will it get better? Most likely... I use ChatGPT quite a bit when learning new languages, to bounce ideas off of, and essentially to rubber-duck with, but it's not my "sole" tool. I still leverage peers, mentors, etc., because I don't "trust" it fully; it has returned totally nonsensical information.
LLMs are "not" the end-all-be-all of AI; their primary "task" is to return a result even if it's incorrect, and that makes them inherently flawed.
I'll only start to become concerned when I encounter an AI solution that quite literally asks me questions back, requesting additional input for clarification and refinement.
csmithku2013@reddit
Microsoft Copilot Studio does this, but what it lacks is nuance. It'll inquire about public data sources to fuel itself and whatnot, but it never bothers to ask whether a software solution is even required. The best amount of code is always zero, which is pretty much the immediate fail case for AI generated code.
anengineerandacat@reddit
Yeah, I feel like AI is to us what digital painting software was to painters when it came into existence.
There was also the fear that content creators would go out of work when procedural generation tools came into existence and yet all it did was allow for folks to do more complex things.
AI will simply raise the bar for the existing workforce while shifting a few industries from requiring highly skilled workers to less skilled ones.
It'll impact a generation of workers and then things will normalize and life will continue.
csmithku2013@reddit
AI code generation is very good at producing a result for a given question, but terrible at problem solving, as that's not what it attempts to do. AI will make a great reference tool in 3-5 years, but I would not worry about new grad software engineers having good long careers in this industry quite yet.
_BreakingGood_@reddit
I don't believe anybody who has used o1 would still believe this is true. It's literally an advanced reasoning model which has shown it can scale virtually unbounded with more compute. It solves a PhD thesis in a dozen prompts. It increased performance on the Codeforces benchmark from 20% to over 80% (prompting the ban referenced in the article in this post).
And this is the preview model. The full release is on the slate for the next month or two. And after that, the ChatGPT 5 version is right around the corner.
Meanwhile, Microsoft announced its AI agent capabilities can perform 20% of core functionality in Windows. Within 3-5 years we will have:
An advanced reasoning model on GPT 5 (or several iterations past GPT 5)
Agent software capable of fully controlling a Windows environment
Once we can do that, what is left for you, the human, to do?
MIC132@reddit
I love how you are this pessimistic/realistic (depending on the viewpoint) about employment/etc. but for some reason optimistic enough to think that UBI is going to be a thing.
_BreakingGood_@reddit
UBI will need to be a thing. Either we get UBI or we all simply starve to death.
MIC132@reddit
Oh, I agree that with increased automation it's probably needed (at least I haven't heard good alternatives).
That doesn't mean it will happen though, just like many other "either we X or we are fucked" things.
csmithku2013@reddit
For starters, verify the problem justifiably needs solving or, if not, provide recommendations for the business process that meet the needs of the requester.
This type of response lacks a significant amount of understanding about what senior engineers actually do during their day. We're not typically fiddling around in algorithms, optimizing the last Hz out of compute. Generally speaking, we don't even care about that until performance is actually the problem we're trying to solve.
Instead, we're in meetings, we're translating business requirements into an actionable plan, and we're handling triage when it's unclear what went wrong after we receive a shrug report. Can AI help supplement that work? Absolutely. Will it ever be capable of all of it? Goodness no.
You’re completely ignoring several things, but let’s start with the immediate nonstarter: airgapped networks still need software, but AI models won’t work without having public knowledge sources.
-Shush-@reddit
I highly doubt you can fully replace programmers, but even if what you say is true and does end up happening, I'll fight to be the last engineer to be replaced.
MuonManLaserJab@reddit
Smashing the looms might help!
GoodGame2EZ@reddit
You think the downvotes are from people having an existential crisis? We know actual reasoning is coming at some point. This is not it. This is still token guessing, just a bit more refined, as with every version. I guarantee it still hallucinates, which is pretty much unacceptable in a true reasoning model. People aren't having a crisis. They just understand the reality of the situation.
josluivivgar@reddit
so obviously either of us could be wrong, because neither of us can predict the future.
but I don't think that's true, simply because it just ISN'T there yet.
predicting what the right answer will be is not the same as reasoning.
fundamentally it's lacking certain aspects of human thinking: it might answer right 88% of the time, but when it's wrong it's dead wrong, whereas we can apply heuristics that approach the correct answer on really hard problems.
we can realize something doesn't add up with our results and at least know that, and that's a part of the engineering process that just can't be replicated right now. I don't think the current model can do that, fundamentally; I think it'll take another model, or a big breakthrough on the current one, for it to actually replace engineers.
could it make it so that you need fewer engineers and instead pay OpenAI to use their model as part of your engineering workflow?
maybe, but using AI is also very, very expensive, and it needs validating, so it might end up being redundant to use AI over humans: if you still need humans, you spend about the same amount of money and give yourself bad PR for it.
also, the let's-put-AI-into-everything craze seems completely silly to me, as there aren't a lot of ways to monetize LLMs outside of what OpenAI does (which costs billions); actually putting the models to good use for useful, reliable solutions to problems just isn't currently feasible.
instead you end up giving engineers more work by doing that. Or you could shape it into a tool like the code autocomplete VS Code does, which is actually useful and might make programming easier, but you still need programmers.
tldr; great tech, not there yet, fundamentally probably never there with just LLMs unless a breakthrough happens or they integrate it with a new model.
JustAdmitYourWrong@reddit
Hahaha, good one
Ok-Yogurt2360@reddit
We also don't need children anymore because ai makes way better drawings for on my fridge.
MuonManLaserJab@reddit
Do you only care about children because of the drawings? Do you produce code because you want conditional love in your life? This analogy doesn't make sense.
MuonManLaserJab@reddit
If you're confused by the downvotes, remember that powered flight will take us a million years.
josluivivgar@reddit
dude, we're not even remotely close to AI being able to code properly, because it can't understand the actual problem. LLMs are great at pretending to be human, but they can't actually think.
it's gonna take huge leaps in the current model or a completely different model to be able to do the job of a software engineer.
I think 3-5 years is ridiculously optimistic, AI money will most likely run out in 3-5 years when they realize it's not profitable and it's not replacing anyone lol
then it'll stay as research and a niche thing, not used for everything, until the next breakthrough.
maybe the next breakthrough will be the one that replaces us, or maybe it will just be the same hype cycle of higher ups not understanding the technology and blindly wishing it can replace us
Appropriate_Sale_626@reddit
AI and a human are asked to write a function which checks for 3 things and returns the result; both code samples work perfectly fine and are as optimal as they can be, commented, documented, etc. How do you find out which is human written and which is AI? You can't. Unless you want to start injecting metadata into fonts and editors and all that, but most programmers wouldn't want anything else meddling with their ability to debug...
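For instance (an invented example), a "checks 3 things" function in Python converges on the same idiomatic shape no matter who wrote it; there's no authorship signal in it to detect:

```python
def is_valid_order(quantity: int, in_stock: bool, price: float) -> bool:
    """An order is valid if the quantity is positive, the item is in stock,
    and the price is non-negative."""
    return quantity > 0 and in_stock and price >= 0.0
```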
jay791@reddit
Plagiarism, Lol,
tyros@reddit
Yeah, this is the ball that can't stop rolling. No matter their efforts to prevent AI generated code from getting back into AI training models, it will, and in a few years it will make LLMs useless as they spit out regurgitated garbage.
SmokeyDBear@reddit
It's programming Poe's law. Companies have stuffed half-assed programmers into chairs and made more competent people clean up their mess, depressing wages for so long that the crappy code AI writes is indistinguishable from something a company has paid a human to write in the past.
ZirePhiinix@reddit
It doesn't take much to defeat AI detection code.
Set up an AI code generator, tell it to keep generating code until it can no longer be detected as AI generated.
Then we end the human race.
kairos@reddit
OTOH, you can also set up an AI detection tool so that it re-reviews the code until it can say that it was generated by AI. (/s)
mich160@reddit
I can see bots are doing viral marketing now on reddit
Worth_Trust_3825@reddit
baby's first astroturfing
PublicFurryAccount@reddit
Always have been.
dasdull@reddit
The title is misleading. Not all AI-written code is banned, copilot is still allowed. They just ban putting the problem statement into AIs
claythearc@reddit
They're just effectively banning the useful models. 3.5 Turbo, which is what Copilot runs, is pretty mid.
jfcarr@reddit
History repeats...
In the early 19th century, Luddites destroyed textile machinery because they were worried about decreased pay for textile workers and a reduction in product quality. Some historians think the hostility wasn't really against the machines themselves, but that disgruntled workers found them an easy target for protesting unemployment and lowered wages.
sothatsit@reddit
Yeah, the luddite response to AI is overwhelming at times. I can't even imagine how it will get when things really start to change.
The fact that people still refer to LLMs as "stochastic parrots" is mind-boggling to me.
SwingOutStateMachine@reddit
It's a technical description of how they work, not a pejorative. LLMs are inherently statistical machines that are only capable of reproducing what they have observed, not novel concepts.
sothatsit@reddit
You, like many people, use the term "stochastic parrots" to say that the LLMs can't do anything novel. You don't use it to refer to the underlying mechanisms of the machine.
LLMs can clearly solve novel problems. The new problems they benchmark on for the IMO, Codeforces, or other benchmarks are novel problems. They share similarities to existing problems, but they are not in the training dataset. AKA, they are novel.
However, even if you assume LLMs are just "stochastic parrots", they are still absolutely incredible! They can solve most IMO math problems. They can solve most Codeforces problems. They can answer most graduate-level questions. That is unbelievable.
SwingOutStateMachine@reddit
The reason is that the mechanisms of a machine imply the capabilities of a machine. LLMs are inherently statistical by nature, and are inherently incapable of reasoning. All they are capable of doing is learning streams of tokens, and producing statistically likely streams of tokens in response to an input. Nothing more, nothing less.
When an LLM solves a "novel" problem, as you say, it's because they are similar to existing problems. That means, by definition, that it's not a novel problem! Even if the exact problem isn't in their dataset, a similar one means that it is statistically likely to find a solution.
In fact, the example problems you gave are exactly those that are statistically easy to train for. IMO problems follow similar patterns, as do Codeforces problems. If you set an LLM on an actually novel mathematical or computational problem, it breaks down very quickly.
This is why they seem "incredible". There's a lot of computational effort, training time, and training material that has gone into these machines. They have been tuned and tuned and tuned until the statistics are almost perfect - or seem perfect. They work well for problems where there are lots of examples and training sets, but are incapable of reasoning towards completely novel problems.
sothatsit@reddit
If they're not novel, then it's not impressive that humans can solve the problems.
Oh wait? It's very impressive that humans can solve these problems. Hmmm. Maybe the LLMs are doing something that is impressive.
Here's another example for you: I wanted to estimate how many spoonfuls it would take to eat a bowl of beans. Is this in the training dataset? No way.
And yet, it did a remarkable job at it: https://chatgpt.com/share/66e83072-4e9c-8001-86b0-01f12dd9cc15
Now, is this just combining geometry and intuitions about beans? Sure. But it's still novel, since it hasn't been done before.
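(For what it's worth, the core of that estimate is just a volume ratio plus a couple of assumed numbers; everything below is invented to show the shape of the calculation.)

```python
import math

bowl_radius_cm = 7.0
bowl_cm3 = (2 / 3) * math.pi * bowl_radius_cm**3   # hemispherical bowl, ~718 cm³
spoon_cm3 = 15.0                                   # a level tablespoon, ~15 ml

# If beans and sauce fill the bowl, bean geometry cancels out and the
# estimate reduces to a volume ratio:
print(round(bowl_cm3 / spoon_cm3))                 # ≈ 48 spoonfuls

# Counting individual beans takes extra assumptions: an ellipsoidal bean
# (semi-axes 0.6 × 0.4 × 0.25 cm) and a loose packing fraction of ~0.6.
bean_cm3 = (4 / 3) * math.pi * 0.6 * 0.4 * 0.25    # ≈ 0.25 cm³ per bean
print(round(bowl_cm3 * 0.6 / bean_cm3))            # ≈ 1700 beans
```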
It seems that the contention is around the definition of "novel". To me, novelty is anything that has not been done before. But it seems, to people here, novelty is doing something completely inhuman. Based on that, I don't know what you would consider novel.
Maybe something like the ARC-AGI prize, which AIs are gradually getting better at? https://arcprize.org/
SwingOutStateMachine@reddit
Again, this is just statistics. It is impressive - just as any complex system is impressive - but it is not intelligence and it is not reasoning.
sothatsit@reddit
What would it take for you to be convinced it was reasoning?
SwingOutStateMachine@reddit
An entirely different architecture and technical foundation. My point is that LLMs by definition cannot reason.
sothatsit@reddit
Ah, you seem to define reasoning based upon belief or faith in human reasoning, not in capability.
Since you can never falsify your belief, you can always say you were right, even if future LLMs can solve 99% of knowledge work. You can always point to them and say, "well, they're just doing the same tasks humans could already do and that they've already seen, so not reasoning! They're just combining known skills of maths, logic, coding, English, etc..."
It's a particularly nice corner to place yourself in if you want to be right, since no one can prove you wrong. But it's not very useful to be in a corner.
For example: I could also say that humans do not have free will by definition, because we are just a bunch of neurons firing in a chemical soup in our brains and bodies. Therefore, we could just be simulated and are just carrying out a pre-defined future for the universe based on physics. It impressively imitates free will, but it's just a trick - just as LLMs reasoning apparently just imitates reasoning to you.
Amiron49@reddit
When it would produce valid Unity Code without constantly hallucinating non existent APIs for anything that goes beyond beginner problems.
Or at least not misuse Lerp just because every other piece of beginner Unity code makes the same mistake.
So basically once it demonstrates that it will use the correct solution despite the overwhelming data pointing it towards a different one.
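(For readers outside Unity: the classic Lerp misuse referenced above, translated into plain Python. `lerp` below is just the standard linear-interpolation formula; the Unity version is `Mathf.Lerp`.)

```python
def lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t

target = 10.0

# Misuse: re-lerping from the current position with a fixed fraction every
# frame. This is exponential easing, not linear interpolation, and its speed
# silently depends on the frame rate.
pos = 0.0
for _ in range(60):               # one second at 60 fps
    pos = lerp(pos, target, 0.1)
print(f"{pos:.2f}")               # ≈ 9.98

pos = 0.0
for _ in range(30):               # the same second at 30 fps
    pos = lerp(pos, target, 0.1)
print(f"{pos:.2f}")               # ≈ 9.58: same code, different result

# Intended use: a fixed start point and a fraction derived from elapsed time.
duration, dt = 2.0, 1 / 60
pos, elapsed = 0.0, 0.0
for _ in range(60):
    elapsed += dt
    pos = lerp(0.0, target, min(elapsed / duration, 1.0))
print(f"{pos:.2f}")               # 5.00: exactly halfway through a 2 s move
```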
BillyTheClub@reddit
The luddites were actually cool and dope. They were demonized by an owner class attempting to exploit them and illegally steal from them. They had a legal right to be involved in decision making in their factories but the owners shut them out. When the government refused to honor their legal rights, they took things into their own hands. They had to be very secretive about their activities and membership so relatively little of their philosophy and goals survives today.
Moltenlava5@reddit
Codeforces problems aren't providing any meaningful economic output, I really don't see the point of your argument. Do you also argue that bots should be allowed to compete in chess tournaments?
CanvasFanatic@reddit
Oh STFU
madman24k@reddit
I think that if you're going to have an open book/internet competition for coding, then you should have AI available to you. At that point, the competition is less about your ability to code, and more about using the resources available to you. If you're wanting to challenge people based on their skill, take away the internet and only allow access to documentation.
mcpower_@reddit
Can we ban AI-written articles like this one from /r/programming too?
wolfpack_charlie@reddit
And accidental body horror AI generated images too please
Tail_Nom@reddit
That surreal bullshit is the only thing it's good at! You can't ban that!
Seriously, though, find an okay model that offers some free credits and spend an afternoon explicitly trying to get it to generate some weird shit. It's great ♡
zigs@reddit
I only see a guy floating in the air and a guy awkwardly holding his hand in front of his crotch
Where's the body horror?
wolfpack_charlie@reddit
All the faces in the back, guy under entry sign has an extra hotdog finger, brown suit's hands.
But also just talking in general about the abomination that is AI image generation. Heads on backwards, background fusing with foreground, mangled hands, mangled faces, too many limbs, limbs that are attached the wrong way, and so on.
NiskaHiska@reddit
One of the background characters' legs is shorter than the other, the guy crossing his arms has a snapped finger, there are a couple of melting faces in the background, etc.
Bonus: the barrier behind the ChatGPT robot is inconsistent.
wRAR_@reddit
It's clear a blanket ban on favtutor isn't something the mods want.
MainEditor0@reddit
Just admit that programming is cooked and will become something like chess or Go (the board game)...
mosaic_hops@reddit
Programming isn't cooked. AI is a tool that needs to be wielded by a programmer. It might make some programming tasks more efficient, which might mean a team of 10 programmers can do the work of 12. Which means the company grows faster and hires more programmers.
Majik_Sheff@reddit
I remember being young...
dats_cool@reddit
Sorry and what do you do for a living that won't be disrupted by AI?
mosaic_hops@reddit
Did backhoes and nail guns put construction workers out of work? Or, perhaps closer to home, did compilers and higher level languages put programmers out of work? No, they removed largely menial tasks and increased developer productivity. AI is no different: when it works at all, it automates away some of the more menial bits of your work, allowing you to focus more on the big picture. AI is kind of like a compiler. A really bad one that needs to be double and triple checked, and where you need to keep guessing at what to put into it in order to get the right thing out of it. But sometimes what it spits out is actually usable and saves a little bit of time.
Big_Combination9890@reddit
In other totally surprising news, Olympic swimming contests don't allow Jet-Skis or Sea-Scooters.
szmate1618@reddit
It is surprising; it wasn't very long ago that AI was fucking useless in a programming competition. Most models still are.
Additional-Bee1379@reddit
I feel like some people just want it to be useless forever. The reality is the benchmarks are getting better every couple of months.
It goes like this with every benchmark. First it's "Computers will never be able to do this", then it's "It still sucks because some human is better", and then "Well, it was a shitty toy problem anyway".
crazyeddie123@reddit
Some people want to keep getting paid to crank out code and not become product managers who sometimes ask AI to code for them.
Big_Combination9890@reddit
It has real-life value, and has had for a long time. The problem is not that generative AI is useless; it's that it isn't magic, but is often sold as if it were.
I use generative AI every day in my dev work. It rocks. Complex SQL statements from natural language, writing test cases, generating code documentation, creating boilerplate, extending existing APIs... there are A LOT of use cases it's really, really good at, and in fact has been good at for a long time. Hell, text-davinci-003 was already amazing.
But, for some unfathomable reason, some people seem to want it to be magic, an actual artificial intelligence, that just does SciFi things...and sorry no sorry, but that just isn't going to happen.
Regardless of what some benchmarks (which mean less than nothing in real world applications) say: fundamentally, LLMs today are not much different from their earliest transformer based ancestors. They are sequence completion engines. That's it. They cannot reason, they cannot think, they cannot understand. Anyone claiming otherwise should really, really go to the StatQuest YouTube channel, learn the basics of what an autoregressive transformer model is and does, and then come back to the discussion.
They can mimic some of these things to a degree where they become useful for certain applications, but please accept that an LLM will not magically start thinking just because we let it eat its own output for some more iterations, for much the same reason why a horse carriage will not magically turn into an airplane, no matter how well aged the wood is, and how well bred the horses.
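("Sequence completion engine" concretely means something like this minimal caricature: the model only ever scores candidate next tokens given the tokens so far, and generation is just append-and-repeat. `next_token_probs` is a toy stand-in for a transformer forward pass, not any real model.)

```python
def next_token_probs(tokens: list[int], vocab_size: int) -> list[float]:
    # Toy stand-in: a real model would run attention layers over `tokens`
    # and return a probability for every vocabulary entry.
    scores = [(t + sum(tokens)) % vocab_size + 1 for t in range(vocab_size)]
    total = sum(scores)
    return [s / total for s in scores]

def complete(prompt: list[int], max_new: int, vocab_size: int = 16) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new):
        probs = next_token_probs(tokens, vocab_size)
        # The entire "generation" step: pick a likely next token, append, repeat.
        tokens.append(max(range(vocab_size), key=probs.__getitem__))
    return tokens

print(complete([3, 1, 4], max_new=5))
```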
Additional-Bee1379@reddit
That doesn't seem entirely accurate, as a lot of these benchmarks are third party, and things like "solves this many PhD level math problems" seem a lot more useful than some arbitrary speed comparison.
Honestly, I think the discussion of whether or not it understands is pointless; the only things that matter are how often it is right and what tasks it can complete, and there are critical values for both. Saying it's just a completion engine isn't helpful, because I don't see a fundamental limit on capability following from that fact. Nobody thinks an LLM will magically become sentient from doing the same thing, which is why the implementations aren't sitting still either: o1 is trained fundamentally differently than the preceding models.
WelpSigh@reddit
The benchmarks are largely meaningless. Being able to solve problems that it can (in essence) look up solutions to on GitHub isn't very interesting. I can do that too. It still regularly generates code that can only be described as unhinged, and it will keep doing that until its training set no longer contains millions of lines of bad code for it to model (this will never occur).
unicynicist@reddit
Funny how humans don’t seem to suffer from this. Do you think there's something about how humans think about and author code that's inherently superior, that can never be replicated by a machine?
I mean, I’ve read and written my share of awful code, and somehow I’m still employed. Maybe it’s my knack for generating convincing bullshit in meetings. Which is mildly troubling because people seem to think convincing bullshit is all that LLMs are good for.
Additional-Bee1379@reddit
The benchmarks are actively countering this by including new questions; you can't just claim it is memorizing things. It was also tested on new math questions, such as this year's AIME, where it scored 83%-93%, up from GPT-4o's 12%.
Ok, but nobody is making this claim in the first place.
ESHKUN@reddit
More like jet skis built from human body parts that have like a 10% chance of just exploding.
Additional-Bee1379@reddit
It's more similar to chess computers being banned in chess tournaments because they are now good enough to actually be better than humans.
picklesTommyPickles@reddit
LOL clearly you haven’t actually read the output of chat GPT and other generative code tooling. You’re handed the scaffold of something that could maybe work but also might have completely missed the mark. Humans still have to go through every line and make corrections and verify its functionality.
At best, chat GPT is a code scaffold tool that can “understand” English prompts instead of explicit generation parameters
Additional-Bee1379@reddit
What? These things are simply benchmarked. It scored in the 89th-93rd percentile when compared to humans, and that is without human interference.
a_marklar@reddit
Sure, but you can take it and say 'heard that before'. So many fantastic benchmarks, much wow.
Additional-Bee1379@reddit
What is your point? AI is also starting to seep into practical applications. AlphaFold opened up an entire new field of biochemistry, Copilot for programming is getting better and better, and a lot of media and art jobs are being displaced.
a_marklar@reddit
My point is that those benchmark percentiles are not something most people care about. The person you're replying to effectively said "who cares about benchmarks this is how it's worked for me in real life".
Big_Combination9890@reddit
Yes, that was pretty much what I just said ... ?
Additional-Bee1379@reddit
The point is there was no need to ban chess computers when chess computers were just bad.
Big_Combination9890@reddit
Also, it would have been pretty difficult to drag an IBM Model 5150 trailing a power cord into the room without anyone noticing.
sothatsit@reddit
This is the crossover point from human dominance to AI dominance in coding competitions.
That's what is interesting here.
Before o1, they didn't observe people gaining serious advantages using AI. Now, they have observed someone using o1-mini crushing their competition. This represents a meaningful gain in the capabilities of AI, even if it is not all that surprising that they banned it.
sothatsit@reddit
Bruh, why the downvotes?
This is the same situation that chess faced. In the mid-2000s people started to accuse others of computer assistance, and organizers had to start instituting rules banning it in chess competitions. This marked an interesting crossover point where people had to admit that chess engines were no longer just useful for learning and preparation, but could also be used to cheat.
It marked the start of cheat detection methods, and the ceding of dominance to consumer computers. It took a decade to go from a supercomputer, Deep Blue, defeating Garry Kasparov to computer assistance actually becoming an issue relevant in competitions. It's an interesting point in time!
Big_Combination9890@reddit
Because we aren't at "the crossover point from human dominance to AI dominance in coding competitions.", we are at the crossover point to "people losing in these competitions might start complaining about AI being used, and no one wants to deal with that sort of internet-drama."
Sorry to maybe burst a bubble here, but AI still sucks for implementing algorithms.
If it has something in its training data, sure, there is a good chance the sequence-completing stochastic parrot will autoregressively regurgitate enough of it in a coherent-enough way to actually solve the problem minus some edge cases, but it's still mostly nonsense.
sothatsit@reddit
Ah, the willful ignorance is strong I see.
From OpenAI: "Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions."
I'm sorry if this offends your sensibilities, but this is the time.
TheMostUser@reddit
Most of the people on the site are casuals, so 89th percentile is not as high as it sounds.
Its Elo is around 1800, which is very, very far from winning any competition (and even a lot lower than my Elo).
For comparison, the highest rated human has an Elo of a bit over 4000 (https://codeforces.com/profile/tourist).
Additional-Bee1379@reddit
The claim is that we are at the crossover point, which we are: AI is not yet winning, but it is starting to be a competitor.
sothatsit@reddit
Exactly!
codethulu@reddit
this just says that codeforces competitions arent interesting problems.
there is an interesting corollary that many industry problems are also not interesting problems.
but they are correct. these machines cannot reason. they lack encoding and process for it. simple matter of information theory.
there is no magic here. just a machine working well on toy problems designed to evaluate human performance not machine performance. humans operate under a different set of constraints, so this should not be surprising.
the machine is not god.
Additional-Bee1379@reddit
You are attacking claims that are not made by anyone.
Big_Combination9890@reddit
I guess there are easier ways to show that one doesn't have an argument than immediately reaching for the ad hominem, but sure, whatever gets the job done.
Additional-Bee1379@reddit
We absolutely are; the claims are that the AI scored better than 93% of competitors and was just below "master" level.
gabrielmuriens@reddit
The cope is strong with this sub.
Look, mate. I really, really don't want to lose my job to the proliferation of these things. Or at least I hope that I will be able to make sufficient use of them in my micro-company that I'll remain competitive.
But this new crop of LLM models that use an imitation of reflective thinking, of which the beta OpenAI o1 models are the first examples, can already solve small to mid-size programming tasks more efficiently, and often better, than human developers can.
Coding challenges absolutely fall into this category. The o1 models are, demonstrably, better at solving these problems than the majority of professional human programmers are. I have also seen them solve graduate-level physics problems - problems that would take a PhD student days to solve. Other, more specialized models can already place better than two-thirds of the competitors on olympiad-level math problems.
These are facts. I don't think your reasoning based on regurgitating the "regurgitation" argument has standing anymore.
As of today, publicly available LLMs can outperform the average college-educated human at any intellectual task, provided they can sufficiently interface with the problem (they don't, for example, have access to tools like email, Google, or a virtual machine in which to build software). And in many cases they don't just outperform the average person; they often outperform specialists as well. And you can see their reasoning as they work it out.
So no, the time of the "sequence-completing stochastic parrots" is over. Or maybe the "sequence-completing stochastic parrots" are simply surpassing human-level intelligence as it relates to doing economically useful tasks.
I am sorry, but from everything I've seen, these are the facts. And the implications fucking scare me. But there is no use in putting our heads in the sand, however these things will turn out.
atred@reddit
Programming is about reaching the destination, not about which tool you use to get there. It's not at all like an Olympic sport, so it's hard to draw parallels.
BigBoiiInDaCluhb@reddit
Tour De France is about covering a distance in the least time too, doesn't mean allowing crystal meth is fair sport.
disoculated@reddit
But meth is ok in code competitions… right? I mean, it is, right? ::looking around nervously at all these empty Adderall bottles::
Big_Combination9890@reddit
I'm not sure if you have ever googled "Codeforces", or read the actual article OP links to, but Codeforces is a platform for coding COMPETITIONS. So yes, how you get there matters.
Because of course I can ride my motorbike to cover 42.195 km really quickly, but I'm pretty sure "it's about reaching the destination, not about which tool I use to get there" isn't gonna sit too well with the judges at a marathon.
atred@reddit
It depends on the type of competition. I mean, things change: at some point in time people were not allowed to take a calculator into math exams; now that's almost a requirement.
Cheraldenine@reddit
Not just that, they're taking a bold stance not to allow them.
Merry-Lane@reddit
It’s more like "doping drugs not allowed in the Tour de France".
The stance is full of irony.
generally-speaking@reddit
They should; it would make the whole thing a lot more entertaining. Especially because you get banned for entering other lanes, and turning a jet ski around in an Olympic swimming lane without crossing into the adjacent lanes would be really difficult.
Pharisaeus@reddit
"Chess engines banned from chess competitions" duh
TestExperimentBeach@reddit
And as with chess competitions, people who seriously argue for including AI in them really miss the point. It's as though the human element isn't in the room for some folks.
Hawk_Irontusk@reddit
This article looks like it was written by AI.
faustoc5@reddit
A much better approach to this issue is to create a new contest category of much harder problems where AI assistance is required: multi-layer, multi-language problems where you not only solve algorithms but create whole architectures composed of different layers and programs built in different languages and platforms.
jhill515@reddit
My professional stance: GOOD!
Rationale: Coding competitions are our flavor of athletic competition. I expect to go in with nothing but my experience, guile, and cunning. And so in my youth I trained accordingly, in the face of real cheating I'd witnessed, even by fellow classmates competing against me. All because they wanted a gold star in their trophy case, while all I wanted was to measure myself against (and maybe surpass) my respected colleagues in a fair test of skill.
I also want to say this, as a hiring manager at various high-tech startups: I and everyone I have worked with do not give two shits about how you placed in any competition; your activity is merely a talking point to break the ice. Seriously, I've trashed resumes that were loaded with so many verbal trophies of truly no consequence in the professional world, regardless of the candidate's professional achievements. I hate it when people take a kindhearted game and gloat about their success. However, I will always respect the person who went into the arena and came out with their head held high because they know the taste of a hard-fought victory and the cost of achieving it. And they know the taste of bitter defeat because they didn't measure up to where they thought they were, and yet persevered. Those are fantastic talking points that help me get to know the real person before I consider them for my teams.
hextree@reddit
So, is this just an honour-system ban? Or are they actually attempting to detect use of AI and ban users for it? If the latter, this is a slippery slope, studies have already shown that the best 'AI detection tools' simply don't work with any realistic accuracy, and are leading to students in Universities getting unfairly penalised for code they've written themselves.
tyros@reddit
The whole concept of "AI detection tools" is a contradiction. The whole point of LLM output is to be indistinguishable from human writing. If the LLM achieved that goal, then by definition you can't feed its output to another AI and have it tell you it's AI generated. If you can, then the LLM didn't achieve its goal in the first place.
pnedito@reddit
This is like saying a regular expression is better at reading than a human.
Additional-Bee1379@reddit
The article is shitty but the point stands. For the first time, AI actually scored better than most participants, beating 93% of competitors. As always, today's AI performance is the worst it will ever be; we are entering the era where AI actually outperforms humans on these leetcode problems. Make of that what you will, but it is happening.
Majik_Sheff@reddit
I would propose that this says more about the overall quality of the human output.
tdammers@reddit
And possibly also about the testing method.
Additional-Bee1379@reddit
Ok, so how long do you think it will be before the top humans are beaten? In other fields, the time between beating most humans and beating the top of the field has been pretty short.