My old boss made a terrible decision based on AI benchmarks and turns out I was doing the same thing
Posted by Wise_Slice6303@reddit | ExperiencedDevs | 25 comments
My old boss fired his entire frontend team last month because he saw some demos and thought one backend dev could cover everything. Well, three weeks later I'm cleaning up the mess: the site is broken on mobile, there's zero accessibility, and nobody knows how anything works.
Watching him make that call based on numbers he didn't understand stuck with me. Turns out I was doing the same thing when I picked my own coding model. I've been on GLM since 4.7, switched because it was cheaper and worked fine. When GLM 5.1 came out it felt like a real upgrade, so I stuck with it.
GPT-5.5 came out the other day, so I checked SWE-Bench Pro: it's 58.6 vs 58.4 for GLM-5.1, basically the same score. Both numbers are published by the companies themselves, and the pricing gap between them keeps shrinking too.
At this point idk if I'm on GLM 5.1 because it's better or just because it's what I know. Same trap my old boss fell into, just from the other side. Running my own tests this week, because company benchmarks mean about as much as self-reported experience on a resume.
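A minimal sketch of what "running my own tests" could look like: score each model by pass rate on a private task set with checks you control, instead of trusting vendor-published numbers. The model stubs and the task check below are hypothetical placeholders standing in for real API clients and a real test suite.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True if the model's output passes

def pass_rate(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose output passes its check for a given model."""
    passed = sum(1 for t in tasks if t.check(model(t.prompt)))
    return passed / len(tasks)

# Hypothetical stand-ins for real model API calls.
def model_a(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"

def model_b(prompt: str) -> str:
    return "def add(a, b):\n    return a * b"

# In practice these would be tasks drawn from your own codebase,
# with checks that run your actual tests.
tasks = [
    Task("Write add(a, b) returning the sum", lambda out: "a + b" in out),
]

print(pass_rate(model_a, tasks))  # 1.0
print(pass_rate(model_b, tasks))  # 0.0
```

The point isn't the toy check; it's that pass rate on tasks you wrote tracks your workload, while two vendors' self-reported scores on the same public benchmark may not.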
eyesopen18819@reddit
This reads like ad slop for GLM
aeroverra@reddit
I don't understand how this is still up.
Man Reddit is really starting to lose its touch with all these bots
therealslimshady1234@reddit
AI is very good at producing bad to average code fast. It's essentially a tech-debt machine. At best you can do maintenance with it and tiny features.
robhaswell@reddit
That's just not true. You absolutely can produce well-constructed and maintainable code. You just have to put in more effort in guiding what you want and knowing what good output looks like. Garbage in -> garbage out. A senior dev driving an AI is very different from a junior doing the same thing.
aeroverra@reddit
I have come to the conclusion that maybe it can, but the amount of time to set it up and continue to babysit it isn't much better than doing it myself and learning more things along the way.
therealslimshady1234@reddit
Babysitting an LLM is not the genius move you think it is; all the downsides of AI are still present, you're just slower than before.
Altruistic-Bat-9070@reddit
Honestly, anyone still saying AI can't make good code isn't using Copilot CLI with Claude.
I think if you are starting a new codebase then sure, you probably want a human to do that and to come up with the full structure for where things go, etc. If you are jumping into a mature codebase and you let Copilot scour through everything first, though, it makes some great stuff.
Also done code is > perfect code.
therealslimshady1234@reddit
I lost IQ just reading this
Altruistic-Bat-9070@reddit
Well, you're a web dev, so it may not work as well for you; I have been less impressed with it when doing FE work.
Ok-Yogurt2360@reddit
The problem with code is that "done code" is often worse than "no code". Bad code can have costly consequences.
Altruistic-Bat-9070@reddit
It just isn’t.
I have worked in codebases with technical debt accumulated over 25 years that would make many here cry. If those had been coded by current Claude, the debt would be significantly lower. At least when coding in Python; I have been less impressed with its FE coding.
TheBoringDev@reddit
I’m using Claude, bad to average is about right.
Altruistic-Bat-9070@reddit
Are you using 4.5 or sth
modaldere@reddit
To be fair, I'm also good at producing bad to average code, just slower.
dEEkAy2k9@reddit
I see AI as a starting point. Another tool in your toolbox to get stuff done. Prompt what you want to do, what the issues might be, and what suggestions/best practices there are. Then refine from there.
DarthCalumnious@reddit
I don't think it's too hot a take to say that the model benchmark numbers don't have any bearing on your situation - you talk of a mess with nobody knowing how anything works and no accountability.
The model is a footnote; the challenge before you and your manager is to work out how to integrate AI into your engineering culture without it being a headless, shambling train wreck. This means discipline and accountability for the work of our mindless LLM minions.
Wise_Slice6303@reddit (OP)
Fair point. His problem wasn't picking the wrong model, it was thinking the model could replace institutional knowledge and teamwork. I'm not making that mistake but I am guilty of picking tools based on vibes and numbers instead of real world testing. Both of us just trusted the marketing I guess, different scale but same lazy thinking...
CodelinesNL@reddit
What the heck is this comparison even?
aymswick@reddit
Fake ass LLM slop
BlueDolphinCute@reddit
Inertia is real tho. Took me a while to admit I was sticking with my model out of habit not because I actually compared anything
Witty_Indication2017@reddit
Once the scores are this close the only benchmark that matters is your own codebase. Been saying this for a while
iiiio__oiiii@reddit
Strangely, different models have different personalities, and developer experience also depends on the match between the engineer and the model. I can't quantify it yet, but I have favourite models that just speak to me even without customisation. Other, more capable models feel off, and I have to add more customisation to suit my preferences (I only found out I had those preferences because I was testing the more capable models).
My colleague has a different preferred model, because the way he prompts is also different.
And even though this behaviour is somewhat fixed by adding system prompts, it still sometimes shows through, and it creates "micro friction" when working with the models.
Unfortunately, I can't give specifics because I can't describe it. It just feels "off", or "wtf?!", when the model presents me with its plan.
lardsack@reddit
use claude next time
PmMeCuteDogsThanks_@reddit
Things that didn't happen
Polite_Jello_377@reddit
It’s not remotely the same thing