My old boss made a terrible decision based on AI benchmarks and turns out I was doing the same thing
Posted by Wise_Slice6303@reddit | ExperiencedDevs | 25 comments
My old boss fired his entire frontend team last month because he saw some demos and thought one backend dev could cover everything. Well, three weeks later I'm cleaning up the mess: the site is broken on mobile, there's zero accessibility, and nobody knows how anything works.
Watching him make that call based on numbers he didn't understand stuck with me. Turns out I was doing the same thing when I picked my own coding model. I've been on GLM since 4.7, switched because it was cheaper and worked fine. When GLM 5.1 came out it felt like a real upgrade, so I stuck with it.
GPT-5.5 came out the other day, so I checked SWE-Bench Pro: it's 58.6 vs 58.4 for GLM-5.1, basically the same score. Both numbers are published by the companies themselves, and the pricing gap between them keeps shrinking too.
At this point idk if I'm on GLM 5.1 because it's better or just because it's what I know. Same trap my old boss fell into, just from the other side. Running my own tests this week, because company benchmarks mean about as much as self-reported experience on a resume.
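A minimal sketch of what "running my own tests" could look like: score each model by pass rate on a private task set with checks you control, instead of trusting vendor-published numbers. The model stubs and the task check below are hypothetical placeholders standing in for real API clients and a real test suite.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True if the model's output passes

def pass_rate(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose output passes its check for a given model."""
    passed = sum(1 for t in tasks if t.check(model(t.prompt)))
    return passed / len(tasks)

# Hypothetical stand-ins for real model API calls.
def model_a(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"

def model_b(prompt: str) -> str:
    return "def add(a, b):\n    return a * b"

# In practice these would be tasks drawn from your own codebase,
# with checks that run your actual tests.
tasks = [
    Task("Write add(a, b) returning the sum", lambda out: "a + b" in out),
]

print(pass_rate(model_a, tasks))  # 1.0
print(pass_rate(model_b, tasks))  # 0.0
```

The point isn't the toy check; it's that pass rate on tasks you wrote tracks your workload, while two vendors' self-reported scores on the same public benchmark may not.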
eyesopen18819@reddit
This reads like ad slop for GLM
aeroverra@reddit
I don't understand how this is still up.
Man Reddit is really starting to lose its touch with all these bots
therealslimshady1234@reddit
AI is very good at producing bad to average code fast. It's essentially a tech-debt machine. At best you can do maintenance with it and tiny features.
robhaswell@reddit
That's just not true. You absolutely can produce well-constructed and maintainable code. You just have to put in more effort in guiding what you want and knowing what good output looks like. Garbage in -> garbage out. A senior dev driving an AI is very different from a junior doing the same thing.
aeroverra@reddit
I have come to the conclusion that maybe it can, but the amount of time to set it up and continue to babysit it isn't much better than doing it myself and learning more things along the way.
therealslimshady1234@reddit
Babysitting an LLM is not the genius move you think it is; all the downsides of AI are still present, you're just slower than before.
Altruistic-Bat-9070@reddit
Honestly, anyone still saying AI can't make good code isn't using Copilot CLI with Claude.
I think if you are starting a new codebase then sure, you probably want a human to do that and to come up with the full structure for where things go, etc. If you are jumping into a mature codebase and you let Copilot scour through everything first, though, it makes some great stuff.
Also done code is > perfect code.
therealslimshady1234@reddit
I lost IQ just reading this
Altruistic-Bat-9070@reddit
Well, you're a web dev, so it may not work as well for you; I have been less impressed with it when doing FE work.
Ok-Yogurt2360@reddit
The problem with code is that "done code" is often worse than "no code". Bad code can have costly consequences.
Altruistic-Bat-9070@reddit
It just isn’t.
I have worked in codebases with technical debt accumulated over 25 years that would make many here cry. If those had been coded by current Claude, the debt would be significantly lower. At least when coding in Python; I have been less impressed with its FE coding.
TheBoringDev@reddit
I’m using Claude, bad to average is about right.
Altruistic-Bat-9070@reddit
Are you using 4.5 or sth
modaldere@reddit
To be fair, I'm also good at producing bad to average code, just slower.
dEEkAy2k9@reddit
I see AI as a starting point. Another tool in your toolbox to get stuff done. Prompt what you want to do, what the issues might be, and what suggestions/best practices there are. Then refine from there.
DarthCalumnious@reddit
I don't think it's too hot a take to say that the model benchmark numbers don't have any bearing on your situation - you talk of a mess with nobody knowing how anything works and no accountability.
The model is a footnote; the challenge before you and your manager is to work out how to integrate AI into your engineering culture without it being a headless, shambling train wreck. This means discipline and accountability for the work of our mindless LLM minions.
Wise_Slice6303@reddit (OP)
Fair point. His problem wasn't picking the wrong model, it was thinking the model could replace institutional knowledge and teamwork. I'm not making that mistake but I am guilty of picking tools based on vibes and numbers instead of real world testing. Both of us just trusted the marketing I guess, different scale but same lazy thinking...
CodelinesNL@reddit
What the heck is this comparison even?
aymswick@reddit
Fake ass LLM slop
BlueDolphinCute@reddit
Inertia is real tho. Took me a while to admit I was sticking with my model out of habit not because I actually compared anything
Witty_Indication2017@reddit
Once the scores are this close the only benchmark that matters is your own codebase. Been saying this for a while
iiiio__oiiii@reddit
Strangely, different models have different personalities, and developer experience also depends on the match between the engineer and the model. I can't quantify it yet, but I have favourite models that just speak to me even without customisation. Other, more capable models feel off, and I have to add more customisation to suit my preferences (I only found out I had those preferences because I was testing the more capable models).
My colleague has a different preferred model, because the way he prompts is also different.
And even though this behaviour is somewhat fixed by adding system prompts, it still sometimes shows through, and it creates "micro friction" when working with the models.
Unfortunately, I can't give specifics because I can't describe it. It just feels "off", or "wtf?!", when the model presents me with its plan.
lardsack@reddit
use claude next time
PmMeCuteDogsThanks_@reddit
Things that didn't happen
Polite_Jello_377@reddit
It’s not remotely the same thing