llama.cpp is a vibe-coded mess
Posted by ChildhoodActual4463@reddit | LocalLLaMA | 46 comments
I'm sorry. I've tried to like it. And when it works, Qwen3-coder-next feels good. But this project is hell.
There are like 3 releases per day and 15 new tickets each day. Every git tag introduces a new bug: corruption, device-lost errors, segfaults, grammar problems. This is just bad. People with limited coding experience merge fancy stuff with very limited testing. There's no stability whatsoever.
I've spent too much time on this already.
lemondrops9@reddit
Are you high?
HealthCorrect@reddit
Tried using their libllama API, god there’s no docs and the codebase is a mess.
Formal-Exam-8767@reddit
Who actually reinstalls llama.cpp 3 times a day?
My installation is months old and it works, and will continue working no matter the state of the repository or development. Software is not food that spoils, or a car that needs servicing after some mileage, to warrant daily updates.
Addyad@reddit
If you want to test the latest stuff like 1-bit models, turboquant and so on, it won't work with months-old llama.cpp versions. So you need to build with the latest upstream patches.
ChildhoodActual4463@reddit (OP)
someone attempting to debug an issue
Formal-Exam-8767@reddit
This can hardly be considered a meaningful contribution.
Kitchen-Year-8434@reddit
Are we talking about llama.cpp or vllm here? llama.cpp is my fallback when I want to drop to something that'll just work.
Leflakk@reddit
I feel like you're talking about vllm
Dangerous_Tune_538@reddit
vLLM is actually decent. The codebase is a bit convoluted but still well written. The only problem is the lack of modifiability with their plugin APIs.
Leflakk@reddit
I was referring more to stability issues; vllm (and sglang) can become a nightmare with each new release, especially when you use consumer GPUs
McSendo@reddit
I mean that's not vllm's main audience.
Leflakk@reddit
I think even Ampere pro GPUs struggle too. Moreover, just comparing the number of issues between the llama.cpp and vllm repos says a lot (and I would bet there are a lot more llama.cpp users). vllm is production grade but lacks stability in a general manner
McSendo@reddit
I have a different opinion, but what do you mean by "production grade but lacks stability in a general manner"? That sounds contradictory.
Leflakk@reddit
Yes sorry, I meant the engine is supposed to be production grade while it lacks stability in my opinion. If you use it and find it stable across each new release then I’m happy for you
EffectiveCeilingFan@reddit
Idk man works just fine for me. The docs are shit but docs are always shit.
Total_Activity_7550@reddit
Don't even spend time replying to and arguing with bots, which this author 99% is. Just downvote and report.
Ok-Measurement-1575@reddit
Why would you report someone's opinion, lol.
ChildhoodActual4463@reddit (OP)
You can clean your car yourself, human.
4onen@reddit
Thanks, I did yesterday.
R_Duncan@reddit
ollama is a derivative of it, lm studio is a derivative, and no other inference engine has half the features and the speed of it.
ChildhoodActual4463@reddit (OP)
And that's the problem. They rush features in and introduce bugs. If only they at least had a decent release process, but no, they ship a release every other commit, every day. You can't have stable software like that.
R_Duncan@reddit
You can stick with lm studio or ollama if you just want more stability.
AXYZE8@reddit
Obviously you are not aware of the existence of any other inference engine.
R_Duncan@reddit
vllm does not allow MoE to sit 90% in CPU memory; sglang I never tested. Nexa is hideous and has strange licensing.
jacek2023@reddit
Maybe you could share a description of the actual problem?
Charming_Actuary3079@reddit
And what contributions were you trying to add when you got frustrated?
ambient_temp_xeno@reddit
Apparently all KV-cache quants are considered experimental in llama.cpp, so that's how they're treated. (Another reason not to use it, then.)
Dangerous_Tune_538@reddit
Why not just use another inference engine like vLLM?
twnznz@reddit
Eh, it does a thing.
I'm not part of the millionaire all-in-VRAM-vllm-or-you're-a-peasant crowd (I need hybrid MoE), but granted, it behaves like crap (PP on one core, nowhere near full PCIe, QPI, or memory bandwidth utilisation).
Maybe I need to spend some time with sglang?
EffectiveCeilingFan@reddit
If you’re doing hybrid, then PP appearing to hit one core hard is expected. PP is so massively accelerated by a GPU that just transferring the weights over PCIe is faster than letting the CPU and GPU work simultaneously. That one core at high usage is just feeding the GPU data. That’s my understanding at least.
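Rough numbers for why that can happen (a toy sketch; every figure below is a made-up round number for illustration, not a measurement):

```python
# Back-of-envelope: compute a CPU-resident layer on the CPU vs. stream it over
# PCIe and compute it on the GPU during prompt processing. Assumed figures only.
layer_bytes  = 0.5e9   # assume ~0.5 GB of weights in one layer
pcie_bw      = 25e9    # assume ~25 GB/s effective PCIe 4.0 x16
batch_tokens = 2048    # prompt tokens processed as one big batch
cpu_flops    = 1e12    # assume ~1 TFLOP/s sustained CPU GEMM
gpu_flops    = 50e12   # assume ~50 TFLOP/s GPU

# ~2 flops (mul+add) per weight per token; treat bytes ~ weights for roughness.
layer_flops = layer_bytes * 2 * batch_tokens

t_cpu = layer_flops / cpu_flops                          # compute in place on CPU
t_gpu = layer_bytes / pcie_bw + layer_flops / gpu_flops  # ship weights, then GPU compute

print(f"CPU compute:        {t_cpu*1e3:7.0f} ms/layer")   # ~2000 ms
print(f"PCIe + GPU compute: {t_gpu*1e3:7.0f} ms/layer")   # ~60 ms
# With a big prompt batch, shipping the weights to the GPU wins by a wide
# margin, so the CPU's job collapses to feeding the bus -> one busy core.
```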
twnznz@reddit
Well shit, I'm not getting even 1/15th of PCIe saturation during PP; nor RAM. What is going on :(
nuclearbananana@reddit
They literally have a rule against AI PRs (and close countless ones).
I don't know why they choose to release with every commit. It makes it nearly impossible to know what's actually changed without scrubbing through 10 pages of releases.
ChildhoodActual4463@reddit (OP)
They have a rule stating you must disclose AI use. It does not prevent AI from being used. Which I think is fine, but judging by the amount of stuff that gets merged and released every day, and the amount of bugs I'm hitting, it isn't enough. Try bisecting a bug: you hit 4 different ones along the way.
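For reference, bisecting here means handing git a repro script (a minimal sketch; the CMake invocation is standard llama.cpp usage, but the model, prompt, and crash check are hypothetical placeholders):

```python
#!/usr/bin/env python3
# Hypothetical test script for `git bisect run ./bisect_test.py`:
# exit 0 = good commit, 1 = bad commit, 125 = cannot test (skip this commit).
import subprocess, sys

def run(cmd, timeout=None):
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)

# Rebuild the current checkout; skip commits that do not even compile.
if run(["cmake", "-B", "build", "-DGGML_VULKAN=ON"]).returncode != 0 \
        or run(["cmake", "--build", "build", "-j"]).returncode != 0:
    sys.exit(125)

# Placeholder repro: run a short generation, treat any crash or hang as "bad".
try:
    r = run(["./build/bin/llama-cli", "-m", "model.gguf",
             "-p", "prompt that triggers the bug", "-n", "64"], timeout=300)
except subprocess.TimeoutExpired:
    sys.exit(1)
sys.exit(0 if r.returncode == 0 else 1)
```

Then `git bisect start`, mark one good and one bad commit, and `git bisect run ./bisect_test.py` does the rest, which is exactly where hitting unrelated bugs along the way ruins the result.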
hurdurdur7@reddit
And how exactly will you accept PRs from the public and make sure that none of them use AI-generated code?
They are doing their best to filter them out. That's all. And the project is messy because the LLM landscape itself is messy.
Goldkoron@reddit
At this point I just made my own stable private llama.cpp build where I vibe code my own fixes to all the vibe-coded problems in llama.cpp.
At least I now have:
- A better multi-GPU model loader that actually allocates layers based on the performance of each GPU without overloading them (rough idea sketched below)
- Vulkan that works, with better prompt processing and no Windows memory-allocation issues on Strix Halo
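The splitting idea is roughly this (a hypothetical sketch of the allocation logic, not Goldkoron's actual code; speed scores and VRAM figures are placeholders):

```python
# Hypothetical sketch: assign model layers to GPUs proportionally to a per-GPU
# speed score, but never beyond what fits in that GPU's free VRAM.
def split_layers(n_layers, layer_bytes, gpus):
    # gpus: [{"name": str, "speed": relative score, "free_vram": bytes}, ...]
    total_speed = sum(g["speed"] for g in gpus)
    plan, remaining = {}, n_layers
    for i, g in enumerate(gpus):
        want = round(n_layers * g["speed"] / total_speed)  # proportional share
        cap = int(g["free_vram"] // layer_bytes)           # hard VRAM limit
        take = min(want, cap, remaining)
        if i == len(gpus) - 1:                             # last GPU absorbs rounding leftovers
            take = min(remaining, cap)
        plan[g["name"]] = take
        remaining -= take
    return plan, remaining  # remaining > 0 -> those layers stay on the CPU

# Placeholder example: one fast 24 GB card next to a slower 12 GB card.
print(split_layers(
    n_layers=61, layer_bytes=0.6e9,
    gpus=[{"name": "cuda0", "speed": 1.0, "free_vram": 22e9},
          {"name": "cuda1", "speed": 0.45, "free_vram": 11e9}]))
# -> ({'cuda0': 36, 'cuda1': 18}, 7)
```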
cocoa_coffee_beans@reddit
Did you make a Reddit account just to bash llama.cpp?
iamzooook@reddit
Tell me this isn't real? That's so bad. Why are the maintainers merging such crap?
ttkciar@reddit
I think they overstate it. llama.cpp has been pretty stable for me, at least. Been using it since 2023.
iamzooook@reddit
So OP's statements are not real after all.
pmttyji@reddit
llama.cpp welcomes your pull requests. BTW, what inference engine are you using now?
ChildhoodActual4463@reddit (OP)
There are so many tickets that you can't even get help or a reply. Have you tried debugging GPU sync issues in Vulkan? Yeah, good luck.
I'm not saying anything else is better. That is not my point.
Ok_Warning2146@reddit
I think they should release a stable version once in a while.
ChildhoodActual4463@reddit (OP)
And that's the problem. They rush features and introduce bugs. If only they at least had a decent release process, but no, they ship a release every other commit, every day. You can't have stable software like that.
cosimoiaia@reddit
🤣🤣🤣
bernzyman@reddit
Chroma context-1 pricing
Powerful_Evening5495@reddit
i love it