llama.cpp is the linux of llm
Posted by DevelopmentBorn3978@reddit | LocalLLaMA | View on Reddit | 96 comments
to put it simply, isn't it like that?
craftogrammer@reddit
kinda. everything in the space ends up being a wrapper on top of it anyway: Ollama, LM Studio, KoboldCpp are all llama.cpp under the hood
logseventyseven@reddit
I've always wondered, do the closed source AI labs like Anthropic use llama.cpp to run their models?
DevelopmentBorn3978@reddit (OP)
who knows?
VickWildman@reddit
Everyone does. They are not using llama.cpp.
Frosty-Whole-7752@reddit
not even for testing open weights ones?
MidAirRunner@reddit
VLLM
DevelopmentBorn3978@reddit (OP)
please explain
MidAirRunner@reddit
they use vllm
DevelopmentBorn3978@reddit (OP)
vllm is painful to run on strix halo
MaxKruse96@reddit
No, Anthropic uses AWS, which has its own stack and hardware under the hood. Google has their own TPU stack too. OpenAI rents/builds NVIDIA clusters, probably with SGLang or vLLM forks for optimized serving efficiency.
DataGOGO@reddit
Anyone running NVIDIA GPUs is running TRT or TRT LLM, not SGLang or vLLM
cosimoiaia@reddit
Not true, a lot of providers (especially in the private sector) use vLLM as TRT is much harder to manage at scale.
DataGOGO@reddit
Source please
KallistiTMP@reddit
It's really common in small-to-midscale startups, generally at scales under ~1k nodes or so. A lot of those smaller companies like it because it's pretty much a one-stop shop for both the engine and the inference serving stuff (like the KV prefix routing, request batching, spec decoding, etc), and researchers like how well it integrates with Hugging Face. TRT is definitely the more optimized engine in terms of pushing raw tokens, but you have to pair it with Triton and take the extra steps to convert the models and all that stuff. For big companies doing large-scale inference it's absolutely worth it, but for a lot of startups the infrastructure team is like, 1-3 engineers, and their researchers are too busy rushing features and model quality improvements to lock in strategic clients before they run out of VC runway.
Also if I'm not mistaken I believe a lot of the Chinese labs are predominantly SGLang based. Probably because of better NPU support or something like that, GPU export restrictions play a part I'm sure.
And of course Google is doing its own thing with JAX and all that, because they're opinionated enough to build their own chips and stacks tailored to them.
Source: Large scale ML infrastructure consultant in this field since before Mira's teams were experimenting with using that obscure new K8s thingy (you know, that one weird little Google project that's kinda like App Engine, the one for running stateless web apps in docker containers) instead of the more traditional Slurm HPC clusters, to train that funny little AI model that does a pretty good job of playing DOTA. On P100's, uphill both ways in the snow.
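The "KV prefix routing" mentioned above can be illustrated with a toy sketch: requests that share a prompt prefix (e.g. the same system prompt) reuse already-computed state instead of re-running the prefill. This is a conceptual illustration only, not vLLM's actual implementation; all names and numbers here are made up.

```python
# Toy illustration of prefix caching: requests sharing a prompt prefix
# reuse already-computed state instead of redoing the prefill work.
# Conceptual sketch only, not vLLM's actual implementation.

class PrefixCache:
    def __init__(self):
        self.cache = {(): 0}         # prompt prefix -> precomputed "KV state"
        self.token_computations = 0  # stand-in for per-token prefill cost

    def get_state(self, tokens):
        key = tuple(tokens)
        # Find the longest prefix we've already computed.
        end = len(key)
        while key[:end] not in self.cache:
            end -= 1
        state = self.cache[key[:end]]
        # Extend token by token, caching every intermediate prefix.
        for i in range(end, len(key)):
            self.token_computations += 1
            state = state + key[i]   # stand-in for one attention/prefill step
            self.cache[key[:i + 1]] = state
        return state

cache = PrefixCache()
system_prompt = [1, 2, 3]                  # shared prefix across requests
a = cache.get_state(system_prompt + [10])  # full prefill: 4 token steps
b = cache.get_state(system_prompt + [20])  # prefix hit: only 1 new step
```

With the shared prefix cached, the second request costs one stand-in prefill step instead of four, which is the whole point of routing requests with common prefixes to the same cache.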
DataGOGO@reddit
TRT LLM is also a one stop shop.
Never seen a production NVIDIA environment, small or large, run anything other than TRT / TRT LLM. No model conversion needed; any weights on HF are compatible.
I don't know anything about Chinese labs.
cosimoiaia@reddit
I've worked in the field since AI was called machine learning.
DataGOGO@reddit
Same here, never seen vLLM used in any real production.
cosimoiaia@reddit
And you represent everyone in the industry?
DataGOGO@reddit
No, nor did I claim to
Mr_Hyper_Focus@reddit
I mean, you implied it.
DataGOGO@reddit
Whatever you want to think
Mr_Hyper_Focus@reddit
Another comment contributing nothing after making a stink. Not even defending your point.
Weird behavior.
DataGOGO@reddit
I didn't make a stink, I stated a fact and moved on.
Woof9000@reddit
Yeah bro, vLLM wasn't really a big thing only a few years back, but it's more or less the industry standard now and has been for at least a couple of years. Perhaps excluding only a handful of major top dogs, who have the resources to develop and maintain their own stacks, and very niche, very specialized academic research groups and companies whose focus is on very specific ML tasks and not on serving LLMs.
DataGOGO@reddit
If you are running NVIDIA GPUs there is absolutely no reason not to run TRT LLM; it is so much faster and more efficient
Woof9000@reddit
For one, not everyone is running on NVIDIA hardware, but even on NVIDIA HW you'd normally still want at least the option of some flexibility, for potential future diversification, even if in reality you're never going to do that. But that's beside the point; the main difference between them is scale. vLLM kind of dwarfs TRT in the amount of development that goes into it, the feature list, compatibility, the speed of implementation and support for new features, and so on.
DataGOGO@reddit
Hence why I said "if you are running NVIDIA GPUs".
Woof9000@reddit
All the major inference engines/frameworks are nvidia-first, except perhaps for platform-agnostic llama.cpp, but that one is not competing in the enterprise space and large scale deployments.
Jumpy_Fuel_1060@reddit
I manage a pipeline on GCP that uses a mix of Nvidia A100s, Nvidia L4s and Google v5 TPUs. We use the LLMs for a mix of translation and advanced NER extraction. We use vLLM for consistency and TPU support.
Healthy-Nebula-3603@reddit
Current implementation?
Nope
cosimoiaia@reddit
Yeah, llama.cpp is getting a lot of AI-aided code, but very few OSS projects aren't. Your point doesn't really make sense, though, since the improvements you're talking about are in the models, not in the inference engine.
Healthy-Nebula-3603@reddit
Few?
Text, multimodal, audio, etc.
It handles almost all models up to 1T parameters, except closed-source ones.
inky_wolf@reddit
Source?
DataGOGO@reddit
production environments everywhere.
SlaveZelda@reddit
I know some labs use vllm to run their models but pretty sure noone is using llamacpp.
llama.cpp saves so many resources and is optimised for all kinds of hardware (no other inference engine supports Vulkan, let alone obscure CPU architectures). However, labs want to iterate over their stuff easily, preferably in Python, and they want to serve at scale, hence the gravitation towards SGLang or vLLM.
DataGOGO@reddit
No, no one serious uses llama.cpp, it is only used by local llm hobbyists, not anyone doing any real production.
Most real production is done entirely on the nvidia stack (TRT / TRT LLM)
s101c@reddit
Providers have to serve thousands of users at the same time. Llama.cpp isn't really efficient or suitable for this kind of usecase.
stddealer@reddit
They most likely use something like VLLM, it scales much better when serving many users.
guiopen@reddit
They do not use vllm. They are mostly for single user
Pretend-Pangolin-846@reddit
They who? All the above examples use vLLM. vLLM is for high server-grade throughput. llama.cpp is the king of edge-device inference. Minimum overhead, maximum performance.
stddealer@reddit
They the remote inference providers, or anyone who has to process requests for more than a couple of users/agents at once.
DataGOGO@reddit
TRT LLM
idiotiesystemique@reddit
The wrapper is still useful. The most popular Linux distros are not the most powerful ones, they're the less pain in the ass ones.
false79@reddit
As someone who uses linux, windows, and MacOS, this comparison doesn't make any sense at all.
DevelopmentBorn3978@reddit (OP)
I've used Linux exclusively since 1995, other than commercial Unixes
false79@reddit
But how do you even draw the conclusion between a complete operating system versus an LLM inference system?
DevelopmentBorn3978@reddit (OP)
of course it's not because of the specific technicalities that distinguish one of these projects from the other. It's more because both were forged by hobbyists (with a strong understanding of the field, of course), meant to be used by hobbyists too, and took the hobbyist community by storm, which has since started talking to LLMs the way it earlier started using open source. It's about a development model targeting openness and personal use first, becoming what both have become by being battle-tested almost in real time in the field by a large multitude of heterogeneous hardware, software, and intents
Randomdotmath@reddit
just no, why are you trying this?
false79@reddit
...it's because I know the difference between the two?
sob727@reddit
This proves that using Linux for a long time doesn't mean you're able to draw comparisons that make sense.
Also a Linux user since the 90s, and llama.cpp/vllm user.
DinoAmino@reddit
The upvotes don't make sense. The comments are full of hyperbole from people who don't know anything other than GGUF, and not much about that either. This sub is becoming more and more low-tech every day.
YOU_WONT_LIKE_IT@reddit
Just think Reddit is used for training data.
DinoAmino@reddit
We are doomed and I don't like it. Lol. Username checks.
false79@reddit
Agreed. It is a good thing that more and more people know about the tech though.
But misinformation, that's where I draw my line in the sand.
KallistiTMP@reddit
vLLM is the Linux of LLMs. llama.cpp is the FreeBSD, with a strong niche hobbyist following.
Don't get me wrong, it's a great ecosystem, but the reason it's popular is that it's heavily optimized for the GPU-poor.
It's great when it comes to duct-taping together 2 P40s, an old rusty RTX 3060, and 256 GB of mixed DDR4 RAM you managed to scavenge off old gaming desktops, in Q3 quant with layers split across all the various components of your franken-server. It does that sort of thing way better than vLLM does.
What it doesn't do very well is any sort of scaled deployments. vLLM is king there, and what most commercial deployments run on. They're both good in their respective domains, but you will not typically find anyone using Llama.cpp outside of hobbyist circles.
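The franken-server layer splitting described above is, at its core, a packing problem: fill each device's VRAM with as many layers as fit and spill the rest to CPU RAM. A toy sketch of the planning step, in the spirit of llama.cpp's `--n-gpu-layers`/`--tensor-split` options (the real scheduler is far more involved; the sizes and device names here are made up):

```python
# Toy sketch of planning a layer split across mismatched devices.
# Illustrative only: real llama.cpp accounts for KV cache, context
# size, per-backend overhead, etc. All numbers here are made up.

def plan_split(n_layers, layer_gib, devices):
    """Greedily assign contiguous layers to each device's free VRAM;
    whatever doesn't fit spills to CPU RAM."""
    plan = {name: 0 for name, _ in devices}
    assigned = 0
    for name, free_gib in devices:
        # keep adding layers while the next one still fits
        while assigned < n_layers and (plan[name] + 1) * layer_gib <= free_gib:
            plan[name] += 1
            assigned += 1
    plan["cpu"] = n_layers - assigned  # remainder runs on CPU
    return plan

# 80 layers of ~0.9 GiB each across a P40 (24 GiB) and a 3060 (12 GiB)
plan = plan_split(80, 0.9, [("p40", 24.0), ("rtx3060", 12.0)])
```

The greedy fill puts 26 layers on the P40 and 13 on the 3060, leaving 41 on CPU, which is exactly the kind of lopsided split that runs fine under llama.cpp and not at all under engines that assume uniform GPUs.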
DevelopmentBorn3978@reddit (OP)
it looks to me like it's the other way around: llama.cpp -> Linux, vLLM -> BSD. Anyway, we're living in the early days of this next exciting revolution, thanks to the efforts of those bright minds
exceptioncause@reddit
nope, llama.cpp is too straightforward to be a linux, vllm is a proper linux madness
ImpressiveSuperfluit@reddit
Tell that to the unimaginable horror show that was me trying to get it to run on Fedora. I have so much ptsd that I don't even remember the details anymore, but I ended up needing to crawl through the deepest of archives to find a fix for a specific cuda file. The fix wasn't a particularly big deal, finding it was a nightmare, though. And that's all after I already had to fight myself through a bazillion weird version mismatches. Only to then just throw it all away the second it finally booted, because the thought of dealing with start parameters now just gave me an aneurysm. Yea, it beat lmstudio by some 15% or so but fuuuuuuuck all of that.
the__storm@reddit
Now do ROCm.
My_Unbiased_Opinion@reddit
i agree here. I want to use vLLM but I'm way too casual for it.
Infninfn@reddit
Llama.cpp is the Ubuntu to vllm’s Red Hat Enterprise Linux.
Ok_Warning2146@reddit
llama.cpp particularly shines on edge devices using non-CUDA backends, as there are no viable competitors there. For CUDA, there are many competitors.
caetydid@reddit
more like the debian of llm
LinkSea8324@reddit
No, it's windows on the contrary.
Much easier to set up, used as a desktop by everyday casual users.
But when you go pro (server) you actually use VLLM.
Pristine_Pick823@reddit
In that rationale, is Ollama Ubuntu?
srigi@reddit
Ollama is Windows 11. Essentially closed software with proprietary data files, not utilizing established open formats, and it breaks with every update.
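For contrast with the complaint above, GGUF (the format llama.cpp popularized) is a documented open format. A minimal sketch of its file header per the public GGUF spec: a `GGUF` magic, a version, a tensor count, and a metadata key/value count, all little-endian (the example values below are made up):

```python
# Minimal round-trip of a GGUF file header, per the public GGUF spec:
# magic "GGUF", version (uint32), tensor count (uint64),
# metadata key/value count (uint64), all little-endian.
import struct

GGUF_MAGIC = b"GGUF"

def write_header(version, n_tensors, n_kv):
    # 4-byte magic followed by <uint32, uint64, uint64>
    return GGUF_MAGIC + struct.pack("<IQQ", version, n_tensors, n_kv)

def read_header(blob):
    if blob[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", blob, 4)
    return version, n_tensors, n_kv

blob = write_header(3, 291, 24)  # hypothetical: version 3, 291 tensors
```

Anything that can read 24 bytes can tell what's in a GGUF file, which is rather the opposite of a proprietary blob.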
mtmttuan@reddit
Doubt llama.cpp is even remotely comparable to vLLM, SGLang, ... (actual deployment engines for serving at scale). llama.cpp and the whole GGUF ecosystem are pretty much a spin-off from PyTorch for individual local hosting.
If anything, PyTorch is the Linux of pretty much everything deep learning, including LLMs.
maycomesinlikealion@reddit
Okay. You forced me. I’ll teach.
Before we begin, I strongly rebuke you for your condescending attitude, and even worse you are right in all the ways that matter none and wrong in all the ways that matter some.
Let’s get a few things cleared up
PyTorch is a library of source code. You can install it on your computer to use when you write and run code. ONE, of MANY POSSIBLE REASONS A DEVELOPER MIGHT USE PYTORCH, is for training a neural network, for example. Not only does PyTorch support robust primitives for the entire ML engineering, ops, and research communities (indeed having pioneered all 3 to some degree; a leading public face on AI at Meta for a long time was Soumith Chintala [other than Yann of course; neither is at Meta anymore], the creator of PyTorch), PyTorch as we all know beautifully and automatically manages industry-ready settings for your personal machine so that you have a better experience overall, and they implement custom data structures specifically for data science.
Why? Because, the same reason why this commenter is an idiot: because computer memory is allocated before the program runs, meaning structuring your training dataset, for example, by stripping out all the empty rows, saves hours on even consumer-grade runs. It's like the same thing from vanilla computer science where allocating seemingly unrelated data to a variable before or after a recursive call can increase the upper bound on its worst case by 20 or 200x, because efficiency is never about the theoretical first try but always how you compound operations through working hard. Or if you want to be like u/mtmttuan, hardly working, apparently.
So putting it all together, the CPU engaged in its familiar fetch-decode-execute LOOP has a bunch more possible instructions which, if executed, transform your unstructured user data into structured and addressed data, which makes your computer work better for you. u/DevelopmentBorn3978 OP should read this too to understand why this is fucking wrong as fuck. and this guy is a prick to boot. unbelievable.
Meanwhile, llama.cpp is a program that implements source code and commands CPU + memory on the host device as an OS-based operation.
Only one of these objects holds live state in ALL runtimes it participates in. It's definitely not PyTorch. You're just a prick.
Llama.cpp might not be the Linux, per se, BUT? It’s a hell of a lot more functional in an OS context than fucking PyTorch. Don’t make me fucking laugh. Someone that actually knows computers is your worst nightmare. Think about that next time you say something so arrogant in nature AND disrespectful to its intended audience. Wow. Good for you man.
MuDotGen@reddit
I've only ever used Ollama or llama.cpp. I'm open to trying anything if it means potential gains. I've heard about vLLM a lot and saw SGLang too, but I'm not familiar at all yet. Why do you like them? Do they have better GPU backend support? What's the best use case over llama.cpp?
screenslaver5963@reddit
vLLM is better if you're serving multiple requests at once. If you're running a model or two locally and just chatting with it by yourself or a couple people on the network, llama.cpp is fine. If you're running a server for a lot of people or running multiple agents simultaneously, then vLLM is better.
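A toy model of why batched serving wins with many concurrent users: a decode step is roughly memory-bound, so advancing many sequences in one step costs about the same as advancing one. Under that (deliberately simplified) assumption:

```python
# Toy model of single-user vs. batched serving cost. Simplifying
# assumption: one decode step costs the same whether it advances one
# sequence or many. All request lengths below are illustrative.

def steps_sequential(request_lengths):
    # one user at a time: every token of every request is its own step
    return sum(request_lengths)

def steps_batched(request_lengths):
    # continuous batching: each step advances all unfinished requests,
    # so total steps are bounded by the longest request
    return max(request_lengths)

reqs = [120, 80, 200, 50]      # tokens to generate per request
seq = steps_sequential(reqs)   # 450 steps
bat = steps_batched(reqs)      # 200 steps
```

Four concurrent requests cost 450 step-equivalents served one by one, but only 200 batched, which is the gap that grows with user count and makes vLLM-style engines the choice for serving.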
Pablo_Offline_AI@reddit
Absotutely
Worried-Squirrel2023@reddit
more like the kernel. Ollama is the Ubuntu, LM Studio is the elementary OS, KoboldCpp is Arch.
mecshades@reddit
I agree with the Ubuntu analogy. Ollama does its best to try and hide that it's llama.cpp underneath. Others proudly rep it. vLLM is like llama.cpp but not, so I'm going to say that vLLM is more akin to RHEL (as it relates to scalability and enterprise) or maybe even some kind of BSD (as it's just a very different thing).
combrade@reddit
That’s unfair even to Ubuntu. Ollama doesn't have basic features like offloading experts for MoE models, which even LM Studio has had since last year. They're regularly six months to a year behind on model support. Ubuntu is buggy and has extra fluff, but feature-wise they have everything. If anything, CUDA support is way better on Ubuntu.
Pretend-Pangolin-846@reddit
vLLM is powering the AI data center giants, whereas llama.cpp is powering our local AI models.
Not really a good comparison; if anything, llama.cpp is the Windows of LLM and vLLM is the Linux.
cosimoiaia@reddit
Not really, no. You can make some parallels regarding the origins, as Linux was simply an open-source clone of Unix written from scratch to run on x86 hardware, and llama.cpp is a version of an inference engine made to run on the first Apple silicon. But the similarities end there. Linux almost instantly became the backbone of the internet (thanks to the combined efficiency of Apache). llama.cpp is what you use when you want to run the latest model/architecture in the fastest way on a combination of consumer hardware. So they are kind of opposites in this way. Still, llama.cpp is the OG and made it possible for a lot of people to run LLMs they would otherwise only dream of. Also, llama.cpp is technically easier to run than its equivalents TRT, SGLang, or vLLM, while Linux is harder to manage than macOS or Windows.
razorree@reddit
no. llama.cpp is just an engine, and there are a lot of open-source engines and compute libraries (are all of them open?): Transformers, PyTorch, NumPy, etc. so, no :)
CrispyBiscuitsAI@reddit
Linux is a kernel. llama.cpp is a virtualization runner.
throwaway275275275@reddit
More like VLC
PromptInjection_@reddit
For single user. For multi user/connections it's vLLM.
b3081a@reddit
It is. But being similar to Linux doesn't guarantee its eventual success. Early versions of Linux were just Linus' toy until Intel, Android, and lots of Internet tech giants started contributing to the ecosystem. If llama.cpp can fit into some future form of business model, it will likely end up similarly.
El_Mudros@reddit
Absolute madness of a comment there. Linux was already a well-established, serious thing even way before the dotcom boom, that is, by the late 90s.
There is more to computing than desktop machines, and even back then on desktops Linux was very much a thing among people in the know, e.g. compsci students and other computer hobbyists, a good decade before anyone had ever heard of Android, say.
VickWildman@reddit
Nope.
Linux is the backbone of most servers and devices. Most people don't interact with it directly, they don't even know it exists, but they sure rely on it one way or another.
Llama.cpp is just something people run locally on their computers. It's not a part of any infrastructure, it's not used on devices, there are specialized runtimes for that. It's the most generic runtime, it supports all kinds of compute, graphics and web APIs. Only enthusiasts ever use it, a few hundred thousand people on the entire planet.
IORelay@reddit
Linux is the most popular consumer-grade OS, in the form of the true king of OSes: Android.
VickWildman@reddit
Yeah, but as I said most people don't interact with it directly; 9 out of 10 Android users wouldn't be able to tell it's there at all. Linux is invisible. Billions of people rely on it, and nobody has ever heard of it, unless you're into technology of course, which we all are.
DataGOGO@reddit
absolutely not.
IORelay@reddit
Well, Llama cpp users are not obnoxious.
Ok-Measurement-1575@reddit
It's more like the Windows of LLMs tbh?
Simple, fast, ubiquitous.
vLLM is more like the early days of Linux right now.
charmander_cha@reddit
Anything compared to Windows sounds like an insult
datbackup@reddit
if vLLM were written in a compiled language I would be more inclined to agree
It's 88% Python according to GitHub
I can't see llama.cpp as Windows no matter how hard I squint
What's Windows is chatgpt.com
To say there is a "Windows of self-hosted AI" is a contradiction in terms imo
Ok-Measurement-1575@reddit
If you have only 5 minutes to test your first LLM, I guarantee you are not seeing any text generation from vLLM, in the same way that if you only had 5 minutes to listen to an mp3 on Windows 98 or Slackware 4, I guarantee you ain't hearing any sound from Slackware in those 5 minutes.
Limp_Classroom_2645@reddit
Ollama is windows of llm
Ok-Measurement-1575@reddit
ok
andy2na@reddit
No, ollama is the windows of llm
Ok-Measurement-1575@reddit
Ollama was llama.cpp for 95% of its life.
DevelopmentBorn3978@reddit (OP)
I find it to be the real reason behind the massive growth of LLM users and claws, and also the base for the henceforth untakeable right to personal AI, as opposed to the mostly proprietary/cloud-only (business) models forced onto society