I got tired of compiling llama.cpp on every Linux GPU

Posted by keypa_@reddit | LocalLLaMA | View on Reddit | 41 comments

Hello fellow AI users!

It's my first time posting on this sub. I wanted to share a small project I've been working on for a while that’s finally usable.

If you run llama.cpp across different machines and GPUs, you probably know the pain: recompiling every time for each GPU architecture, wasting 10–20 minutes on every setup.

Here's Llamaup (rustup reference :) )

It provides pre-built Linux CUDA binaries for llama.cpp, organized by GPU architecture so you can simply pull the right one for your machine.

I also added a few helper scripts to make things easier:

detect your GPU automatically
pull the latest compatible binary
install everything in seconds

Once installed, the usual tools are ready to use:

llama-cli
llama-server
llama-bench

No compilation required.

I also added llama-models, a small TUI that lets you browse and download GGUF models from Hugging Face directly from the terminal.

Downloaded models are stored locally and can be used immediately with llama-cli or llama-server.

I'd love feedback from people running multi-GPU setups or GPU fleets.

Ideas, improvements, or PRs are very welcome 🚀

GitHub:
https://github.com/keypaa/llamaup

DeepWiki docs:
https://deepwiki.com/keypaa/llamaup

[-]

ceene@reddit

Enter search query (or press Enter for popular models): -> Searching HuggingFace for GGUF models... Error: Invalid JSON response from HuggingFace API

[-]

keypa_@reddit (OP)

Should be fixed now! Let me know if it works on your side :)

[-]

ceene@reddit

However...

Select a model to download: 1) NousResearch/Hermes-3-Llama-3.1-70B-GGUF 2) NousResearch/Hermes-3-Llama-3.1-8B-GGUF

Enter number: 2 Model: NousResearch/Hermes-3-Llama-3.1-8B-GGUF [!] No GGUF files found in this model

[-]

keypa_@reddit (OP)

I will look into that to fix it asap !

[-]

This is actually super relatable. The “compile for every GPU” loop gets old fast, especially once you start juggling multiple machines or setups. We ran into something similar before and leaned more into speeding up the build layer itself (distributed builds, that kind of approach, tools like Incredibuild fall somewhere in that space). It helped a lot when recompiling across environments.

That said, prebuilt binaries like this are honestly a huge quality-of-life improvement. Curious how it holds up across less common GPU configs though.

[-]

keypa_@reddit (OP)

Thanks, really appreciate that 🙏

Yeah that loop gets painful fast once you start juggling machines 😅

Your approach with speeding up the build layer makes a lot of sense too — especially in more stable environments. llamaup is kind of the opposite direction: trying to skip the build entirely when possible.

Coverage on less common GPUs is still growing, but that’s definitely something I want to improve over time.

Out of curiosity, were you working with fairly fixed infra or more dynamic setups?

[-]

Even_Package_8573@reddit

Mostly fixed setups on our side, which is probably why focusing on the build layer worked pretty well. Once things got more dynamic it definitely started to break down a bit. Skipping the build entirely sounds really nice though, especially for spinning things up quickly.

[-]

Much-Farmer-2752@reddit

wasting 10–20 minutes on every setup.

What kind of lame CPU do you have?

[-]

Medium_Chemist_4032@reddit

I never really measured, but last time I compiled it felt like 10 minutes on a i9 10900k

[-]

ProfessionalSpend589@reddit

Your CPU is bad. My i7 8700 is slower than my i3-1315u :)

And measuring is easy with:

time make -j $(nproc)

I already have it ingrained as a habit. I do it for llama-bench to, because it measures how long the model was loading too (short and fast context will not take much time, so loading will dominate more).

[-]

Medium_Chemist_4032@reddit

Yeah, due to unrelated reasons, I replaced that rig with an old threadripper, so you might actually be right

[-]

Much-Farmer-2752@reddit

old threadripper

Bet it is WAY faster in building stuff.

[-]

Medium_Chemist_4032@reddit

Welp, nothing earth shattering really - still in 5m range:

Built llama-swap/llama-cpp:b8334 in 5m39s on the TR40 3970X (cmake build step was 324s / ~5.4 min). The compilation itself (-j$(nproc)) saturates all cores so this is a good benchmark of raw parallel compile speed.

For reference — the build breakdown:

apt-get + deps: ~14s
git clone (shallow): ~2s
cmake configure + compile: 324s
image export: 1.3s
Sources:

llama.cpp releases — b8334 released 2026-03-13

[-]

ProfessionalSpend589@reddit

Intel improved their architecture a lot a few years back.

I was impressed when I tested an Intel N100 (I like cheap celeron class processors) with 4 cores which in a ray tracing test was equal to my i7 8700 with 12 threads (6 cpus with hyper threading) with lower turbo clock. For a fraction of the power too. And it was a cheap Chinese mini pc - one of those cubes.

And of course, anything newer is a lot better.

[-]

Much-Farmer-2752@reddit

Just tried - 90 seconds on 9950X w/o SMT.
I have also a 64c EPYC, but it won't be a fair play :)

[-]

keypa_@reddit (OP)

Haha, not super ancient CPU.

I'm counting in the time when you hop between instances to build and compile everything. On my side most of the time i'm near 10 to 12 minutes but sometimes i'm getting closer to 20 minutes when I get a lower number of cores available for the instance.

[-]

chris_0611@reddit

Bruh, that's not normal. I compile llama-cpp with cuda in maybe a minute or so.

[-]

Lorian0x7@reddit

Why not just use Vulkan binary files? I'm using that and the speed seems to be the same in line with the expectations for cuda on my gpu.

[-]

keypa_@reddit (OP)

Yeah Vulkan works surprisingly well in a lot of cases I agree with that.

llamaup is mainly focused on CUDA setups because many people are running llama.cpp on NVIDIA GPUs still prefer CUDA for things like:

- Slightly better performance on some models

- Wider testing/usage in the CUDA backend

- Compatibity with existing CUDA-based workflow

So the goal wasn't to replace Vulkan build, just to make CUDA deployments on Linux easier when moving between machines or GPU architecture.

If Vulkan works well for your setup though, that's definitely a good option too.

[-]

jacek2023@reddit

install ccache and each build will be quick

[-]

keypa_@reddit (OP)

Yeah, cccache definitely helps for repeated builds 👍

llamaup is solving a slightly different problem though — it avoids building at all when you’re setting up a new machine or different GPU architecture. Instead it just detects the GPU and pulls a ready-to-run binary.

So if you’re hopping between machines or provisioning nodes, it becomes more of a pull workflow instead of compile (even if cached).

[-]

StardockEngineer@reddit

Why do any of that? Seems to make no difference

[-]

keypa_@reddit (OP)

Yeah, for a single machine or GPU type it probably doesn’t matter much.

Where llamaup helps is when you’re switching between multiple GPUs, machines, or new releases — instead of rebuilding for each SM version every time, it auto-detects the GPU and pulls the right binary.

[-]

StardockEngineer@reddit

I have like five different GPUs types across five machines.

What if I have a machine with multiple GPU types? Cause I have that, too

[-]

keypa_@reddit (OP)

Good question.

Right now the idea is per-machine deployment: the script detects the GPU architecture on that machine and pulls the matching build. That covers most setups where each node has a single GPU type.

If you have multiple GPU architectures in the same machine, you’d probably want either:

a multi-arch build (CMAKE_CUDA_ARCHITECTURES="...")
or separate binaries for each SM and run the appropriate one

llamaup is mainly trying to simplify the “new machine → run once → ready” workflow rather than every possible CUDA configuration.

That said, heterogeneous multi-GPU systems are interesting — I might add a mode that downloads multiple builds if multiple architectures are detected.

[-]

MelodicRecognition7@reddit

useless project, there are official precompiled ROCM and Vulkan builds by llama.cpp team which are more preferable than random binaries from unknown user, and people who have a Nvidia card could compile a CUDA build in just a few minutes, not 10-20.

[-]

keypa_@reddit (OP)

That’s a fair point.

Official llama.cpp releases do provide ROCm and Vulkan builds, and if you’re running on a single machine compiling for CUDA is definitely doable.

llamaup is mainly targeting a slightly different use case: Linux CUDA setups across multiple GPU architectures or machines where you end up rebuilding repeatedly for different SM versions.

The goal is just to turn that workflow into a quick detect + pull instead of rebuilding each time.

Also worth mentioning: everything is open source, and the build script used to produce the binaries is in the repo so people can reproduce the builds themselves.

If it doesn’t fit your workflow that’s totally fair — but it’s already saving some time for people hopping between different GPU machines 🙂

[-]

LoafyLemon@reddit

You've literally fallen from the sky to save me. I was just bitching about it yesterday. xD

[-]

keypa_@reddit (OP)

Haha perfect timing then! Glad it's usefu! That exact frustration is basically why I built it. Enjoy !

[-]

czktcx@reddit

Just specify multiple cuda architecture and build at once, why make things complex...

CMAKE_CUDA_ARCHITECTURES="75;86;89"

[-]

keypa_@reddit (OP)

Yep, that’s totally valid poitnt and works well if you know all the CUDA architectures in advance and don’t switch machines often.

llamaup mainly targets the situation where you’re hopping between multiple machines or GPUs, or dealing with new releases — you don’t have to remember all the SM numbers or rebuild. It just detects the GPU and pulls the right pre-built binary automatically, saving time and headaches.

[-]

czktcx@reddit

I do switch between different machines, but I just compile once for all (my) cuda archs on single machine, so I don't need to set up compile environment on every targets.

Binary stored on NAS and simply runs the binary(and even cuda runtime binary) from remote(use zfs snapshot if need versioning).

[-]

keypa_@reddit (OP)

Are you guys compiling on every machine or using some sort of shared build system?

[-]

ProfessionalSpend589@reddit

For my 2 Strix halo - I compile on one in a separate directory and then I just copy that directory.

For my intel pc I copy the source directory and do another compilation.

I don’t have the money for more computers, so that would be the most complex setup I’ll have for the next year or two. :)

[-]

Haeppchen2010@reddit

Check out ccache to speed up the C/C++ part of the rebuild.

[-]

keypa_@reddit (OP)

Will do thanks !

[-]

qwen_next_gguf_when@reddit

I build once and just package the build folder.

[-]

keypa_@reddit (OP)

Indeed, that works if you only deal with a single GPU type.

Are you mostly running on a single GPU type, or do you switch between multiple architectures?