Self-hosting LLaMA: What are your biggest pain points?
Posted by Sriyakee@reddit | LocalLLaMA | 85 comments
Hey fellow llama enthusiasts!
Setting aside compute, what have been the biggest issues that you guys have faced when trying to self-host models? E.g.:
- Running out of GPU memory or dealing with slow inference times
- Struggling to optimize model performance for specific use cases
- Privacy?
- Scaling models to handle high traffic or large datasets
Direct_Turn_1484@reddit
I need more fucking VRAM. But for less than $80k.
ttkciar@reddit
AMD MI60 gives you 32GB of VRAM for $450.
vibjelo@reddit
Pro 6000 is only ~$10K (YMMV), about 8 times cheaper :)
Direct_Turn_1484@reddit
Yeah, but then I gotta buy a machine that can handle plugging it in.
stoppableDissolution@reddit
That's another $1k or even less?
neoneye2@reddit
Need models that support structured output.
Qwen and Llama are good at it.
ttkciar@reddit
All models support structured output, if your inference stack supports Guided Generation (like llama.cpp's grammars).
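For anyone wanting to try it, here's a minimal sketch against a local llama-server (assuming it's listening on localhost:8080; the `json_schema` field is the one llama.cpp's server exposes for schema-constrained sampling, so adjust to your build and model):

```python
import json
import requests

# Hypothetical local llama-server instance; change host/port for your setup.
LLAMA_SERVER = "http://localhost:8080"

# JSON schema the output must conform to; the server compiles it to a grammar
# and constrains sampling so only schema-valid tokens can be produced.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

resp = requests.post(
    f"{LLAMA_SERVER}/completion",
    json={
        "prompt": "Extract the movie title and release year: 'Blade Runner came out in 1982.'\n",
        "json_schema": schema,  # guided generation: output is forced to match the schema
        "n_predict": 128,
    },
    timeout=120,
)
print(json.loads(resp.json()["content"]))
```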
Double_Cause4609@reddit
Ecosystem fragmentation.
LlamaCPP has a great feature set and compatibility...But isn't the fastest backend.
EXL3 has great speed and the best in class quality data format... But has limited model and hardware support.
vLLM is great and probably has the best speeds...But has asymmetric support (some things are supported on one model type but not another; AWQ quants are supported on CPU...But not for MoEs, etc), and doesn't support hybrid inference. And in the surrounding quant ecosystem it's hard to know which project is the right way to handle each quantization type. vLLM also has terrible samplers.
Aphrodite engine has great samplers...But doesn't have every feature vLLM does and doesn't support all the same models...But also has its own unique features that are super awesome, and is a crazy fast backend, still.
KTransformers is awesome...But some people report difficulties getting it running, and it could stand to borrow some tricks from AirLLM to work more like LlamaCPP's efficient use of Mmap() for dealing with beyond-system-memory loadable models.
Sparse Transformers and Powerinfer are great projects, but don't have an OpenAI endpoint server to call. Otherwise they'd be a great way to improve what's available to end users on consumer hardware, possibly making 70B reasonably accessible.
Tbh, to me, it feels weird that so many different backends are being maintained. They're all one or two features different from one another, or are all maintaining a lot of the same code but for different file formats or quantization types.
It'd be really cool if there was a unified quantization format that everybody agreed to support in the lower bit widths (perhaps ParetoQ?) so that there was a common target, and everyone could target it with their own quantization logic, be it QAT or closed form solutions (like EXL3 or HQQ).
I also think the next major frontier is probably sparsity. We're starting to see sparse operations on CPU and GPU, and projects that only need to load the active parameters instead of all the parameters in a layer, meaning those parameters can be streamed from storage instead of memory, decreasing total memory requirements and execution time (see: "LLM in a Flash"). It'd be nice to see more unified support for that, though we are starting to get some. I think it'll result in a really big split between CPU and GPU backends, though, because the strategies optimal for one won't be optimal for the other.
MaverickSaaSFounder@reddit
I guess most of this is largely resolved if you use an end-to-end model orchestration platform like a simplismart.ai or a modal.com
Double_Cause4609@reddit
...How...?
They don't really "solve" it; they hide the fragmentation behind a curtain and you just trust that you're getting the best possible results.
I guarantee they don't have some spare fork of vLLM with EXL3 support, or support for sparsity (that's not already in vLLM) or anything else.
They're a money pit for people who don't know how to deploy models.
MaverickSaaSFounder@reddit
Based on what the Simplismart guys mentioned in their NVIDIA GTC talk, they have done a ton of optimisations on the app serving layer, model-chip interaction, and several model compilation/caching/kernel-usage type things by making components a lot more modular. Not sure about Modal.
So it is obviously not about hiding behind a curtain; MLEs are not fools at the end of the day.
Double_Cause4609@reddit
Yes, but that doesn't solve the problem I have.
Here is my problem:
I have a given amount of compute, bandwidth and memory.
I want to get the best results out of that for personal hobbyist use.
I have a limited budget, most of which has been allocated for the long term in the form of hardware which I own.
The rest is set aside for fine-tuning in the cloud (i.e. for QAT models to get better value out of my local hardware).
Nowhere in there is room for a closed-source piece of software (with an "ask us" license, which typically means beyond consumer reach) that doesn't integrate (to my knowledge) with backends I'm familiar with and that fulfill my needs in terms of user-facing features (such as samplers, model selection, etc), and also isn't necessarily guaranteed to give me the best value for *all* of my available resources (which includes CPU compute and system memory, or perhaps even the best quality per unit of VRAM), because they don't necessarily support small, open source quantization formats that offer the best possible value per gigabyte of VRAM (like EXL3).
They also likely do not support many of the bespoke optimizations and tensor layouts I use for large MoE models doing hybrid inference fully utilizing my CPU and GPU together in LlamaCPP and KTransformers.
I'm not sure why you're even bringing it up. It's not relevant to r/LocalLLaMA
It really only matters (maybe) for major enterprise use.
vibjelo@reddit
I think it's way too early for this. I've been experimenting with them for years at this point, but we're still making huge strides in improvements from time to time, and ossifying the stack at this point would make it harder for those new ideas to penetrate the ecosystem effectively.
Generally, having basically been brought up by FOSS development, I feel like the plurality of available choices and people experimenting in all sorts of directions is a good thing, as we're still in the exploration phase of what's actually possible and what isn't.
Generally when exploring a space like that, you want to fan out in various directions before you start to "fan in" again to consolidate the ideas. I think that is what's happening right now too, and I'm not sure it's a bad thing.
Double_Cause4609@reddit
I mean, ParetoQ has basically established the design space for traditional quantization formats, and as far as we can tell, they've hit the limit of what you can do with a given BPW in a format that's still familiar to how we've handled quantization up until now.
Granted, that's exclusively for QAT...
...But, I know about a few rumblings going on in the background, and QAT's going to be a lot more accessible by the end of the year. And also, I think that EXL3 actually performs more like QAT than traditional quantization algorithms. There's still some room to go yet, but Turboderp apparently has some ideas to close the gap further.
Anyway, I don't think that we should limit ourselves to a specific format, but as you get into smaller bit widths, there's really only so many ways to package weights, and the difference between the target formats between all the quantization techniques are really not that big.
Honestly, we may as well at least have one universal format, and let everyone bring their quantization algorithm to it. Particularly for super low bit widths (ie: Bitnet 1.58 would be fine, 2bit, 3bit, etc) one of them should just be supported by everybody, IMO.
I do think that there probably is still room to get more information in the same amount of data...But it's going to look really weird. It'd have to be something like a hierarchical blockwise quantization format (maybe a wavelet quantization...?) that used log(n) data or something to somehow encode each weight in less than 1 bit.
vibjelo@reddit
If I had a penny for every time I heard this in machine learning, or even computing in general. "This is probably the best it'll ever get" is repeated for everything, and every time they claim "But it really is true this time!" :)
The ones who live will see, I suppose. Regardless of whether the ecosystem becomes more consolidated or more spread out, there will be a bunch of interesting innovations. Let's hope we use them for good things at least :)
KnightCodin@reddit
Well said! While EXL3 is the new kid on the block, you can always use EXL2 - a very good balance of widespread support and speed. If you want to get your hands dirty and engineer true MPP (massively parallel processing using Torch MP or Ray), then you can have a real impact.
Southern_Ad7400@reddit
relevant xkcd
Marksta@reddit
Yea, you hit the nail on the head: a bunch of open source inference engines running in different directions by about 1 degree, remaining incredibly close to one another but just far enough off to be incompatible and totally different projects.
Ik_llama.cpp is probably the one that hurts the most, forking off of llama.cpp just slightly to get CPU performance boosts but losing all the GPU-related stuff and other niceties in mainline.
Double_Cause4609@reddit
ik_lcpp also has new quantization optimizations that appear to improve quality noticeably, as well, which makes it extremely unfortunate.
MDT-49@reddit
Yeah, this!
You already mentioned it to some degree, but the hardware optimizations/libraries from vendors are all over the place.
E.g. Ampere has a llama.cpp fork and quant that's optimized for their CPUs, but the llama.cpp version they use is from 2024 or so, which makes it impossible to run newer LLMs like Qwen3.
AMD has its ZenDNN library, but as far as I know, there isn't any support for llama.cpp, the (I think) de facto engine for running LLMs on CPU only. Although maybe it's possible to build llama.cpp with AOCL-BLAS.
Shout out to ARM though for their native Kleidi support in llama.cpp. I must admit that I haven't thoroughly researched how Intel is doing in this area.
Running a MoE as efficiently as possible on a hybrid GPU/CPU system, using one open/standardized platform supported by all hardware vendors, would be the dream.
Superb123_456@reddit
VRAM!
MoffKalast@reddit
Is it just me or does llama.cpp do this incredibly annoying thing where it only extends the context cache when it gets too small, even with no-mmap?
Oftentimes I can load a model at 32k like no problemo and then 10k actual tokens in I go out of memory like what the fuck. I wish there was a flag to just force-allocate the whole context buffer at the start.
stoppableDissolution@reddit
I believe it's not the KV cache, but the prompt processing cache. Try reducing the BLAS batch size.
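For reference, a rough sketch of where those knobs live if you load through the llama-cpp-python bindings (parameter names from those bindings; the equivalent llama-server flags are -c and -b):

```python
from llama_cpp import Llama

# Smaller n_batch shrinks the prompt-processing scratch buffers allocated on
# top of the KV cache, at the cost of slower prompt ingestion.
llm = Llama(
    model_path="./model.gguf",  # hypothetical path to your GGUF file
    n_ctx=32768,                # context window you actually intend to use
    n_batch=256,                # prompt-processing ("BLAS") batch size
    n_gpu_layers=-1,            # offload all layers that fit to the GPU
)
```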
MoffKalast@reddit
That does help a bit, but going under 256 gives pretty slow results and it still grows out of proportion, just slower.
nomorebuttsplz@reddit
With an M3 Ultra, I have plenty of memory; the bottleneck is prompt processing, and waiting for LM Studio to update its engines to improve it.
Kuane@reddit
What's the alternative to using LM Studio? I am on an M3 Ultra too.
No-Consequence-1779@reddit
LM Studio has an API using OpenAI standards. Make your own interface if you need something specific.
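Something like this, assuming LM Studio's local server is running on its default port 1234 (the model name is just a placeholder for whatever you have loaded):

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key is ignored but required by the client.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="local-model",  # routed to whatever model is currently loaded
    messages=[{"role": "user", "content": "Why is VRAM usually the bottleneck?"}],
)
print(resp.choices[0].message.content)
```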
vibjelo@reddit
Having a different UI/interface won't affect how quickly/slowly LM Studio (llama.cpp actually) processes the prompts...
No-Consequence-1779@reddit
Actually, the various LLM servers and their interfaces can matter quite a bit.
Ollama, LM Studio, vLLM, and others can use very different processing and rendering methods.
For LM Studio, the integrated GUI is a single thread, so it commonly becomes a bottleneck.
vibjelo@reddit
It's the backend/runner/inference server that matters, not the UI/interface... LM Studio uses llama.cpp for inference.
No-Consequence-1779@reddit
Well, throw some numbers up if it's that important to you. You totally do not understand what single thread means, so a technical discussion is pointless.
vibjelo@reddit
lol, I do, but you think an LLM chat UI needs to be multi-threaded, so I equally feel like continuing the discussion would be pointless.
No-Consequence-1779@reddit
Probably someone with GUI programming experience would immediately see this. Discussing GUI programming 101 with a layman is pointless. This is why async is the standard for all blocking calls. It is the most basic standard.
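For the curious, the async point as a toy sketch (names are made up for illustration, nothing LM Studio specific):

```python
import asyncio
import time

def blocking_inference(prompt: str) -> str:
    """Stand-in for a slow, blocking model call."""
    time.sleep(2)
    return f"response to: {prompt}"

async def main() -> None:
    # Awaiting the call in a worker thread keeps the single-threaded
    # event loop (or GUI thread) responsive instead of freezing for 2 seconds.
    result = await asyncio.to_thread(blocking_inference, "hello")
    print(result)

asyncio.run(main())
```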
nomorebuttsplz@reddit
Not sure... you can use something like MLX directly in a CLI but I like the convenience of LM studio.
mxmumtuna@reddit
I’m an Apple fan, but this particular problem is out of its wheelhouse, which is to say, GPU compute. It just doesn’t compare to workstation or even desktop-class compute. Bandwidth is 👍👍 though.
nomorebuttsplz@reddit
Perhaps someday there will be a bridge between them like this: https://www.reddit.com/r/LocalLLaMA/comments/1kj7l8p/amd_egpu_over_usb3_for_apple_silicon_by_tiny_corp/
Crinkez@reddit
An all-in-one, GUI-only app that doesn't have anything to do with Python, has a simple .exe to install it, and can do everything, without requiring APIs to other local apps just to get things done. Oh, and it should be open source as well.
Eisenstein@reddit
Looking for a new app idea? I would be less cynical if all of your posts weren't self-promotion.
Red_Redditor_Reddit@reddit
Memory. I think that's everyone.
mxmumtuna@reddit
Particularly VRAM.
IrisColt@reddit
Particularly the VRAM that is inside the dedicated GPU. 🤣
lolzinventor@reddit
Particularly VRAM on the GPU that's local to the node with the PCI bus transferring the data.
giantsparklerobot@reddit
The poison, the poison for Kuzco, that poison.
Wait sorry wrong line, I'll come in again.
needCUDA@reddit
I have 5x GPUs spread across 3x servers. I want something like Ollama in Docker/Unraid format that will easily connect to the other containers to use all the VRAM to do stuff.
Guinness@reddit
The latency would be atrocious. You’d need some sort of special pcie switch between servers.
Good-Coconut3907@reddit
Maybe for large models at runtime. But if you batch, you gain back a huge amount of the performance lost by distributing. I know, as I run them frequently with https://github.com/kalavai-net/kalavai-client
vibjelo@reddit
Also, MoE models should potentially be less affected by inter-device bandwidth/latency too, since only part of the weights needs to be activated
No-Consequence-1779@reddit
Nvidia has this … hehe
Good-Coconut3907@reddit
Might help: https://github.com/kalavai-net/kalavai-client
MDSExpro@reddit
LocalAI can do that
Marksta@reddit
Already exists: spin up a container with GPUStack in it and you're good to go. It uses llama.cpp and its RPC as the backend, and they've also added some initial vLLM support, but I haven't tried that. It works pretty alright, but the latency impacts tokens/s over Ethernet, or at least 1Gb/s Ethernet. I haven't tried it yet with something like 25Gbps.
Open-Question-3733@reddit
You mean like https://github.com/gpustack/gpustack ?
vibjelo@reddit
In what way could privacy potentially be an issue with self hosted models? You mean people might explicitly want less privacy?
evilbarron2@reddit
Some standardized way to actually test models for compatibility with features. Every model seems to have its own interpretation of tool use and when to use tools.
Good-Coconut3907@reddit
This is huge in my view. I'm working on an automated way to do model benchmarking with custom datasets (somewhat similar to what you mention). Input: list of models + custom dataset; output = performance leaderboard for your business case.
I wonder if this is of interest to anyone
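The rough shape I have in mind, as a sketch (assuming all candidate models sit behind one OpenAI-compatible endpoint; the model names, URL, and scoring metric are placeholders):

```python
from openai import OpenAI

# One OpenAI-compatible endpoint serving every candidate model (assumed setup).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

models = ["qwen3-32b", "gemma-3-27b"]          # hypothetical model names
dataset = [                                     # custom eval set: prompt + expected answer
    {"prompt": "2 + 2 = ?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def score(answer: str, expected: str) -> float:
    """Toy exact-substring metric; a real benchmark would use a task-specific scorer."""
    return 1.0 if expected.lower() in answer.lower() else 0.0

leaderboard = []
for model in models:
    total = 0.0
    for row in dataset:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": row["prompt"]}],
        )
        total += score(resp.choices[0].message.content, row["expected"])
    leaderboard.append((model, total / len(dataset)))

# Print the leaderboard, best-scoring model first.
for model, acc in sorted(leaderboard, key=lambda x: x[1], reverse=True):
    print(f"{model}: {acc:.2%}")
```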
evilbarron2@reddit
Well you know you got my vote
ShittyExchangeAdmin@reddit
Mix of lacking VRAM and also lacking any further expansion in my server for additional GPUs. Also being GPU poor.
FinancialMechanic853@reddit
M&M = Memory and money
Fresh_Finance9065@reddit
Buying AMD, or anyone besides Nvidia.
Sriyakee@reddit (OP)
What do you mean by "buying AMD", do you mean running these models on AMD devices?
Fresh_Finance9065@reddit
AMD GPUs only get 1, max 2 generations of support, exclusively for the x9xx and x8xx cards, ON LINUX.
You can use Vulkan with Windows, but you give up anywhere between 4-8x compute power compared to ROCm on Linux. That's assuming you are not memory-bandwidth bound, which you are, because AMD cards were designed for gaming, not AI.
You normally get half the memory bandwidth of Nvidia's counterpart, and thus half the speed of Nvidia, but with Infinity Cache. Infinity Cache does not help with AI inferencing at all.
Ninja_Weedle@reddit
AMD GPUs. CUDA is still king, and NVIDIA's the only way your CUDA stuff is guaranteed to work without a ton of hassle.
Steve_Streza@reddit
Getting full use out of my GPU because it's also my desktop GPU and therefore I'm constantly under-utilizing VRAM because the operating system is also using it.
Selphea@reddit
Many CPUs these days come with integrated graphics; just change where the monitor plugs in.
No-Consequence-1779@reddit
Get an 8GB card for OS stuff!
Amazing_Athlete_2265@reddit
Shit; my only card is an 8GB card.
roadwaywarrior@reddit
Have 4 A6000s on an H12DSi with 512 GB and 192 threads lolololol
Power bill is my problem
simon_zzz@reddit
Not enough VRAM for more context. But, in general, many local LLMs seem to struggle with the task as context gets really big.
Local models are also very unreliable with tool-calling and following specific instructions.
This has been my experience with models such as Gemma3:27b and Qwen3:32b.
MengerianMango@reddit
Was asking Qwen3:14b a technical question earlier
No-Consequence-1779@reddit
Did you try X ?
MengerianMango@reddit
Lol, once but not twice
mxmumtuna@reddit
Try a third
RottenPingu1@reddit
Trying to deal with all the things that go wrong on a given day with Open WebUI. I get so fed up sometimes I feel like bailing to LM Studio.
entsnack@reddit
The vLLM backend support in TRL/Transformers is still buggy. So I'm stuck with slow inference during my reinforcement fine-tuning runs.
PassengerPigeon343@reddit
Very specific but I use OpenWebUI with llamaswap and randomly, usually if not used for a day or two, when I send in a query the model fails to load. Sometimes it will just never load and sometimes it will default to CPU inference. Restarting the docker container fixes it 100% of the time. It’s probably something dumb, but I know it will be a struggle for me to figure it out so I haven’t dug too deeply into it yet. It’s the one thing though that has prevented me from really pushing it in my household because it is a little bit unreliable, so for now I’ve just been using it by myself until I can fix this issue.
sunshinecheung@reddit
Nvidia GPUs are too expensive.
No-Consequence-1779@reddit
Lack of 5090s. 3 3090s allow for higher quants, but until I learn how to get parallel widgets widgeting … same speed.
gitcommitshow@reddit
tokens/sec
Agreeable-Prompt-666@reddit
Speed
sub_RedditTor@reddit
Money ..
CV514@reddit
GPU prices.
You_Wen_AzzHu@reddit
vLLM /v1/completions randomly freezes.
Durian881@reddit
Prompt processing speed (that's due to me using Apple).
Zc5Gwu@reddit
Speed. You can either have smartness or speed but not both.
ExplanationEqual2539@reddit
Money