Cost Analysis of my $6.4k Local LLM Server

Posted by 1ncehost@reddit | LocalLLaMA | 62 comments

I haven't seen any of these done, so I just wanted to share my experience in case it is useful for anyone. The purpose of this post is to show the total cost of ownership (TCO) of my local LLM server versus the API equivalent. Before you look at the final numbers, note that most people do not do proper financial accounting of hardware. Most people treat hardware as a fully depreciated cost, when in fact hardware typically depreciates slowly or in some cases appreciates over time. This significantly changes the TCO results and explains why the number at the bottom is better than what other people mention.

Hardware

First off here are the shipped hardware prices:

Used 4x MI100 32GB: $4234.82
New ASRock EPYCD8-2T: $721.61
New 1600W 80+ Plat PSU: $497.95
Used 8x8GB DDR4 ECC RDIMMs: $348.79
Used Epyc 7k62 48 core CPU: $254.28
New CPU Cooler: $167.31
New ATX Case: $132.43
4x SATA to USB power cables for blowers: $28.56
4x 75x30mm Blowers for GPUs: $13.76
Plastic sheet for blower fab: $6.94
Storage is a 1TB M.2 drive I had lying around: Free

Total Price: $6406.45
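
For anyone who wants to check the total, here is a small Python sketch that re-adds the line items. The dictionary keys are informal labels made up for readability, not official part names.

```python
from decimal import Decimal

# Shipped prices copied from the list above. The keys are informal labels
# chosen for this sketch, not official product names.
HARDWARE = {
    "4x MI100 32GB (used)": "4234.82",
    "ASRock EPYCD8-2T (new)": "721.61",
    "1600W 80+ Plat PSU (new)": "497.95",
    "8x8GB DDR4 ECC RDIMMs (used)": "348.79",
    "Epyc 7k62 CPU (used)": "254.28",
    "CPU cooler (new)": "167.31",
    "ATX case (new)": "132.43",
    "SATA to USB power cables": "28.56",
    "75x30mm blowers": "13.76",
    "Plastic sheet": "6.94",
    "1TB M.2 drive (already owned)": "0.00",
}

# Decimal avoids float rounding noise when adding currency amounts.
total = sum(Decimal(price) for price in HARDWARE.values())
print(f"Total: ${total}")  # Total: $6406.45
```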

Configuration

The server is currently configured with four separate instances of llama.cpp running Qwen3.6 27B. It is running on Ubuntu with the latest ROCm. It has a low power profile on all components, and in its current workload it is able to process 20.4M input tokens and 1.32M output tokens per day. I do actually use all of this token capacity for a business process. The token output is lower than I expected, and I'll address that in the notes below.

Equivalent API Cost

Qwen3.6 27B currently costs $0.29/M input tok and $3.2/M output tok on OpenRouter. This means that its current processing is worth $5.92 input and $4.22 output per day, totalling $10.14 per day.

Expanding this to a year, API equivalent is $3701.10. Per month that's $308.43.

API Cost: $3701.1 per year
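
The same arithmetic as a short Python sketch, in case you want to plug in your own token volumes or prices. It assumes the daily load stays constant all year, which is the same simplification used above.

```python
# Daily token volume and OpenRouter prices quoted in the post.
INPUT_MTOK_PER_DAY = 20.4    # million input tokens per day
OUTPUT_MTOK_PER_DAY = 1.32   # million output tokens per day
INPUT_PRICE = 0.29           # USD per million input tokens
OUTPUT_PRICE = 3.20          # USD per million output tokens


def api_cost_per_day(in_mtok, out_mtok, in_price, out_price):
    """Return the pay-per-token cost of one day of traffic in USD."""
    return in_mtok * in_price + out_mtok * out_price


daily = api_cost_per_day(INPUT_MTOK_PER_DAY, OUTPUT_MTOK_PER_DAY,
                         INPUT_PRICE, OUTPUT_PRICE)
yearly = daily * 365   # assumes the same load every day of the year
monthly = yearly / 12

print(f"per day:   ${daily:.2f}")
print(f"per year:  ${yearly:.2f}")
print(f"per month: ${monthly:.2f}")
```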

Equivalent in Coding Plans

I thought I'd throw this in here because it's hard to quantify otherwise and might be useful. I also use the Z.AI coding plan as an API provider for this same business process. Because of that, I can measure how much they end up giving you in tokens and produce fairly comparable results. I have Z.AI's best plan, which is currently $144/mo, and it is allowing me about 4.5M input tokens and 200k output tokens of GLM 4.7 per day. GLM 4.7 is actually a less expensive model on OpenRouter than Qwen3.6 27B, believe it or not, and in many benchmarks they are comparable, so this is a fairer comparison than I'd have expected.

Normalizing this by input tokens (20.4M / 4.5M is about 4.53 plans' worth), it would cost about $652.8 per month for the same capacity via this plan, or $7833.60 per year. This is more than double the cost of the same amount of GLM 4.7 use via OpenRouter, or the API cost of Qwen3.6 27B.
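
A rough Python sketch of that normalization, using only the figures above. The $652.8 number falls out of scaling by input tokens; scaling by output tokens instead gives a larger figure, so treat the result as sensitive to which side of the traffic you normalize on.

```python
# Figures from the post.
PLAN_PRICE_PER_MONTH = 144.0        # USD
PLAN_INPUT_MTOK_PER_DAY = 4.5       # million input tokens the plan allows
PLAN_OUTPUT_MTOK_PER_DAY = 0.2      # million output tokens the plan allows
SERVER_INPUT_MTOK_PER_DAY = 20.4    # what the local server processes
SERVER_OUTPUT_MTOK_PER_DAY = 1.32


def plan_equivalent_per_month(server_mtok, plan_mtok, plan_price):
    """Scale the plan price by how many plans' worth of tokens are needed."""
    return server_mtok / plan_mtok * plan_price


by_input_month = plan_equivalent_per_month(
    SERVER_INPUT_MTOK_PER_DAY, PLAN_INPUT_MTOK_PER_DAY, PLAN_PRICE_PER_MONTH)
by_output_month = plan_equivalent_per_month(
    SERVER_OUTPUT_MTOK_PER_DAY, PLAN_OUTPUT_MTOK_PER_DAY, PLAN_PRICE_PER_MONTH)
by_input_year = by_input_month * 12

print(f"scaled by input tokens:  ${by_input_month:.2f}/mo, ${by_input_year:.2f}/yr")
print(f"scaled by output tokens: ${by_output_month:.2f}/mo")
```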

So, a word of caution: the coding plans aren't always a good value. Make sure you know what you're paying for. I actually paid much less for this plan when they were running specials at the start of the year, so it works out better for me, but I certainly won't renew my sub once the year expires.

Local LLM Costs

Electricity

I configured the server with low power profiles, so at full LLM load the whole server is consuming 630 watts at the wall. This translates to 15.1 kWh per day, and at $0.14 per kWh, that is $2.11 to run per day. $0.14 is a worst case for me, with actual cost being more like $0.08 including off hours and winter rates, but it's difficult to calculate an accurate estimate so I chose to keep it very conservative.

Expanding that higher rate to a year my Local LLM server costs $770.15 for elec.

Local LLM Cost: $770.15 per year or $64.18 per month
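
Here is the electricity math as a small Python sketch so you can swap in your own wattage and rate. It assumes a constant 24/7 draw. Without rounding the daily figure first, the yearly number comes out at about $772.63, a couple of dollars above the $770.15 used here, which multiplies the $2.11 daily figure by 365.

```python
def kwh_per_day(watts):
    """Energy used in one day at a constant draw, in kWh."""
    return watts * 24 / 1000


def yearly_cost(watts, usd_per_kwh, days=365):
    """Electricity cost in USD for a constant draw over the given days."""
    return kwh_per_day(watts) * usd_per_kwh * days


WALL_WATTS = 630  # measured at the wall under full LLM load (from the post)

print(f"{kwh_per_day(WALL_WATTS):.2f} kWh/day")
print(f"${yearly_cost(WALL_WATTS, 0.14):.2f}/yr at the conservative $0.14/kWh")
print(f"${yearly_cost(WALL_WATTS, 0.08):.2f}/yr at the more typical $0.08/kWh")
```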

Hardware Depreciation

Next, depreciation is an accounting term which represents how much something loses value over time. Cash accounting like most people are familiar with is not actually accurate because if you own an asset it still has value that can eventually be liquidated to recover part of its price. Depreciation shows you the cost of owning something over time in terms of how much you'd lose if you sold it at that time.

For the hardware, let's say all accessories fully depreciate (total loss), new components depreciate 50%, and used components depreciate 10%.

Accessories: $349 * 100% = $349
New components: $1219.56 * 50% = $609.78
Used components: $4837.89 * 10% = $483.79

I think it's reasonable to say this depreciation will be roughly the same one day after purchase or 5 years after purchase. So basically this is a one-time cost that only slightly increases over time.

Local LLM Cost: $1442.57 1-time
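
A minimal Python sketch of the depreciation estimate, so the rates are easy to change. The category subtotals come from the price list; the rates are the assumptions stated above, not measured resale values.

```python
# (purchase price in USD, assumed fraction of value lost) per category.
# Rates are the post's assumptions, not market data.
CATEGORIES = {
    "accessories": (349.00, 1.00),        # cooler, case, cables, blowers, plastic
    "new components": (1219.56, 0.50),    # motherboard and PSU
    "used components": (4837.89, 0.10),   # GPUs, RAM, CPU
}


def total_depreciation(categories):
    """Sum price * loss rate over all categories."""
    return sum(price * rate for price, rate in categories.values())


for name, (price, rate) in CATEGORIES.items():
    print(f"{name}: ${price * rate:.2f}")
print(f"total: ${total_depreciation(CATEGORIES):.2f}")
```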

Infrastructure

So that the server had reliable power that wasn't impacted by other devices in my house and could withstand startup surge power, I had a new dedicated circuit run to a new 20 amp breaker. This cost $780 for a pro to do. This isn't entirely necessary, but I felt like it was a good idea long term because the system is possibly capable of saturating a 15 amp circuit.

I already have a homelab with switch, router, and shelving, so this was free for me. I was able to keep power usage to a reasonable level so I don't need extra HVAC. System labor is free because I'm doing it and I enjoy working on computers.

Local LLM Cost: $780 1-time

Total Local LLM Cost & Savings

Adding all that up for my Local LLM setup, the first year's costs arrive at $2992.72. Once again, that is cost not cash outlay. API costs are $3701.1 per year, so this represents a first year savings of $708.38. For subsequent years the operating cost of the local LLM server is $770.15, representing $2930.95 savings assuming API costs stay the same (they will not, but this is for illustration purposes).

First year Local LLM Server cost: $2992.72
Subsequent year Local LLM Server cost: $770.15
API Cost: $3701.1

First year savings: $708.38
Subsequent year savings: $2930.95
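
To see how the savings accumulate, here is a short Python sketch built on the totals above. It assumes API prices and electricity stay flat and that depreciation stays a one-time charge, which are the same simplifications made in this post.

```python
# Totals from the sections above, in USD.
API_PER_YEAR = 3701.10
ELECTRICITY_PER_YEAR = 770.15
DEPRECIATION_ONE_TIME = 1442.57
CIRCUIT_ONE_TIME = 780.00


def local_cost(years):
    """Accounting cost (not cash outlay) of the local server after N years."""
    one_time = DEPRECIATION_ONE_TIME + CIRCUIT_ONE_TIME
    return one_time + years * ELECTRICITY_PER_YEAR


def cumulative_savings(years):
    """API spend avoided minus local cost, after N years."""
    return years * API_PER_YEAR - local_cost(years)


for years in (1, 2, 3):
    print(f"year {years}: local ${local_cost(years):.2f}, "
          f"saved ${cumulative_savings(years):.2f}")
```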

Notes

I mentioned that token output is lower than I expected. While I am running a low power profile on these cards, benchmarking showed that they are running at about 70% of the speed of full power. In other words, full power produces around 43% more tokens (1 / 0.7 is about 1.43). That is still under what I was expecting. I think it can generally be explained by the MI100 being a rare card that is poorly optimized for in all major LLM software. So even though the cards have pretty good raw specs, they're not delivering what I hoped for. I would say around double the performance is what I was hoping for, as that's the performance of my 7900 XTX, which has similar raw specs.

The main reason I got MI100s was because of their ability to use a 3-way interlink bridge. Unfortunately there is next to no documentation out there about these bridges, and I couldn't get it to work with my motherboard after spending days working on it, so I ultimately chose to return it. This was probably the largest disappointment because the interlinks would have been a big edge with mid-size models. As far as I can tell, though, the bridge requires a very specific PCIe architecture that only a set of supported motherboards from their deployed systems provide.

I would say if I were to do a do-over, I'd probably go with prosumer cards like the R9700 or a unified memory setup like a couple of DGX Sparks. I'd expect them just to be easier all around to work with and give me more options long term. I do have a Strix Halo laptop, and that type of device (including Sparks and Apples here) is ultimately an excellent option, especially for mid-size models that will hit PCIe in a GPU setup. If you are planning on going with a mid-size model, I'd strongly recommend stacking those types of devices instead of going the way I did, because they are quite fast once you start taking PCIe into account, and to top it off they also use very little power, which reduces your electricity bill meaningfully.

Hope this helped!

Cowboysfan2501@reddit

Regarding your experience with the infinity bridge, please share the part number, as I have had no problems with mine which is part number 102D3460400 000001

1ncehost@reddit (OP)

Thx for the note...What motherboard do you have? The part I used was 102D3460300.

Cowboysfan2501@reddit

I am using an Epyc 7742 in a Huananzhi H12D-8D.

1ncehost@reddit (OP)

Hmm, I'll take another look at this. I just noticed some possible issues comparing the PCIe architectures of both boards 🧐

AnticitizenPrime@reddit

If we're dismissing privacy as a factor (arguably the most important one), there are a lot of ways to heavily reduce or eliminate costs when using OpenRouter.

> Equivalent API Cost
>
> Qwen3.6 27B currently costs $0.29/M input tok and $3.2/M output tok on OpenRouter. This means that its current processing is worth $5.92 input and $4.22 output per day, totalling $10.14 per day.
>
> Expanding this to a year, API equivalent is $3701.10. Per month that's $308.43.
>
> API Cost: $3701.1 per year

So... about this bit. That's if you only used that model as your only model on Openrouter for the whole year.

I've been using Hermes Agent lately. One feature of Hermes is that you can assign a primary orchestrator model as the main model, and then there are a myriad of other auxiliary model roles that you can assign to other models (either local or API). Roles like, a dedicated vision model (for describing images or OCR), a text summarization model, a coding model, the model that does nothing but generate chat summary titles, etc.

And OR has a lot of free models, so many so that you can get by almost entirely for free. In fact, I'm using owl-alpha as my primary model right now (which is a 'stealth' model that won't stay free forever), so I haven't spent a dime in the past two weeks of pretty heavy usage (~900 million tokens in the past week). Here's my current array of models:

OpenRouter Models in Use

Model	Role(s)	Timeout	Notes
`openrouter/owl-alpha`	🧠 Primary (default)	—	Main agent model for all conversations
`google/gemma-4-26b-a4b-it:free`	👁️ Vision	120s	Image analysis (replaced Nemotron VL)
`nvidia/nemotron-3-super-120b-a12b:free`	📄 Web Extract	360s	Web page extraction
`nvidia/nemotron-3-super-120b-a12b:free`	🗜️ Compression	120s	Context compression / summarization
`nvidia/nemotron-3-super-120b-a12b:free`	🏛️ Curator	600s	Skill curation (longest timeout)
`openai/gpt-oss-120b:free`	✅ Approval	30s	Approval decisions
`openai/gpt-oss-120b:free`	📋 Triage / Specifier	120s	Task triage and specification
`openai/gpt-oss-120b:free`	🗂️ Kanban Decomposer	180s	Kanban task decomposition
`qwen/qwen3-coder:free`	🔧 Skills Hub	30s	Skills-related tasks
`qwen/qwen3-coder:free`	🔌 MCP	30s	MCP tool integration
`meta-llama/llama-3.2-3b-instruct:free`	🏷️ Title Generation	30s	Session/note title generation
`meta-llama/llama-3.2-3b-instruct:free`	👤 Profile Describer	60s	User/peer profile descriptions

I actually had Hermes itself analyze all the free models on OR and make its own judgments about which ones to use based on the model capabilities, speed, benchmarks, etc. The only thing I changed was its choice of vision model: it picked a Nemotron VL 12b model that tended to hallucinate facts about images, so I manually changed that one to Gemma 4.

Obviously we self-host for reasons of privacy, etc. I just wanted to point out that the math isn't so straightforward as calculating the use of a single model, when by specializing which models are used for what, you can cut down on token usage and costs immensely.

I suppose that's true of self-hosting, as well. Instead of running that single Qwen model for everything, you might consider using smaller dedicated models for certain tasks - if you can run them alongside your main model, you might get lower latency overall and less power draw. In fact, it might even make sense to have a second machine with much lower power draw to host small auxiliary models, like Gemma4 4B model for summarizing text, translation, and vision.

Or you could do a hybrid setup, with your main model being local, but leveraging free models for context compression, vision, or web page extraction, etc.

I'm just saying using API could be a LOT cheaper.

The best reason for self-hosting is still privacy, IMO.

Qwen30bEnjoyer@reddit

That free API is rate limited, and to my memory saves your responses.

The comparison I would make would be a cheap Chinese coding plan, since they sometimes offer "decent" privacy terms like processing on European or Singapore servers, generous limits, and models that are more usable for less structured coding tasks.

I self-host sometimes, but since I'm just a student right now, I can't justify investing in running a model bigger than Qwen 3.6 27b or 35b a3b when I can currently pay $10 a month for effectively unlimited MiniMax M2.7 for agentic workloads.

Also I know this is a shitty take to read on a subreddit all about self-hosting, but I think for chatbot uses where you have few tokens in, few tokens out, a $20 Chat or Claude plan is still genuinely worth it. The limits are terrible for anything heavy, but I think there's a lot of value in having a chatbot that is more resistant to sycophancy and hallucination without needing a custom RAG-citation-generating grounded workflow.

AnticitizenPrime@reddit

> That free API is rate limited

Yeah, that's why most free models shouldn't be your main model because you will hit rate limits. But if you're using them as auxiliary models (vision, text summaries, etc) you're way less likely to hit any rate limits and they cut your overall paid token usage way down.

Right now I'm using that owl-alpha stealth model as my main model and haven't hit any rate limits (they don't seem to rate limit stealth models). It will go away at some point and I'll probably switch to Deepseek V4 Flash unless a new good stealth model comes along.

Last night I actually did switch to local for some roles, as Gemma4 2b is good enough and actually faster local than what I was using on OR before for some tasks. Slower for vision than API but faster for text tasks:

✅ Local Gemma-4-E2B Auxiliary Models — Switched Live

Changes Made

5 roles switched from OpenRouter → local:

Role	Before	After
Vision	`openrouter/gemma-4-26b-a4b-it:free` (26B, ~4s)	`local-llama/gemma-4-e2b` (2B, ~25s)
Title generation	`openrouter/llama-3.2-3b-instruct:free` (3B, ~2s)	`local-llama/gemma-4-e2b` (2B, ~0.4s)
Profile describer	`openrouter/llama-3.2-3b-instruct:free` (3B, ~2s)	`local-llama/gemma-4-e2b` (2B, ~0.4s)
Skills hub	`openrouter/qwen3-coder:free` (8B, ~3s)	`local-llama/gemma-4-e2b` (2B, ~0.4s)
MCP	`openrouter/qwen3-coder:free` (8B, ~3s)	`local-llama/gemma-4-e2b` (2B, ~0.4s)

6 roles kept on OpenRouter (need large context/heavy reasoning): web_extract, compression, approval, triage_specifier, kanban_decomposer, curator

ipcoffeepot@reddit

Don't know what support for MI100 looks like, but you should see if you can run sglang or vllm. You will get better concurrent throughput (more tok/s) than with 4x instances of llama.cpp.

LegacyRemaster@reddit

I keep reiterating that API costs must be carefully calculated:

1) RTX 6000 + w7800 48GB x 2. 300W + 200W + 200W (I lowered the voltage on all of them). The system consumes about 900W at full load (which rarely happens).

2) I use the local system for coding (vscode + claude or kilo or opencode or cline), video creation, image creation, music creation, and meshes.

3) How many APIs do I have to buy and how many tokens do I have to pay to do what I do locally with only the cost of electricity? In the winter, I also save on heating.

4) My workstations have already paid for themselves with the products I sell.

100% privacy, I can use "heretical" films to generate content I can't generate online (try writing technical reports on military systems, for example), I'm not tracked... I think it's a great investment to create your own infrastructure. Furthermore, if I sold the entire setup at today's price, I'd earn at least $4,000 more than when I bought it.

nbvehrfr@reddit

you can rent rtx 6k for 0.6$/h and run your heretic there.

buttplugs4life4me@reddit

The issue is APIs are expensive and renting is even more expensive unless you spend some time working on some solution to start/stop these automatically. As far as I can see, there's no good one available, or they're even more expensive ($2 per hour).

nbvehrfr@reddit

Packet.ai 0.66$/hour rtx 6000 pro

Momsbestboy@reddit

As a private person: yes. If you have a business running or are a freelance developer: good luck with all the requirements concerning data safety. No company will hand over any internal secrets to you if you use a publicly available low-cost LLM to handle the information.

Maximum-Style2848@reddit

They just ignore the requirements. Doctors will use the free version of ChatGPT without a second thought.

tat_tvam_asshole@reddit

Tbf, they won't run on "private" hobbyist servers either.

buttplugs4life4me@reddit

I just did a comparison on the ChatGPT coding plan and the equivalent API cost would be $200 (I'm on the $20 plan lol). Obviously this is a huge value for me, but I do hit limits frequently and, as you said, can't really do a lot of the more interesting stuff with it. The guardrails also suck sometimes. I wanted to edit a photo of my own kid on a grill and it just refused outright.

Also they can change these limits any time without me noticing, and the limits won't be this high forever as they will start feeling financial pressure at some point.

Lastly I'd like to try my own fine-tuning, testing and maybe even model training/expansion.

TheRealMasonMac@reddit

> photo of my own kid on a grill

Sir, you're not supposed to put your kids on a grill.

flyingroad@reddit

But people do grill people

Major-Currency528@reddit

How could you justify the price against deepseek v4 pro or Xiaomi mimo 2.5 pro's outrageously cheap cache hit price and price in general, both of which are far better than either Qwen 3.6 27b or glm 4.7?

A session using deepseek v4 pro with 42 million cache hits cost me 0.80 cents.

Consider that the hardware will only depreciate, and very quickly at that, as massive amounts of compute come online in the coming 2 years, and that open-source API prices are only going to improve as more architectural innovation and better GPUs come out.

notdba@reddit

This. The analysis from OP was deeply flawed by not including the cache read price, which is usually >90% of the total cost when using APIs for agentic coding. In that world, the near-free infinite cache read with local inference can beat using APIs.

That is, until deepseek, and then mimo, lowered the cache read price by 100x. Half a billion of cache read tokens cost $1.40. At this point, both the US providers and the local inference crowds got nothing to counter that.
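
A rough Python sketch of why the cache read price dominates. The $1.40 per half billion tokens figure is from this comment; the $0.29/M uncached price is borrowed from the post's Qwen3.6 27B pricing purely as a stand-in, and the 90% hit rate is a hypothetical input, so the output is illustrative only.

```python
# From the comment: half a billion cache read tokens cost $1.40.
CACHE_READ_USD = 1.40
CACHE_READ_MTOK = 500.0  # million tokens

cache_price_per_mtok = CACHE_READ_USD / CACHE_READ_MTOK


def blended_input_price(uncached_price, cached_price, hit_rate):
    """Average USD per million input tokens for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1 - hit_rate) * uncached_price + hit_rate * cached_price


# Stand-in uncached price (the post's Qwen3.6 27B rate) and a hypothetical
# 90% hit rate; neither is a quoted DeepSeek or MiMo figure.
blended = blended_input_price(0.29, cache_price_per_mtok, 0.90)

print(f"cache read: ${cache_price_per_mtok:.4f} per M tokens")
print(f"blended input at 90% hits: ${blended:.4f} per M tokens")
```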

machinegunkisses@reddit

I'm not OP, but it's possible their workload would not benefit from caches because they only make a single, large call with some new data each time.

Major-Currency528@reddit

I can't imagine many workloads that would not benefit from agentic use over a chat-like experience, and besides the cache price, the input and output price is still cheap.

DeltaSqueezer@reddit

> I have ZAI's best plan, which is currently $144/mo, and it is allowing me about 4.5M input tokens and 200k output tokens of GLM 4.7 per day.

Why is your limit so low? I'm using GLM-5.1 on the middle tier plan and in the last 30 days I have well over 1 Trillion tokens total (input and output).

1ncehost@reddit (OP)

I'll double check I'm counting correctly. Thank you for the comment. I've added a note in the post for that section until I can verify.

SoAnxious@reddit

Nice informative read.

What many people that think cloud AI will keep growing and growing don't understand is most enterprises actually used to have personal servers before the SaaS and cloud migration. That was a legitimate business expense. Cloud was cheap to start to find users but they got greedy.

Once you get a server as a legitimate expense at a good price with models constantly improving you are good to go for years to come. It's not like you get a certain amount of RAM now and you are just slow.

Businesses love fixed costs that can be depreciated, and they love owning their data. So I don't see cloud models staying a must-have for businesses, as there's not that much upward momentum in the AI space. Something akin to Opus 3.6 solves the use case of AI, and anything above that is a stretch to call a must-have.

We are getting smaller and better local models that meet that performance.

Hydroskeletal@reddit

The dynamics are different for an enterprise versus a small biz or solo. You have to remember that running that stuff isn't a "core competency" and all the big cloud providers were initially there to serve internally. That and accounting loves reducing things to a single line item (eg AWS bill)

SoAnxious@reddit

Cloud gaming never took off because people don't like paying more for the "new shiny" over owning their hardware and paying nothing.

The other big competitor will be local models on cell phones with tethering. Business cellphones are already an expense, and cellphones are the best AI model runners because of their architecture.

Apple has every reason to run the local model hardware upgrade every year angle for B2B and B2C it gives the perfect reason to put a larger margin on phones and decrease the time people upgrade them.

Hydroskeletal@reddit

If the angle is that AI is just another piece of software on an employee's assigned devices, then yeah. I don't think the appetite is there for large businesses to run their own server farms for inference.

bnightstars@reddit

Did you test the same process on Qwen3.6-35B? What is the reason you want to run the 27B dense model instead of a MoE one if you already have a particular workflow in mind?

1ncehost@reddit (OP)

This system is being used for a difficult data extraction problem where better models make a difference.

relmny@reddit

I had 35b finding the needle in a haystack every time (tried same prompt/pdfs about 5 times) while 27b kept saying it wasn't there...

Since then, for RAG and such, I trust 35b instead of 27b.

Available_Hornet3538@reddit

Yes 27b great with excel data.

Enough_Big4191@reddit

honestly the part i appreciated most here was treating hardware like an actual depreciating asset instead of “gpu cost = gone forever.” most people skip that entirely when comparing local vs api costs. also the “works great until weird infra edge cases show up” part felt very real. getting agents stable across odd hardware/software combos is always where the hidden time goes.

Future_Manager3217@reddit

This is one of the better local-vs-API writeups because you’re comparing against an actual workload, not just theoretical tok/s.

The extra column I’d add is something like “accepted output per dollar”: raw tokens × success rate after validation/retries/rework. For a real business process, a local rig that is cheaper on token-equivalent cost can still lose if it adds review time, failed extractions, driver downtime, or fallback-to-API periods. The reverse is also true: if local privacy/batching lets you remove review steps or run jobs you would never send to an API, the raw OpenRouter-equivalent number understates the value.

So I’d treat token-equivalent TCO as the starting point, not the denominator.
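
A minimal Python sketch of that metric, to make the idea concrete. The token count and daily costs come from the post (the local cost here is electricity only); the success rates are hypothetical placeholders, since no validation numbers have been shared.

```python
def accepted_output_per_dollar(raw_tokens, success_rate, cost_usd):
    """Tokens that survive validation/retries/rework, per dollar spent."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return raw_tokens * success_rate / cost_usd


OUTPUT_TOKENS_PER_DAY = 1_320_000  # from the post

# Success rates below are made up for illustration only.
local = accepted_output_per_dollar(OUTPUT_TOKENS_PER_DAY, 0.85, 2.11)   # electricity/day
api = accepted_output_per_dollar(OUTPUT_TOKENS_PER_DAY, 0.95, 10.14)    # API cost/day

print(f"local: {local:,.0f} accepted tokens per dollar")
print(f"api:   {api:,.0f} accepted tokens per dollar")
```

Adding review time or fallback-to-API days to the cost side would shift the comparison, which is the point of the metric.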

Qwen30bEnjoyer@reddit

We really need a decentralized cloud provider system so that people like you can serve inference when the rig is inactive with minimal friction to the hoster (As in, the load is cut off when the original owner uses it for minimal disruption, and the current job is routed to a different server).

I say that not with the intention of replacing traditional clouds and their economies of scale, but from the perspective that it would be healthier for the open source ecosystem if model self-hosters had a way to amortize that cost across more users when they want.

1ncehost@reddit (OP)

I spent a day thinking about this a number of months ago, and the issue is that tokens aren't necessarily fungible. In other words, quality of tokens is variable (quants, different models), so there would need to be a lot of quality assurance and redundancy in something like this. It would be complex and maybe require quite a bit of infra to do right.

CrookedCasts@reddit

So let’s say you had a small business that needed to stay local for data security reasons - no cloud calls. Needs to be able to locally do real time voice concurrently on two phone calls. Some of the responses could be scripted, but needs to be able to integrate with a few archaic systems for real time data analysis to feed the responses. I feel like I see not a lot of real time voice builds here… is there a good subreddit for that? Or what has anyone found success with here?

1ncehost@reddit (OP)

There are definitely people who use TTS and STT here. I've made a pretty decent outbound phone call confirmation system myself. The topic isn't as popular here, but this is still the right place IMO.

ikkiho@reddit

fwiw I did the same exercise on a single-3090 box and the line item nobody puts in these is the weeks the rig sat idle because a driver update broke my inference stack and I fell back to api anyway. My calendar shows about 6 weeks last year of pure driver-roulette downtime. Doesn't kill the case but it pushed my payback from 15 to closer to 24 months in my own numbers.

BitGreen1270@reddit

Thanks for the write-up. I think it would be good if you mentioned some of the quality-of-life changes as well. For example, if you are more interested in local LLMs, having your own hardware greatly reduces the friction in working on them. There is also the downside of the time spent getting the 4x MI100 cards working and optimizing them.

michaelsoft__binbows@reddit

Curious how well 3090 stacks up to these MI100. I too did a lot to get 3090s on nvlink and i did get it working (on a pair of cards of different heights no less) but at the end of the day it's not really much of a factor for inference.

The real interesting thing is the most effective way today to get bang for the buck by far with local hosting is the 27B model that a single GPU can host, which is a hilarious and awesome state of affairs. We can pump like 1000 to 2000 tok/s batched throughput out of each 3090/4090/5090 and this really cranks up ROI with regard to its $3/Mtok output API pricing calculation.

But if you tie down your rig to run a 120B model at 400 tok/s throughput, you just took like a good 20x hit to the economy, and instantly you're completely in the red compared to API pricing.
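
A back-of-the-envelope Python sketch of that throughput argument. It prices a day of batched output at an API rate, assuming the GPU is saturated around the clock (the `utilization` parameter is a hypothetical knob for dialing that down) and counting output tokens only.

```python
SECONDS_PER_DAY = 86_400


def api_value_per_day(tok_per_s, usd_per_mtok, utilization=1.0):
    """USD worth of output tokens generated per day at a given API price."""
    tokens = tok_per_s * SECONDS_PER_DAY * utilization
    return tokens / 1_000_000 * usd_per_mtok


# 1000 to 2000 tok/s batched at roughly $3/Mtok output, per the comment.
low = api_value_per_day(1000, 3.0)
high = api_value_per_day(2000, 3.0)

print(f"1000 tok/s: ${low:.2f} per day")
print(f"2000 tok/s: ${high:.2f} per day")
```

Real workloads rarely keep a card saturated all day, so these are upper bounds.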

Apple silicon can help by stretching the electricity farther but once batched throughput is taken into account it evens out again.

electrified_ice@reddit

Good analysis. There are definitely pros and cons to each. I think the value of self-hosting is beyond the $ cost. I went way too far down the rabbit hole (>$40K)... The flip side is that if you try to show ROI beyond a year or 2, what you're not factoring in is that APIs will always be connecting to the latest models, and your hardware may not be able to run the latest models. At the rate open source models are progressing, that's a real thing to factor into the ROI. Maybe your current hardware will run the new models (well) that come out in 2-3 years, but maybe not.

btb0905@reddit

This is interesting. I've got a very similar setup and total expense. What does your inference stack look like? I've made some progress recently on speeding up inference with vllm. I'm getting some pretty good performance out of these cards now. Check out my repo with benchmarks. If you haven't already give my docker container a try. https://github.com/btbtyler09/mi100-llm-testing

I think when you evaluate cost from the perspective of a single user using it as an endpoint for coding, you're going to find its value is lower than its full potential. Lately I have been using my cards for synthetic dataset generation using qwen3.6-35B, along with fine-tuning local models. For these kinds of tasks I found APIs far too expensive. I burned hundreds of dollars on runpod and openrouter to do the same, but the mi100s are far more cost effective. It's also been very educational to have this setup, and it has given me the experience I needed to set up self-hosted models and tools at work.

I'm considering buying 4 more of these because my current hive is busy 24/7 now. I need to find a way to monetize these...

1ncehost@reddit (OP)

Will definitely check it out, thanks. I decided not to go with vllm because quant support is fairly lacking for mi100. I'm using llama.cpp MTP to get about a 50% speedup in my results. I rarely see mi100 users, so we should work together to get them working better.

Could you tell me more about your vllm setup and what your changes accomplished?

orinoco_w@reddit

In testing on my mi100 I've found that q8_0 gives much better prefill performance than q6, q5 etc. Might be worth evaluating your configs.

Definitely look at btb's vllm setup, but you may also wish to consider two instances of 2xmi100. With MTP and tensor parallel, llama.cpp is performing increasingly well.

Token gen keeps up with my 7900xtx and prefill is only slightly behind it.

The r9700 comment is exactly where I'm at.. more mi100's or just switch to buying r9700 for more future proofing. Sadly my hardware budget is already gone for this month.

btb0905@reddit

Best place to start is my vllm fork. Most of the gains came from finding optimal execution paths through pre-existing kernel options. Base vllm gates a lot of things by gpu arch, and that forces the mi100s down default paths with unoptimized kernels. Finding the right ones is trial and error and occasionally takes modifying packages like ck flash attention or AITER to allow for building for mi100s.

https://github.com/btbtyler09/vllm-gfx908

For qwen3.5/3.6 i set claude up on loop and just told it to optimize performance for me and the result nearly doubled decode performance.

Like you said, no one was quantizing models for mi100 either, so about a year ago I started doing it myself. I settled on gptq as the most reliable, and I publish the popular models when I can. I'm somewhat limited in the model sizes I can quantize, but I try to do what I can. I also publish example quant scripts in my repos so others can use them. I have found group size seems to be important for these cards and tend to stick to 32 for that. Most models use the default group size of 128, and many of these generate gibberish when running on mi100s. You can see my hf repo here: https://huggingface.co/btbtyler09

1ncehost@reddit (OP)

This is amazing work. Really appreciated. I'll message you when I get time to test. I'd also like to contribute when I get some time, so are you down for a PR or two?

havenoammo@reddit

I was going to suggest your repo, but I didn't notice the owner was here, haha 😄

Ulterior-Motive_@reddit

This is why I always roll my eyes when someone makes a blanket statement that APIs are cheaper. Yeah, you need a decent investment up front, and research, and planning, but you can absolutely save money too by going local if you chew through tokens.

a_beautiful_rhind@reddit

Having to only buy 8x8gb is just sad.

1ncehost@reddit (OP)

It's barely used except during model loading, so more isn't needed. Larger DIMMs are a lot more expensive. 8GB is a much better value, especially while prices are high.

a_beautiful_rhind@reddit

If you are only doing fully offloaded then it's just being used for loading.. but it's cool to be able to hybrid larger models than your GPUs support and you bought a decent chip for that.

1ncehost@reddit (OP)

The plan would be to upgrade it one day if it makes sense, but it was just not a good option with today's RAM prices.

a_beautiful_rhind@reddit

Same thing got me with trying to buy more ram. Barely a year ago you would have had 8x32gb for similar prices.

havenoammo@reddit

If you have identical hardware like that, try using vLLM. It should be a lot faster to work with than four separate llama.cpp instances.

1ncehost@reddit (OP)

I did consider vllm, but mi100 quant support is fairly poor and llama.cpp has much better options available plus some of the latest optimizations like MTP. I should benchmark vllm at least some day, but llama.cpp is working fairly well and I've got bigger fish to fry.

piscoster@reddit

super nice information! thank you! Would you consider this setup low cost/minimum for running a decent setup?

1ncehost@reddit (OP)

If you are going with a midsize GPU-based system, this is a pretty good setup. I would just go with a different card than the MI100s if I were to do it over.
