How much VRAM do you have and what's your daily-driver model?
Posted by EmPips@reddit | LocalLLaMA | 173 comments
Curious what everyone is using day to day, locally, and what hardware they're using.
If you're using a quantized version of a model please say so!
segmond@reddit
daily driver deepseek-r1-0528, qwen3-235b and then whatever other models I happen to run; I often keep gemma3-27b going for simple tasks that need a fast reply. 425gb vram across 3 nodes.
Pedalnomica@reddit
Damn... What GPUs do you have.
segmond@reddit
lol, I actually counted it's 468gb
but 7x 24gb 3090s, 1x 12gb 3080 Ti, 2x 12gb 3060s, 3x 24gb P40s, 2x 16gb V100s, 10x 16gb MI50s
Expensive-Apricot-25@reddit
How fast does deepseek run?
Pedalnomica@reddit
And here I am... slummin' it with a mere 10x 3090s and 1x 12gb 3060...
PermanentLiminality@reddit
I want to buy stock in your electric utilities.
Pedalnomica@reddit
The plan is to keep the 3060 always running and ready. I'll only power up the 3090s when I'm using the big models. That's the plan anyway...
Maximum-Health-600@reddit
Get solar
BhaiBaiBhaiBai@reddit
What does that make me then, running Qwen3 30B A3B on my ThinkPad's Intel Iris Xe?
Pedalnomica@reddit
A hustler?
Z3r0_Code@reddit
Me crying in the corner with my 4gb 1650.
FormalAd7367@reddit
Wow how much did it cost you for that build?
segmond@reddit
Less than an apple M3 studio with 512gb.
FormalAd7367@reddit
I'm not jealous.. was it originally a Bitcoin mining motherboard…?
segmond@reddit
1 of the nodes is a mining server with 12 pcie slots, the others are dual x99 boards with 6 pcie slots. if you click on my profile you can see the pinned post of my server builds.
FormalAd7367@reddit
thanks - wish i had seen your pinned post a few months ago. i built mine for so much more.
Pedalnomica@reddit
But seriously, I'm curious how the multi node inference for deepseek-r1-0528 works, especially with all those different GPU types.
ICanSeeYourPixels0_0@reddit
I run the same on a M3 Max 32GB MacBook Pro along with VSCode
Pedalnomica@reddit
0.4 bpw?
Hoodfu@reddit
same, although I've given up on qwen3 because r1 0528 beats it by a lot. gemma3-27b, like you, for everything else including vision. I also keep the 4b around, which open-webui uses for tagging and summarizing each chat very quickly. m3 ultra 512.
false79@reddit
I am looking to get the m3 ultra 512GB. Do you find it's overkill for the models you find most useful? Or do you have any regrets, wishing you had gone with a cheaper hardware configuration more fine-tuned to what you do most often?
Hoodfu@reddit
I have the means to splurge on such a thing, so I'm loving that it lets me run such a model at home. It's hard to justify though unless a one-time expense like that is easily within your budget. It doesn't run any models particularly fast, it's more just that you can at all. I'm usually looking at about 16-18 t/s on these models. qwen 235b was faster because its active parameter count is lower than gemma 27b's. something to also consider is the upcoming rtx 6000 pro that might be in the same price range but probably around double the speed, if you're fine with models inside of 96 gigs of ram.
segmond@reddit
r1-0528 is so good, i'm willing to wait through the thinking process. I use it for easily 60% of my needs.
After-Cell@reddit
What's your method to use it while away from home?
segmond@reddit
private vpn, I can access it from any personal device, laptop, tablet & phone included.
After-Cell@reddit
Doesn't that lag out everything else? Or do you have a way to selectively apply the VPN on the phone?
tutami@reddit
How do you handle models not being up to date?
hak8or@reddit
Are you doing inference using llama.cpp's RPC functionality, or something else?
segmond@reddit
not anymore, with offloading of tensors, I can get more out of the GPUs. deepseek on one node, qwen3 on another, then a mixture of smaller models on the other.
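For anyone wondering what that looks like, a minimal llama.cpp sketch of per-tensor offloading (the model file, quant and context size here are illustrative, not the poster's exact command):
llama-server -m ./DeepSeek-R1-0528-Q2_K_XL.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -fa -c 16384
The -ot regex keeps the routed expert tensors in system RAM while the attention and shared layers stay on the GPUs, which is what makes a model this size feasible without multi-node RPC.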
RenlyHoekster@reddit
3 Nodes: how are you connecting them, with Ray for example?
Easy_Kitchen7819@reddit
7900xtx Qwen 3 32B 4qxl
zubairhamed@reddit
640KB ought to be enough for anybody...
....but i do have 24GB
mobileJay77@reddit
640K is enough for every human
This also goes to show how much the demand for computing soaks up all the gains from productivity and Moore's law. Why would we need fewer developers?
stoppableDissolution@reddit
We will also need more developers if compute scaling slows down. Someone will have to refactor all the bloatware written when getting more compute was cheaper than hiring someone familiar with performance optimizations
StandardPen9685@reddit
Mac mini M4 pro 64gb. Gemma3:12b
IllllIIlIllIllllIIIl@reddit
I have 8MB of VRAM on a 3dfx Voodoo2 card and I'm running a custom trigram hidden markov model that outputs nothing but Polish curses.
pixelkicker@reddit
*but heavily quantized so sometimes they are in German
kremlinhelpdesk@reddit
I just run the magistral-small-2506-zdunk, produces all sorts of functional hexes and curses in polish. Would probably help to run it on esoteric hardware, though.
techmago@reddit
you should try templeOS then
Zengen117@reddit
All of the upvotes for templeOS XD
pun_goes_here@reddit
RIP 3dfx :(
jmprog@reddit
kurwa
notwhobutwhat@reddit
Qwen3-32B-AWQ across two 5060's, Gemma3-12B-QAT on a 4070, and BGE3 embedder/reranker on an old 3060 I had lying around. Just running them all in an old gaming rig, i9-9900K with 64GB, using OpenWebUI on the front end. Also running Perplexica and GPT Researcher on the same box.
Getting 35t/s on Qwen3-32B, which is plenty for help with work related content creation, and using MCP tools to plug any knowledge gaps or verify latest info.
The_Crimson_Hawk@reddit
Llama 4 maverick on cpu, 8 channel ddr5 5600
Hurricane31337@reddit
EPYC 7713 with 4x 128 GB DDR4-2933 with 2x RTX A6000 48 GB -> 512 GB RAM with 96 GB VRAM
Using mostly Qwen 3 30B in Q8_K_XL with 128K tokens context. Sometimes Qwen 3 235B in Q4_K_XL but most of the time the slowness compared to 30B isn't worth it for me.
BeeNo7094@reddit
How much was that 128GB of RAM? You're not utilising 4 channels to be able to expand to 1TB later?
Hurricane31337@reddit
I paid 765⬠including shipping for all four sticks.
Yes, when I got them, DeepSeek V3 just came out and I wasnāt sure if even larger models will come out. 1500⬠was definitely over my spending limit but who knows, maybe I can snatch a deal in the future. š¤
BeeNo7094@reddit
765 eur definitely is a bargain compared to the quotes I have received here in India. Do you have any CPU inference numbers for ds q4 or any unsloth dynamic quants? Using ktransformers? Multi GPU helps with ktransformers?
What motherboard?
Hurricane31337@reddit
Sorry, I'm not at home currently, I can do it on Monday. Currently I'm using Windows 11 only though (because of my company; was too lazy to set up a Unix dual boot).
BeeNo7094@reddit
Hey, let me know if you get the time to do this.
eatmypekpek@reddit
How are you liking the 512gb of RAM? Are you satisfied with the quality at 235B (even if slower)? Lastly, what kinda tps are you getting at 235B Q4?
I'm in the process of making a Threadripper build and trying to decide if I should get 256gb, 512gb, or fork over the money for 1tb of DDR4 RAM.
Hurricane31337@reddit
Sorry, I'm not at home currently, I can measure it on Monday. Currently I'm using Windows 11 only though (because of my company; was too lazy to set up a Unix dual boot). If I remember correctly, Qwen 3 235B Q4_K_XL was like 2-3 tps, so definitely very slow (especially with thinking activated). Qwen 3 30B Q8_K_XL is more than 30 tps (or even faster) and mostly works just as well, so I'm always using 30B and rarely, if 30B spits out nonsense, I switch to 235B in the same chat and let it answer the few messages 30B wasn't able to answer (better slow than nothing).
Dismal-Cupcake-3641@reddit
I have 12 GB VRAM. I generally use the quantized version of Gemma 12B in the interface I developed. I also added a memory system and it works very well.
Zengen117@reddit
I'm running the same setup. Gemma3:12b-qat RTX 3060 with 12GB VRAM and I use open-webui for remote accessible interface.
DrAlexander@reddit
With 12GB VRAM I've also mainly stuck to the 8-12b q4 models, but lately I've found that I can also live with the 5 tok/s from gemma3 27B if I just need 3-4 answers, or if I set up a proper pipeline for appropriately chunked text assessment and leave it running overnight.
Hopefully soon I'll be able to get one of those 24GB 3090s and be in league with the bigger small boys!
Dismal-Cupcake-3641@reddit
Yes, now we both need big VRAMs. But I think about what could be different every day. I want to do something that will make even a 2B or 4B model an expert in a specific field and give much better results than large models.
After-Cell@reddit
Please give me a keyword to investigate the memory system.
And also,
How do you access it when not at home on site?
Dismal-Cupcake-3641@reddit
I rented a vps, I make an api call to it, and since it is connected to my computer at home via an ssh tunnel, it makes an api call to my computer at home, gets the response and sends it to me. I developed a simple memory system for myself, each conversation is also recorded, so the model can remember what I'm talking about and continue where it left off.
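For anyone wanting to replicate that, the core is just a reverse SSH tunnel from the home machine to the VPS (hostname and port below are placeholders, not the actual setup):
ssh -N -R 8080:localhost:8080 user@vps.example.com
The VPS receives the API call, forwards it to its local port 8080, and the tunnel carries it back to the model server running at home.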
After-Cell@reddit
Great approach! I'll investigate for sure
Dismal-Cupcake-3641@reddit
Thanks :)
fahdhilal93@reddit
are you using a 3060?
Dismal-Cupcake-3641@reddit
Yes RTX 3060 12GB.
Judtoff@reddit
I peaked at 4 P40s and a 3090, 120GB. Used Mistral Large 2. Now that gemma3 27b is out I've sold my P40s and I'm using two 3090s. Quantized to 8 bits and using 26000 context. Planning on 4 3090s eventually for 131k context.
No-Statement-0001@reddit
i tested llama-server, SWA up to 80K context, and it fit on my dual 3090s with no kv quant. With q8, pretty sure it can get up to the full 128K.
Wrote up findings here: https://github.com/mostlygeek/llama-swap/wiki/gemma3-27b-100k-context
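For context, the linked write-up boils down to a llama-server invocation along these lines (quant, context size and split are illustrative; see the wiki page for the actual flags):
llama-server -m ./gemma-3-27b-it-Q4_K_M.gguf -ngl 99 -fa -c 81920 --tensor-split 1,1
Gemma 3's sliding-window attention keeps the KV cache small enough that a context this large can fit on two 24GB cards without cache quantization.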
Judtoff@reddit
I'll have to check this out. I've got the third 3090 in the mail, but avoiding a fourth would save me some headaches. Even if the third ends up being technically unnecessary, I'd like some space to run TTS and SST and a diffusion model (like SDXL), so the third won't be a complete waste. Thanks for sharing!
After-Cell@reddit
How do you use it when not at home in front of it ?
No-Statement-0001@reddit
wireguard vpn.
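For anyone setting this up, a bare-bones WireGuard client config looks like this (keys, addresses and endpoint are placeholders):
[Interface]
PrivateKey = <client-private-key>
Address = 10.0.0.2/32
[Peer]
PublicKey = <server-public-key>
Endpoint = home.example.com:51820
AllowedIPs = 10.0.0.0/24
PersistentKeepalive = 25
Limiting AllowedIPs to the home subnet means only traffic to the LLM box goes through the tunnel, so the rest of the device's traffic isn't slowed down.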
Klutzy-Snow8016@reddit
For Gemma 3 27b, you can get the full 128k context (with no kv cache quant needed) with BF16 weights with just 3 3090s.
Judtoff@reddit
Oh fantastic haha, I've got my third 3090 in the mail and fitting the fourth was going to be a nightmare (I would need a riser), this is excellent news. Thank you!
Eden1506@reddit
mistral 24b on my steam deck at around 4 tokens/s
LA_rent_Aficionado@reddit
I still use APIs more for a lot of uses with Cursor but when I run locally on 96gb vram -
Qwen3 235B A22B Q3 at 64k context with Q4 kv cache; Qwen3 32B dense Q8 at 132k context
ExtremeAcceptable289@reddit
8gb rx 6600m, 16gb system ram. i (plan to) main qwen3 30b moe
relmny@reddit
The monthly "how much VRAM and what model" post, which is fine, because these things change a lot.
With 16gb VRAM/128gb RAM: qwen3-14b and 30b. If I need more, 235b; and if I really need more/the best, deepseek-r1-0528.
With 32gb VRAM/128gb RAM: the above, except 32b instead of 14b. The rest is the same.
Dyonizius@reddit
same here, how are you running the huge moe's?
*pondering on a ram upgrade
relmny@reddit
-m ../models/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ot ".ffn_.*_exps.=CPU" -c 16384 -n 16384 --prio 2 -t 4 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 -ngl 99 -fa
offloading the MoE to CPU (RAM)
And this is deepseek-r1 (about 0.73t/s), but with ik_llama.cpp (instead of vanilla llama.cpp). I usually "disable" thinking, and I only run it IF I really need to.
-m ../models/huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf --ctx-size 12288 -ctk q8_0 -mla 3 -amb 512 -fmoe -ngl 63 --parallel 1 --threads 5 -ot ".ffn_.*_exps.=CPU" -fa
Dyonizius@reddit
for 32GB vram try this
in addition, use all physical cores on MoEs
for some reason it scales linearly
MidnightHacker@reddit
What quant are you using for R1? I have 88GB of RAM, thinking about upgrading to 128GB
relmny@reddit
ubergarm/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
but I only get about 0.73t/s with ik_llama.cpp.
Dyonizius@reddit
the _R4 is pre-repacked so you're probably not offloading all possible layers right?
Weary_Long3409@reddit
3x3060. Two are running Qwen3-8B-w8a8, and the other one is running Qwen2.5-3B-Instruct-w8a8, embedding model, and whisper-large-v3-turbo.
Mostly for classification, text similarity, comparison, transcription, and its automation. The ones running 8B are old workhorses serving concurrent requests, with prompt processing peaking at 12,000-13,000 tokens/sec.
ATyp3@reddit
I have a question. What do you guys actually USE the LLMs for?
I just got a beefy M4 MBP with 48 gigs of RAM and really only want 2 models. One for Raycast so I can ask quick questions and one for "vibe coding". I just want to know.
Maykey@reddit
16GB. unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF for local
p4s2wd@reddit
2080 Ti 22G x 7 + 3090 x 1, 178GB of VRAM in total.
It's running DeepSeek-V3-0324-UD-Q2 + Qwen3-32B bf16.
Goldkoron@reddit
96gb VRAM across 2 3090s and a 48gb 4090D
However, I still use Gemma3-27b mostly, it feels like one of the best aside from the huge models that are still out of reach.
Dead-Photographer@reddit
I'm doing gemma 3 27b and qwen3 32b q4 or q8 depending on the use case, 80gb RAM + 24gb VRAM (2 3060s)
DAlmighty@reddit
Oh I love these conversations to remind me that I'm GPU poor!
norman_h@reddit
352gb vram across multiple nodes...
DeepSeek 70b model locally... Also injecting DNA from gemini 2.5 pro. Unsure if I'll go ultra yet...
Frankie_T9000@reddit
0 vram, 512gb ram (machine has a 4060 too but I don't use it for this llm). Deepseek q3_k_l
Zengen117@reddit
Honestly, I'm running Gemma3-it-qat 12b on a gaming rig with an RTX 3060 (12GB VRAM). With a decent system prompt and a search engine API key in open-webui it's pretty damn good for general purpose stuff. It's not gonna be suitable if you're a data scientist, if you wanna crunch massive amounts of data or do a lot with image/video. But for modest general AI use, question and answer, quick web search summaries etc, it gets the job done pretty well. The accuracy benefit with the QAT models on my kind of hardware is ENORMOUS as well.
fizzy1242@reddit
72gb vram across three 3090s. I like mistral large 2407
FormalAd7367@reddit
Why do you prefer mistral large over deepseek? I'm running 4x 3090.
fizzy1242@reddit
Would be too large to fit.
RedwanFox@reddit
Hey, what motherboard do you use? Or is it distributed setup?
fizzy1242@reddit
board is Asus rog crosshair viii dark hero x570. all in one case
Ok_Agency8827@reddit
Do you need the NVLink peripheral, or does the motherboard handle the SLI? Also, what power supply do you use? I don't really understand how to SLI these GPUs for multi GPU use.
fizzy1242@reddit
No nvlink, it's not necessary. my psu is 1500W, but I still powerlimit gpus to keep thermals under control
RedwanFox@reddit
Thanks!
candre23@reddit
I also have three 3090s and have moved from largestral tunes to CMD-A tunes.
fizzy1242@reddit
I liked command A too, but i'm pretty sure exl2 doesn't support it yet unfortunately. Tensor splitting it in llamacpp isn't very fast
Zc5Gwu@reddit
Curious about your experience with mistral large. What do you like about it, speed, compared to other models?
fizzy1242@reddit
i like how it writes, it's not as robotic in conversing in my opinion. speed is good enough at 15t/s with exl2
Mescallan@reddit
M1 MacBook air, 16gig ram
Gemma 4b is my work horse because I can run it in the background doing classification stuff. I chat with Claude, and use Claude code and cursor for coding.
ArchdukeofHyperbole@reddit
6 gigabytes. Qwen 30B. I use online models as well but not nearly as much nowadays
philmarcracken@reddit
is that unsloth? using lm studio or something else?
ArchdukeofHyperbole@reddit
Lm studio and sometimes use a python wrapper of llama.cpp, easy_llama.
I grabbed a few versions of the 30B from unsloth, q8 and q4, and pretty much stick with the q4 because it's faster.
needthosepylons@reddit
12gb vram (3060) and 32gb DDR4. Generally using Qwen3-8b; recently trying out MiniCPM4, which actually performs better than Qwen3 on my own benchmark.
molbal@reddit
8GB VRAM + 48GB RAM, I used to run models in the 7-14b range, but lately I tend to pick Gemma3 4b, or Qwen3 1.7B.
Gemma is used for things like commit message generation, and the tiny qwen is for realtime one liner autocompletion.
For anything more complex, Qwen 30B runs too, but if the smaller models don't suffice it's easier to just reach for Gemini 2.5 for me via open router.
FullOf_Bad_Ideas@reddit
2x 24GB (3090 Ti). Qwen 3 32B FP8 and AWQ.
beedunc@reddit
Currently running 2x 16GB 5080Tis for 32GB, and it's just awful.
I'm about to scrap it all and just get a Mac.
I can waste $3500 on another 32GB of vram, or get a Mac with 88GB(!) of "vram" for about the same price.
Chasing vram with NVIDIA cards in this overpriced climate is a fool's errand.
Mac all the way from now on.
EmPips@reddit (OP)
Curious what issues you're running into? I'm also at 32GB and it's been quite a mixed bag.
beedunc@reddit
Yes, mixed bag. I thought 32 would be the be-all and end-all, as most of my preferred models were 25-28GB.
I load them up (Ollama), and they lie! The "24GB" model actually requires 40+ GB of vram, so - still swapping.
There's no cheap way to add "more" vram, as the PCIE slots are spoken for.
Swapping a 32GB for my 16 only nets me a 16GB increase. For $3500!!!
Selling it and just buying an 88GB VRAM Mac for $2K - solved.
Good riddance, NVIDIA.
EmPips@reddit (OP)
I'm not a fan of modern prices either! But I'm definitely not swapping and I have a similar (2x16GB) configuration to yours.
Are you leaving ctx-size at its default? Are you using flash attention? Quantizing the cache?
beedunc@reddit
I don't really know how to do that stuff, but can it make enough of a difference to overcome a 15GB shortfall? Where do I find out more about those tweaks you point out?
The joke's on me since I thought actual model size (in GB) was closely related to how much vram I needed. Doh!
Secure_Reflection409@reddit
Try Lmstudio. It's reasonably intuitive.
Start with 4096 context and make sure the flash attention box is ticked.
That's as close to native as you're gonna get. It can be tweaked further but start there.
beedunc@reddit
Been doing that, will look into it more. Thanks.
EmPips@reddit (OP)
I made a similar mistake early on and ended up needing to trade some 12GB cards in haha.
And yes actually. IIRC llama-cpp will use model defaults for context size(?), which for most modern models is >100k tokens (that's A TON).
If you're running llama.cpp and llama-server specifically: set the context size (ctx-size) yourself - as an example, somewhere around 14,000 if your use-case doesn't exceed 14,000 tokens (just play around with that a bit).
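A hedged example of what those flags look like together (model name and numbers are placeholders, tune them to your setup):
llama-server -m ./model-Q4_K_M.gguf -ngl 99 -fa -c 14336 --cache-type-k q8_0 --cache-type-v q8_0
Capping the context and quantizing the KV cache to q8_0 can free several GB of VRAM compared to a model's 100k+ default context, usually with little quality loss.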
beedunc@reddit
Iām going to look into that, thanks!
BZ852@reddit
You can use some of the nvme slots to do just that FYI. You can also convert a PCI lane to multiple lanes too.
Would suck for anything latency sensitive, but thankfully LLMs are not that.
colin_colout@reddit
96gb very slow iGPU so I can run lots of things but slowly.
Qwen3's smaller MoE q4 is surprisingly fast at 2k context and slow but usable until about 8k.
It's a cheap mini pc and super low power. Since MoEs are damn fast and perform pretty well, I can't imagine an upgrade that is worth the cost.
Thedudely1@reddit
I'm running a 1080 Ti, for full GPU offload I run either Qwen 3 8B or Gemma 3 4B to get around 50 tokens/second. If I can wait, I'll do partial GPU offload with Qwen 3 30B-A3B or Gemma 3 27b (recently Magistral Small) to get around 5-15 tokens/second. I've been experimenting with keeping the KV cache in system ram instead of offloading it to VRAM in order to allow for much higher context lengths and slightly larger models to have all layers offloaded to the GPU.
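For anyone wanting to try the same trick, llama.cpp exposes it as a single flag; a rough sketch (model and context size are illustrative, not the poster's exact command):
llama-server -m ./Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 --no-kv-offload -c 32768
--no-kv-offload keeps the KV cache in system RAM so all weight layers can stay on the GPU; prompt processing gets slower, but it buys room for longer context and slightly larger models.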
techmago@reddit
ryzen 5800x
2x3090
128GB RAM
nvme for the models.
i use qwen3:32b + nevoria (llama3: 70b)
sometimes: qwen3:235b (is slow... but i can!)
NNN_Throwaway2@reddit
24GB, Qwen3 30b a3b
ganonfirehouse420@reddit
Just set up my solution. My second PC got a 16gb vram gpu and 32gb ram. Running qwen3-30b-a3b so far till I find something better.
ttkciar@reddit
Usually I use my MI60 with 32GB of VRAM, but it's shut down for the summer, so I've been making do with pure-CPU inference. My P73 Thinkpad has 32GB of DDR4-2666, and my Dell T7910 has 256GB of DDR4-2133.
Some performance stats for various models here -- http://ciar.org/h/performance.html
I'm already missing the MI60, and am contemplating improving the cooling in my homelab, or maybe sticking a GPU into the remote colo server.
PraxisOG@reddit
2x rx6800 for 32gb vram and 48gb of ram. I usually use Gemma 3 27b qat4 to help me study, llama 3.3 70b iq3xxs when Gemma struggles to understand something, q4 qwen 3 30b/30b moe for coding. I've been experimenting with an iq2 version of qwen 3 235b, but between the low quant and 3.5tok/s speed it's not super useful.
mobileJay77@reddit
RTX 5090 with 32GB VRAM. I mostly run Mistral Small 3.1 @Q6, which leaves me with 48k context.
Otherwise I tend to mistral based devstral or reasoning. GLM works for code but failed with MCP.
MixChance@reddit
If you have 6GB or less VRAM and 16GB RAM, don't go over 8B parameter models. Anything larger (especially models over 6GB in download size) will run very slow and feel sluggish during inference, and can wear on your device over time.
After lots of testing, I found the sweet spot for my setup is:
8B parameter models
Quantized to Q8_0, or sometimes FP16
Fast responses and stable performance, even on laptops
My specs:
GTX 1660 Ti (mobile)
Intel i7, 6 cores / 12 threads
16GB RAM
Anything above 6GB in size for the model tends to slow things down significantly.
Quick explanation of quantization:
Think of it like compressing a photo. A high-res photo (like a 4000x4000 image) is like a huge model (24B, 33B, etc.). To run it on smaller devices, it needs to be compressed - that's what quantization does. The more you compress (Q1, Q2...), the more quality you lose. Higher Q numbers like Q8, or FP16, offer better quality and responses but require more resources.
Rule of thumb:
Smaller models (like 8B) + higher float precision (Q8 or FP16) = best performance and coherence on low-end hardware.
If you really want to run larger models on small setups, you'll need to use heavily quantized versions. They can give good results, but often they perform similarly to smaller models running at higher precision - and you miss out on the large model's full capabilities anyway.
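A rough back-of-the-envelope check for whether a model's weights will fit (ignoring KV cache and runtime overhead, which add a few more GB):
8B params x 2 bytes (FP16) = ~16 GB
8B params x 1 byte (Q8_0) = ~8 GB
8B params x ~0.6 bytes (Q4_K_M, ~4.8 bits/weight) = ~4.8 GB
Compare that against your VRAM, and remember anything that doesn't fit spills into much slower system RAM.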
Extra Tip:
On the Ollama website, click "View all models" (top right corner) to see all available versions, including ones optimized for low-end devices.
You do the math - based on my setup and results, you can estimate what models will run best on your machine too. Use this as a baseline to avoid wasting time with oversized models that choke your system.
freedom2adventure@reddit
daily driver llama-server --jinja -m ./model_dir/Llama-3.3-70B-Instruct-Q4_K_M.gguf --flash-attn --metrics --cache-type-k q8_0 --cache-type-v q8_0 --slots --samplers "temperature;top_k;top_p" --temp 0.1 -np 1 --ctx-size 131000 --n-gpu-layers 0
Running on raider ge66 64gb ddr5 12th gen i9, 3070 ti 8gb vram usually get .5-2 tokens/s, usually coherent to about 75k context before it is too slow to be useful.
Felladrin@reddit
32GB, MLX, Qwen3-14B-4bit-DWQ, 40K-context.
When starting a chat with 1k tokens in context:
- Time to first token: ~8s
- Tokens per second: ~24
When starting a chat with 30k tokens in context:
- Time to first token: ~300s
- Tokens per second: ~12
haagch@reddit
16 gb vram, 64 gb ram. I don't daily drive any model because everything that runs with usable speeds on this is more or less a toy.
I'm waiting until any consumer GPU company starts selling hardware that can run useful stuff on consumer PCs instead of wanting to force everyone to use cloud providers.
If the Radeon R9700 has a decent price I'll probably buy it but let's be real, 32 gb is still way too little. Once they make 128 gb GPUs for $1500 or so, then we can start talking.
HugoCortell@reddit
My AI PC has ~2GB VRAM I think. It runs SmolLM very well. I do not drive it daily because it's not very useful.
My workstation has 24GB but I don't use it for LLMs.
RelicDerelict@reddit
What are you using the SmolLM models for?
AppearanceHeavy6724@reddit
20 GiB. Qwen3 30B-A3B coding, Mistral Nemo and Gemma 3 27B creative writing.
Western_Courage_6563@reddit
12gb, and it's mostly the deepseek-r1 qwen distill 8b. And others within the 7-8b range
throw_me_away_201908@reddit
32GB unified memory, daily driver is Gemma3 27B Q4_K_M (mlabonne's abliterated GGUF) with 20k context. I get about 5.2t/s to start, drifting down to 4.2 as the context fills up.
pmv143@reddit
Running a few different setups, but mainly 48GB A6000s and 80GB H100s across a shared pool. Daily-driver models tend to be 13B (Mistral, LLaMA) with some swap-ins for larger ones depending on task.
We've been experimenting with fast snapshot-based model loading, aiming to keep cold starts under 2s even without persistent local storage. It's been helpful when rotating models dynamically on shared GPUs.
SanDiegoDude@reddit
You just reminded me that my new AI box is coming in next week. 128GB of unified system ram on the new AMD architecture. Won't be crazy fast, but I'm looking forward to running 70B and 30B models on it.
unrulywind@reddit
RTX 4070ti 12gb and RTX 4060ti 16gb
All around use local:
gemma-3-27b-it-UD-Q4_K_XL
Llama-3_3-Nemotron-Super-49B-v1-IQ3_XS
Mistral-Small-3.1-24B-Instruct-2503-UD-Q4_K_XL
Coding, local, VS Code:
Devstral-Small-2505-UD-Q4_K_XL
Phi-4-reasoning-plus-UD-Q4_K_XL
Coding, refactoring, VS Code:
Claude 4
Long-Shine-3701@reddit
128GB VRAM across (2) Radeon Pro W6800x duo connected via Infinity Fabric. Looking to add (4) Radeon Pro VII with Infinity Fabric for an additional 64GB. Maybe an additional node after that. What interesting things could I run?
5dtriangles201376@reddit
16+12gb, run Snowdrop
AC1colossus@reddit
That is, you offload from your 16 of VRAM? How's the latency?
5dtriangles201376@reddit
Dual GPU 16gb + 12gb. It's actually really nice and although it would have been better to have gotten a 3090 when they were cheap I paid a bit less than what used ones go for now
AC1colossus@reddit
Ah yeah makes sense. Thanks.
Dicond@reddit
56gb VRAM (5090 + 3090), Qwen3 32b, QwQ 32b, Gemma3 27b have been my go to. I'm eagerly awaiting the release of a new, better ~70b model to run at q4-q5.
StandardLovers@reddit
48GB vram, 128GB ddr5. Mainly running qwen 3 32b q6 w/16000 context.
getfitdotus@reddit
Two dedicated AI machines: 4x Ada 6000 and 4x 3090. The 3090s run qwen3-30b in bf16 with kokoro tts. The Adas run qwen3-235B in GPTQ int4. Used mostly via APIs. Also keep the qwen 0.6B embedding model loaded. All with 128k context. 30B starts at 160t/s and 235B around 60t/s.
jgenius07@reddit
24gb vram on an amd rx7900xtx Daily-ing a Gemma3:27b
EmPips@reddit (OP)
What quant and what t/s are you getting? I'm using dual 6800's right now and notice a pretty sharp drop in speed when splitting across two GPU's (llama-cpp rocm)
jgenius07@reddit
I'm consistently getting 20t/s. It's the 4-bit quantised version. I have it on a pcie5 slot but it runs at pcie4 speed.
EmPips@reddit (OP)
That's basically identical to what I'm getting with the 6800's, something doesn't seem right here. You'd expect that 2x memory-bandwidth to show up somewhere.
What options are you giving it for ctx-size? What quant are you running?
Ashefromapex@reddit
On my macbook pro with 128gb I mostly use qwen3 30b and 235b because of the speed. On my server I have a 3090 and am switching between glm4 for coding and qwen3-32b for general purpose.
TopGunFartMachine@reddit
~160GB total VRAM. Qwen3-235B-A22B. IQ4_XS quant. 128k context. ~200tps PP, ~15tps generation with minimal context, ~8tps generation at lengthy context.
No_Information9314@reddit
24GB VRAM on 2x 3060s, mainly use Qwen-30b
vulcan4d@reddit
42GB VRAM with 3x P102-100 and 1x 3060. I run Qwen3 30b-a3b with a 22k context to fill the VRAM.
makistsa@reddit
qwen3 235B q3 ~5.5t/s (starts at 5.75 and falls to 5.45) and qwen3 30B q6
128gb ddr4, 16gb vram
I am waiting for dots. It's the perfect size, if it's good.
Equivalent-Stuff-347@reddit
What's dots? Search is failing me here
makistsa@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1l4mgry/chinas_xiaohongshurednote_released_its_dotsllm/
Equivalent-Stuff-347@reddit
Thanks! Open source MoE with 128 experts, top-6 routing, and 2 shared experts sounds lit
SaratogaCx@reddit
I think those are the little animated bouncing '...'s you see when you are waiting for a response.
SplitYOLO@reddit
24GB VRAM and Qwen3 32B
opi098514@reddit
I have 132 gigs of vram across 3 machines and I daily drive... ChatGPT, GitHub Copilot, Gemini, Jules, and Claude. I'm a poser, I'm sorry. I use all my vram for my projects that use LLMs, but they aren't used for actual work.
ObscuraMirage@reddit
32gb, and from Ollama mainly Gemma3:12B (I pair it sometimes with Gemma3:4b or Qwen2.5VL 7B), with Unsloth's MistralSmall3.1:24B or Qwen3 30B for the big tasks.
Slowly moving toward llamacpp.
Zc5Gwu@reddit
I have 30gb vram across two gpus and generally run qwen3 30b at q4 and a 3b model for code completion on the second gpu.
getmevodka@reddit
i have up to 248gb vram and use either qwen3 235b a22b q4kxl with 128k context with 170-180gb in size whole.
or r1 0528 iq2xxs with 32k context with 230-240gb in size whole.
depends.
if i need speed i use qwen3 30b a3b q8kxl with 128k context - dont know the whole size of that tbh. its small and fast though lol.
tta82@reddit
128GB M2 Ultra and 3090 24GB on an i9 PC.
EasyConference4177@reddit
I got 144gb: 2x 3090 Turbos at 24gb each and 2x Quadro 8000s at 48gb each... but honestly, if you can access 24gb and Gemma 3 27b, that's all you need. I'm just an enthusiast for it and want to eventually build my own company on AI LLMs.
IkariDev@reddit
Dans PE 1.3.0. 36gb vram + 8gb vram on my server; 16gb ram + 16gb ram on my server.
marketlurker@reddit
llama3.2, but playing with llama4. I run a Dell 7780 laptop with 16GB VRAM and 128GB RAM
vegatx40@reddit
24g, gemma3:27b
plztNeo@reddit
128GB unified memory. For speed I'm leaning towards Gemma 3 27B or Qwen3 32B.
For anything chunky I tend towards Llama 3.3 70B
findingsubtext@reddit
72GB (2x 3090, 2x 3060). I run Gemma3 27B because it's fast and doesn't hold my entire workstation hostage.
maverick_soul_143747@reddit
Testing out Qwen 3 32B locally on my macbook pro