How much VRAM do you have and what's your daily-driver model?
Posted by EmPips@reddit | LocalLLaMA | 173 comments
Curious what everyone is using day to day, locally, and what hardware they're using.
If you're using a quantized version of a model please say so!
segmond@reddit
daily driver deepseek-r1-0528, qwen3-235b and then whatever other models I happen to run; I often keep gemma3-27b going for simple tasks that need a fast reply. 425gb vram across 3 nodes.
Pedalnomica@reddit
Damn... What GPUs do you have.
segmond@reddit
lol, I actually counted it's 468gb
but 7x 24gb 3090s, 1x 12gb 3080 Ti, 2x 12gb 3060s, 3x 24gb P40s, 2x 16gb V100s, 10x 16gb MI50s
Expensive-Apricot-25@reddit
How fast does deepseek run?
Pedalnomica@reddit
And here I am... slummin' it with a mere 10x 3090s and 1x 12gb 3060...
PermanentLiminality@reddit
I want to buy stock in your electric utilities.
Pedalnomica@reddit
The plan is to keep the 3060 always running and ready. I'll only power up the 3090s when I'm using the big models. That's the plan anyway...
Maximum-Health-600@reddit
Get solar
BhaiBaiBhaiBai@reddit
What does that make me then, running Qwen3 30B A3B on my ThinkPad's Intel Iris Xe?
Pedalnomica@reddit
A hustler?
Z3r0_Code@reddit
Me crying in the corner with my 4gb 1650.
FormalAd7367@reddit
Wow how much did it cost you for that build?
segmond@reddit
Less than an apple M3 studio with 512gb.
FormalAd7367@reddit
I'm not jealous.. was it originally a Bitcoin mining motherboard…?
segmond@reddit
1 of the nodes is a mining server with 12 pcie slots, the others are dual x99 boards with 6 pcie slots. if you click on my profile you can see the pinned post of my server builds.
FormalAd7367@reddit
thanks - wish i had seen your pinned post a few months ago. i built mine for so much more.
Pedalnomica@reddit
But seriously, I'm curious how the multi node inference for deepseek-r1-0528 works, especially with all those different GPU types.
ICanSeeYourPixels0_0@reddit
I run the same on a M3 Max 32GB MacBook Pro along with VSCode
Pedalnomica@reddit
0.4 bpw?
Hoodfu@reddit
same, although I've given up on qwen3 because r1 0528 beats it by a lot. gemma3-27b, like you, for everything else including vision. I also keep the 4b around, which open-webui uses for tagging and summarizing each chat very quickly. m3 ultra 512.
false79@reddit
I am looking to get the m3 ultra 512GB. Do you find it's overkill for the models you find most useful? Or do you have any regrets, wishing you had gone with a cheaper hardware configuration more fine-tuned to what you do most often?
Hoodfu@reddit
I have the means to splurge on such a thing, so I'm loving that it lets me run such a model at home. It's hard to justify though unless a one-time expense like that is easily within your budget. It doesn't run any models particularly fast, it's more just that you can at all. I'm usually looking at about 16-18 t/s on these models. qwen 235b was faster because its active parameter count is lower than gemma 27b's. something to also consider is the upcoming rtx 6000 pro that might be in the same price range but probably around double the speed, if you're fine with models inside of 96 gigs of ram.
segmond@reddit
r1-0528 is so good, i'm willing to wait through the thinking process. I use it for easily 60% of my needs.
After-Cell@reddit
What's your method to use it while away from home?
segmond@reddit
private vpn, I can access it from any personal device, laptop, tablet & phone included.
After-Cell@reddit
Doesn't that lag out everything else? Or do you have a way to selectively apply the VPN on the phone?
tutami@reddit
How do you handle models not being up to date?
hak8or@reddit
Are you doing inference using llama.cpp's RPC functionality, or something else?
segmond@reddit
not anymore, with offloading of tensors, I can get more out of the GPUs. deepseek on one node, qwen3 on another, then a mixture of smaller models on the other.
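For anyone wondering what that looks like, a minimal llama.cpp sketch of per-tensor offloading (the model file, quant and context size here are illustrative, not the poster's exact command):
llama-server -m ./DeepSeek-R1-0528-Q2_K_XL.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -fa -c 16384
The -ot regex keeps the routed expert tensors in system RAM while the attention and shared layers stay on the GPUs, which is what makes a model this size feasible without multi-node RPC.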
RenlyHoekster@reddit
3 Nodes: how are you connecting them, with Ray for example?
Easy_Kitchen7819@reddit
7900xtx Qwen 3 32B 4qxl
zubairhamed@reddit
640KB ought to be enough for anybody...
....but i do have 24GB
mobileJay77@reddit
640K is enough for every human
This also goes to show how much the demand for computing soaks up all the gains from productivity and Moore's law. Why would we need fewer developers?
stoppableDissolution@reddit
We will also need more developers if compute scaling slows down. Someone will have to refactor all the bloatware written when getting more compute was cheaper than hiring someone familiar with performance optimizations
StandardPen9685@reddit
Mac mini M4 pro 64gb. Gemma3:12b
IllllIIlIllIllllIIIl@reddit
I have 8MB of VRAM on a 3dfx Voodoo2 card and I'm running a custom trigram hidden markov model that outputs nothing but Polish curses.
pixelkicker@reddit
*but heavily quantized so sometimes they are in German
kremlinhelpdesk@reddit
I just run the magistral-small-2506-zdunk, produces all sorts of functional hexes and curses in polish. Would probably help to run it on esoteric hardware, though.
techmago@reddit
you should try templeOS then
Zengen117@reddit
All of the upvotes for templeOS XD
pun_goes_here@reddit
RIP 3dfx :(
jmprog@reddit
kurwa
notwhobutwhat@reddit
Qwen3-32B-AWQ across two 5060's, Gemma3-12B-QAT on a 4070, and BGE3 embedder/reranker on an old 3060 I had lying around. Just running them all in an old gaming rig, i9-9900K with 64GB, using OpenWebUI on the front end. Also running Perplexica and GPT Researcher on the same box.
Getting 35t/s on Qwen3-32B, which is plenty for help with work related content creation, and using MCP tools to plug any knowledge gaps or verify latest info.
The_Crimson_Hawk@reddit
Llama 4 maverick on cpu, 8 channel ddr5 5600
Hurricane31337@reddit
EPYC 7713 with 4x 128 GB DDR4-2933 with 2x RTX A6000 48 GB -> 512 GB RAM with 96 GB VRAM
Using mostly Qwen 3 30B in Q8_K_XL with 128K tokens context. Sometimes Qwen 3 235B in Q4_K_XL but most of the time the slowness compared to 30B isn't worth it for me.
BeeNo7094@reddit
How much was that 128GB of RAM? You're not utilising 4 channels to be able to expand to 1TB later?
Hurricane31337@reddit
I paid 765⬠including shipping for all four sticks.
Yes, when I got them, DeepSeek V3 just came out and I wasnāt sure if even larger models will come out. 1500⬠was definitely over my spending limit but who knows, maybe I can snatch a deal in the future. š¤
BeeNo7094@reddit
765 eur definitely is a bargain compared to the quotes I have received here in India. Do you have any CPU inference numbers for ds q4 or any unsloth dynamic quants? Using ktransformers? Multi GPU helps with ktransformers?
What motherboard?
Hurricane31337@reddit
Sorry, I'm not at home currently, I can do it on Monday. Currently I'm using Windows 11 only though (because of my company; was too lazy to set up a Unix dual boot).
BeeNo7094@reddit
Hey, let me know if you get the time to do this.
eatmypekpek@reddit
How are you liking the 512gb of RAM? Are you satisfied with the quality at 235B (even if slower)? Lastly, what kinda tps are you getting at 235B Q4?
I'm in the process of making a Threadripper build and trying to decide if I should get 256gb, 512gb, or fork over the money for 1tb of DDR4 RAM.
Hurricane31337@reddit
Sorry, I'm not at home currently, I can measure it on Monday. Currently I'm using Windows 11 only though (because of my company; was too lazy to set up a Unix dual boot). If I remember correctly, Qwen 3 235B Q4_K_XL was like 2-3 tps, so definitely very slow (especially with thinking activated). Qwen 3 30B Q8_K_XL is more than 30 tps (or even faster) and mostly works just as well, so I'm always using 30B and rarely, if 30B spits out nonsense, I switch to 235B in the same chat and let it answer the few messages 30B wasn't able to answer (better slow than nothing).
Dismal-Cupcake-3641@reddit
I have 12 GB VRAM. I generally use the quantized version of Gemma 12B in the interface I developed. I also added a memory system and it works very well.
Zengen117@reddit
I'm running the same setup. Gemma3:12b-qat RTX 3060 with 12GB VRAM and I use open-webui for remote accessible interface.
DrAlexander@reddit
With 12GB VRAM I've also mainly stuck to the 8-12b q4 models, but lately I've found that I can also live with the 5 tok/s from gemma3 27B if I just need 3-4 answers, or if I set up a proper pipeline for appropriately chunked text assessment and leave it running overnight.
Hopefully soon I'll be able to get one of those 24GB 3090s and be in league with the bigger small boys!
Dismal-Cupcake-3641@reddit
Yes, now we both need big VRAMs. But I think about what could be different every day. I want to do something that will make even a 2B or 4B model an expert in a specific field and give much better results than large models.
After-Cell@reddit
Please give me a keyword to investigate the memory system.
And also,
How do you access it when not at home on site?
Dismal-Cupcake-3641@reddit
I rented a vps, I make an api call to it, and since it is connected to my computer at home via an ssh tunnel, it makes an api call to my computer at home, gets the response and sends it to me. I developed a simple memory system for myself, each conversation is also recorded, so the model can remember what I'm talking about and continue where it left off.
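For anyone wanting to replicate that, the core is just a reverse SSH tunnel from the home machine to the VPS (hostname and port below are placeholders, not the actual setup):
ssh -N -R 8080:localhost:8080 user@vps.example.com
The VPS receives the API call, forwards it to its local port 8080, and the tunnel carries it back to the model server running at home.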
After-Cell@reddit
Great approach! I'll investigate for sure
Dismal-Cupcake-3641@reddit
Thanks :)
fahdhilal93@reddit
are you using a 3060?
Dismal-Cupcake-3641@reddit
Yes RTX 3060 12GB.
Judtoff@reddit
I peaked at 4 P40s and a 3090, 120GB. Used Mistral Large 2. Now that gemma3 27b is out I've sold my P40s and I'm using two 3090s. Quantized to 8 bits and using 26000 context. Planning on 4 3090s eventually for 131k context.
No-Statement-0001@reddit
i tested llama-server, SWA up to 80K context, and it fit on my dual 3090s with no kv quant. With q8, pretty sure it can get up to the full 128K.
Wrote up findings here: https://github.com/mostlygeek/llama-swap/wiki/gemma3-27b-100k-context
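For context, the linked write-up boils down to a llama-server invocation along these lines (quant, context size and split are illustrative; see the wiki page for the actual flags):
llama-server -m ./gemma-3-27b-it-Q4_K_M.gguf -ngl 99 -fa -c 81920 --tensor-split 1,1
Gemma 3's sliding-window attention keeps the KV cache small enough that a context this large can fit on two 24GB cards without cache quantization.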
Judtoff@reddit
I'll have to check this out. I've got the third 3090 in the mail, but avoiding a fourth would save me some headaches. Even if the third ends up being technically unnecessary, I'd like some space to run TTS and SST and a diffusion model (like SDXL), so the third won't be a complete waste. Thanks for sharing!
After-Cell@reddit
How do you use it when not at home in front of it ?
No-Statement-0001@reddit
wireguard vpn.
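For anyone setting this up, a bare-bones WireGuard client config looks like this (keys, addresses and endpoint are placeholders):
[Interface]
PrivateKey = <client-private-key>
Address = 10.0.0.2/32
[Peer]
PublicKey = <server-public-key>
Endpoint = home.example.com:51820
AllowedIPs = 10.0.0.0/24
PersistentKeepalive = 25
Limiting AllowedIPs to the home subnet means only traffic to the LLM box goes through the tunnel, so the rest of the device's traffic isn't slowed down.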
Klutzy-Snow8016@reddit
For Gemma 3 27b, you can get the full 128k context (with no kv cache quant needed) with BF16 weights with just 3 3090s.
Judtoff@reddit
Oh fantastic haha, I've got my third 3090 in the mail and fitting the fourth was going to be a nightmare (I would need a riser), this is excellent news. Thank you!
Eden1506@reddit
mistral 24b on my steam deck at around 4 tokens/s
LA_rent_Aficionado@reddit
I still use APIs more for a lot of uses with Cursor but when I run locally on 96gb vram -
Qwen3 235B A22B Q3 at 64k context with Q4 kv cache; Qwen3 32B dense Q8 at 132k context
ExtremeAcceptable289@reddit
8gb rx 6600m, 16gb system ram. i (plan to) main qwen3 30b moe
relmny@reddit
The monthly "how much VRAM and what model" post, which is fine, because these things change a lot.
With 16gb VRAM/128gb RAM: qwen3-14b and 30b. If I need more, 235b; and if I really need more/the best, deepseek-r1-0528.
With 32gb VRAM/128gb RAM: the above, except 32b instead of 14b. The rest is the same.
Dyonizius@reddit
same here, how are you running the huge moe's?
*pondering on a ram upgrade
relmny@reddit
-m ../models/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ot ".ffn_.*_exps.=CPU" -c 16384 -n 16384 --prio 2 -t 4 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 -ngl 99 -fa
offloading the MoE to CPU (RAM)
And this is deepseek-r1 (about 0.73t/s), but with ik_llama.cpp (instead of vanilla llama.cpp). I usually "disable" thinking, and I only run it IF I really need to.
-m ../models/huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf --ctx-size 12288 -ctk q8_0 -mla 3 -amb 512 -fmoe -ngl 63 --parallel 1 --threads 5 -ot ".ffn_.*_exps.=CPU" -fa
Dyonizius@reddit
for 32GB vram try this
in addition, use all physical cores on MoEs
for some reason it scales linearly
MidnightHacker@reddit
What quant are you using for R1? I have 88GB of RAM, thinking about upgrading to 128GB
relmny@reddit
ubergarm/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
but I only get about 0.73t/s with ik_llama.cpp.
Dyonizius@reddit
the _R4 is pre-repacked so you're probably not offloading all possible layers right?
Weary_Long3409@reddit
3x3060. Two are running Qwen3-8B-w8a8, and the other one is running Qwen2.5-3B-Instruct-w8a8, embedding model, and whisper-large-v3-turbo.
Mostly for classification, text similarity, comparison, transcription, and its automation. The ones running 8B are old workhorses serving concurrent requests, with prompt processing peaking at 12,000-13,000 tokens/sec.
ATyp3@reddit
I have a question. What do you guys actually USE the LLMs for?
I just got a beefy M4 MBP with 48 gigs of RAM and really only want 2 models. One for Raycast so I can ask quick questions and one for "vibe coding". I just want to know.
Maykey@reddit
16GB. unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF for local
p4s2wd@reddit
2080 Ti 22G x 7 + 3090 x 1, 178GB of VRAM in total.
It's running DeepSeek-V3-0324-UD-Q2 + Qwen3-32B bf16.
Goldkoron@reddit
96gb VRAM across 2 3090s and a 48gb 4090D
However, I still use Gemma3-27b mostly, it feels like one of the best aside from the huge models that are still out of reach.
Dead-Photographer@reddit
I'm doing gemma 3 27b and qwen3 32b q4 or q8 depending on the use case, 80gb RAM + 24gb VRAM (2 3060s)
DAlmighty@reddit
Oh I love these conversations to remind me that I'm GPU poor!
norman_h@reddit
352gb vram across multiple nodes...
DeepSeek 70b model locally... Also injecting DNA from gemini 2.5 pro. Unsure if I'll go ultra yet...
Frankie_T9000@reddit
0 vram, 512gb ram (machine has a 4060 too but I don't use it for this llm). Deepseek q3_k_l
Zengen117@reddit
Honestly, I'm running Gemma3-it-qat 12b on a gaming rig with an RTX 3060 (12GB VRAM). With a decent system prompt and a search engine API key in open-webui it's pretty damn good for general purpose stuff. It's not gonna be suitable if you're a data scientist, if you wanna crunch massive amounts of data or do a lot with image/video. But for modest general AI use, question and answer, quick web search summaries etc, it gets the job done pretty well. The accuracy benefit with the QAT models on my kind of hardware is ENORMOUS as well.
fizzy1242@reddit
72gb vram across three 3090s. I like mistral large 2407
FormalAd7367@reddit
Why do you prefer mistral large over deepseek? I'm running 4x 3090.
fizzy1242@reddit
Would be too large to fit.
RedwanFox@reddit
Hey, what motherboard do you use? Or is it distributed setup?
fizzy1242@reddit
board is Asus rog crosshair viii dark hero x570. all in one case
Ok_Agency8827@reddit
Do you need the NVLink peripheral, or does the motherboard handle the SLI? Also, what power supply do you use? I don't really understand how to SLI these GPUs for multi GPU use.
fizzy1242@reddit
No nvlink, it's not necessary. my psu is 1500W, but I still powerlimit gpus to keep thermals under control
RedwanFox@reddit
Thanks!
candre23@reddit
I also have three 3090s and have moved from largestral tunes to CMD-A tunes.
fizzy1242@reddit
I liked command A too, but i'm pretty sure exl2 doesn't support it yet unfortunately. Tensor splitting it in llamacpp isn't very fast
Zc5Gwu@reddit
Curious about your experience with mistral large. What do you like about it, speed, compared to other models?
fizzy1242@reddit
i like how it writes, it's not as robotic in conversing in my opinion. speed is good enough at 15t/s with exl2
Mescallan@reddit
M1 MacBook air, 16gig ram
Gemma 4b is my work horse because I can run it in the background doing classification stuff. I chat with Claude, and use Claude code and cursor for coding.
ArchdukeofHyperbole@reddit
6 gigabytes. Qwen 30B. I use online models as well but not nearly as much nowadays
philmarcracken@reddit
is that unsloth? using lm studio or something else?
ArchdukeofHyperbole@reddit
Lm studio and sometimes use a python wrapper of llama.cpp, easy_llama.
I grabbed a few versions of the 30B from unsloth, q8 and q4, and pretty much stick with the q4 because it's faster.
needthosepylons@reddit
12gb vram (3060) and 32gb DDR4. Generally using Qwen3-8b; recently trying out MiniCPM4, which actually performs better than Qwen3 on my own benchmark.
molbal@reddit
8GB VRAM + 48GB RAM, I used to run models in the 7-14b range, but lately I tend to pick Gemma3 4b, or Qwen3 1.7B.
Gemma is used for things like commit message generation, and the tiny qwen is for realtime one liner autocompletion.
For anything more complex, Qwen 30B runs too, but if the smaller models don't suffice it's easier to just reach for Gemini 2.5 for me via open router.
FullOf_Bad_Ideas@reddit
2x 24GB (3090 Ti). Qwen 3 32B FP8 and AWQ.
beedunc@reddit
Currently running 2x 16GB 5080Tis for 32GB, and it's just awful.
I'm about to scrap it all and just get a Mac.
I can waste $3500 on another 32GB of vram, or get a Mac with 88GB(!) of "vram" for about the same price.
Chasing vram with NVIDIA cards in this overpriced climate is a fool's errand.
Mac all the way from now on.
EmPips@reddit (OP)
Curious what issues you're running into? I'm also at 32GB and it's been quite a mixed bag.
beedunc@reddit
Yes, mixed bag. I thought 32 would be the be-all and end-all, as most of my preferred models were 25-28GB.
I load them up (Ollama), and they lie! The "24GB" model actually requires 40+ GB of vram, so - still swapping.
There's no cheap way to add "more" vram, as the PCIE slots are spoken for.
Swapping a 32GB for my 16 only nets me a 16GB increase. For $3500!!!
Selling it and just buying an 88GB VRAM Mac for $2K - solved.
Good riddance, NVIDIA.
EmPips@reddit (OP)
I'm not a fan of modern prices either! But I'm definitely not swapping and I have a similar (2x16GB) configuration to yours.
Are you leaving ctx-size at its default? Are you using flash attention? Quantizing the cache?
beedunc@reddit
I don't really know how to do that stuff, but can it make enough of a difference to overcome a 15GB shortfall? Where do I find out more about those tweaks you point out?
The joke's on me since I thought actual model size (in GB) was closely related to how much vram I needed. Doh!
Secure_Reflection409@reddit
Try Lmstudio. It's reasonably intuitive.
Start with 4096 context and make sure the flash attention box is ticked.
That's as close to native as you're gonna get. It can be tweaked further but start there.
beedunc@reddit
Been doing that, will look into it more. Thanks.
EmPips@reddit (OP)
I made a similar mistake early on and ended up needing to trade some 12GB cards in haha.
And yes actually. IIRC llama-cpp will use model defaults for context size(?), which for most modern models is >100k tokens (that's A TON).
If you're running llama.cpp and llama-server specifically: set the context size (ctx-size) yourself - as an example, somewhere around 14,000 if your use-case doesn't exceed 14,000 tokens (just play around with that a bit).
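A hedged example of what those flags look like together (model name and numbers are placeholders, tune them to your setup):
llama-server -m ./model-Q4_K_M.gguf -ngl 99 -fa -c 14336 --cache-type-k q8_0 --cache-type-v q8_0
Capping the context and quantizing the KV cache to q8_0 can free several GB of VRAM compared to a model's 100k+ default context, usually with little quality loss.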
beedunc@reddit
Iām going to look into that, thanks!
BZ852@reddit
You can use some of the nvme slots to do just that FYI. You can also convert a PCI lane to multiple lanes too.
Would suck for anything latency sensitive, but thankfully LLMs are not that.
colin_colout@reddit
96gb very slow iGPU so I can run lots of things but slowly.
Qwen3's smaller MoE q4 is surprisingly fast at 2k context and slow but usable until about 8k.
It's a cheap mini pc and super low power. Since MoEs are damn fast and perform pretty well, I can't imagine an upgrade that is worth the cost.
Thedudely1@reddit
I'm running a 1080 Ti, for full GPU offload I run either Qwen 3 8B or Gemma 3 4B to get around 50 tokens/second. If I can wait, I'll do partial GPU offload with Qwen 3 30B-A3B or Gemma 3 27b (recently Magistral Small) to get around 5-15 tokens/second. I've been experimenting with keeping the KV cache in system ram instead of offloading it to VRAM in order to allow for much higher context lengths and slightly larger models to have all layers offloaded to the GPU.
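For anyone wanting to try the same trick, llama.cpp exposes it as a single flag; a rough sketch (model and context size are illustrative, not the poster's exact command):
llama-server -m ./Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 --no-kv-offload -c 32768
--no-kv-offload keeps the KV cache in system RAM so all weight layers can stay on the GPU; prompt processing gets slower, but it buys room for longer context and slightly larger models.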
techmago@reddit
ryzen 5800x
2x3090
128GB RAM
nvme for the models.
i use qwen3:32b + nevoria (llama3: 70b)
sometimes: qwen3:235b (is slow... but i can!)
NNN_Throwaway2@reddit
24GB, Qwen3 30b a3b
ganonfirehouse420@reddit
Just set up my solution. My second PC got a 16gb vram gpu and 32gb ram. Running qwen3-30b-a3b so far till I find something better.
ttkciar@reddit
Usually I use my MI60 with 32GB of VRAM, but it's shut down for the summer, so I've been making do with pure-CPU inference. My P73 Thinkpad has 32GB of DDR4-2666, and my Dell T7910 has 256GB of DDR4-2133.
Some performance stats for various models here -- http://ciar.org/h/performance.html
I'm already missing the MI60, and am contemplating improving the cooling in my homelab, or maybe sticking a GPU into the remote colo server.
PraxisOG@reddit
2x rx6800 for 32gb vram and 48gb of ram. I usually use Gemma 3 27b qat4 to help me study, llama 3.3 70b iq3xxs when Gemma struggles to understand something, q4 qwen 3 30b/30b moe for coding. I've been experimenting with an iq2 version of qwen 3 235b, but between the low quant and 3.5tok/s speed it's not super useful.
mobileJay77@reddit
RTX 5090 with 32GB VRAM. I mostly run Mistral Small 3.1 @Q6, which leaves me with 48k context.
Otherwise I tend to mistral based devstral or reasoning. GLM works for code but failed with MCP.
MixChance@reddit
If you have 6GB or less VRAM and 16GB RAM, don't go over 8B parameter models. Anything larger (especially models over 6GB in download size) will run very slow and feel sluggish during inference, and can wear on your device over time.
After lots of testing, I found the sweet spot for my setup is:
8B parameter models
Quantized to Q8_0, or sometimes FP16
Fast responses and stable performance, even on laptops
My specs:
GTX 1660 Ti (mobile)
Intel i7, 6 cores / 12 threads
16GB RAM
Anything above 6GB in size for the model tends to slow things down significantly.
Quick explanation of quantization:
Think of it like compressing a photo. A high-res photo (like a 4000x4000 image) is like a huge model (24B, 33B, etc.). To run it on smaller devices, it needs to be compressed - that's what quantization does. The more you compress (Q1, Q2...), the more quality you lose. Higher Q numbers like Q8, or FP16, offer better quality and responses but require more resources.
Rule of thumb:
Smaller models (like 8B) + higher float precision (Q8 or FP16) = best performance and coherence on low-end hardware.
If you really want to run larger models on small setups, you'll need to use heavily quantized versions. They can give good results, but often they perform similarly to smaller models running at higher precision - and you miss out on the large model's full capabilities anyway.
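A rough back-of-the-envelope check for whether a model's weights will fit (ignoring KV cache and runtime overhead, which add a few more GB):
8B params x 2 bytes (FP16) = ~16 GB
8B params x 1 byte (Q8_0) = ~8 GB
8B params x ~0.6 bytes (Q4_K_M, ~4.8 bits/weight) = ~4.8 GB
Compare that against your VRAM, and remember anything that doesn't fit spills into much slower system RAM.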
Extra Tip:
On the Ollama website, click "View all models" (top right corner) to see all available versions, including ones optimized for low-end devices.
You do the math - based on my setup and results, you can estimate what models will run best on your machine too. Use this as a baseline to avoid wasting time with oversized models that choke your system.
freedom2adventure@reddit
daily driver llama-server --jinja -m ./model_dir/Llama-3.3-70B-Instruct-Q4_K_M.gguf --flash-attn --metrics --cache-type-k q8_0 --cache-type-v q8_0 --slots --samplers "temperature;top_k;top_p" --temp 0.1 -np 1 --ctx-size 131000 --n-gpu-layers 0
Running on raider ge66 64gb ddr5 12th gen i9, 3070 ti 8gb vram usually get .5-2 tokens/s, usually coherent to about 75k context before it is too slow to be useful.
Felladrin@reddit
32GB, MLX, Qwen3-14B-4bit-DWQ, 40K-context.
When starting a chat with 1k tokens in context:
- Time to first token: ~8s
- Tokens per second: ~24
When starting a chat with 30k tokens in context:
- Time to first token: ~300s
- Tokens per second: ~12
haagch@reddit
16 gb vram, 64 gb ram. I don't daily drive any model because everything that runs with usable speeds on this is more or less a toy.
I'm waiting until any consumer GPU company starts selling hardware that can run useful stuff on consumer PCs instead of wanting to force everyone to use cloud providers.
If the Radeon R9700 has a decent price I'll probably buy it but let's be real, 32 gb is still way too little. Once they make 128 gb GPUs for $1500 or so, then we can start talking.
HugoCortell@reddit
My AI PC has ~2GB VRAM I think. It runs SmolLM very well. I do not drive it daily because it's not very useful.
My workstation has 24GB but I don't use it for LLMs.
RelicDerelict@reddit
What are you using the SmolLM models for?
AppearanceHeavy6724@reddit
20 GiB. Qwen3 30B-A3B coding, Mistral Nemo and Gemma 3 27B creative writing.
Western_Courage_6563@reddit
12gb, and it's mostly the deepseek-r1 qwen distill 8b. And others within the 7-8b range
throw_me_away_201908@reddit
32GB unified memory, daily driver is Gemma3 27B Q4_K_M (mlabonne's abliterated GGUF) with 20k context. I get about 5.2t/s to start, drifting down to 4.2 as the context fills up.
pmv143@reddit
Running a few different setups, but mainly 48GB A6000s and 80GB H100s across a shared pool. Daily-driver models tend to be 13B (Mistral, LLaMA) with some swap-ins for larger ones depending on task.
We've been experimenting with fast snapshot-based model loading, aiming to keep cold starts under 2s even without persistent local storage. It's been helpful when rotating models dynamically on shared GPUs.
SanDiegoDude@reddit
You just reminded me that my new AI box is coming in next week. 128GB of unified system ram on the new AMD architecture. Won't be crazy fast, but I'm looking forward to running 70B and 30B models on it.
unrulywind@reddit
RTX 4070ti 12gb and RTX 4060ti 16gb
All around use local:
gemma-3-27b-it-UD-Q4_K_XL
Llama-3_3-Nemotron-Super-49B-v1-IQ3_XS
Mistral-Small-3.1-24B-Instruct-2503-UD-Q4_K_XL
Coding, local, VS Code:
Devstral-Small-2505-UD-Q4_K_XL
Phi-4-reasoning-plus-UD-Q4_K_XL
Coding, refactoring, VS Code:
Claude 4
Long-Shine-3701@reddit
128GB VRAM across (2) Radeon Pro W6800x duo connected via Infinity Fabric. Looking to add (4) Radeon Pro VII with Infinity Fabric for an additional 64GB. Maybe an additional node after that. What interesting things could I run?
5dtriangles201376@reddit
16+12gb, run Snowdrop
AC1colossus@reddit
That is, you offload from your 16 of VRAM? How's the latency?
5dtriangles201376@reddit
Dual GPU 16gb + 12gb. It's actually really nice and although it would have been better to have gotten a 3090 when they were cheap I paid a bit less than what used ones go for now
AC1colossus@reddit
Ah yeah makes sense. Thanks.
Dicond@reddit
56gb VRAM (5090 + 3090), Qwen3 32b, QwQ 32b, Gemma3 27b have been my go to. I'm eagerly awaiting the release of a new, better ~70b model to run at q4-q5.
StandardLovers@reddit
48GB vram, 128GB ddr5. Mainly running qwen 3 32b q6 w/16000 context.
getfitdotus@reddit
Two dedicated AI machines: 4x Ada 6000 and 4x 3090. The 3090s run qwen3-30b in bf16 with kokoro tts. The Adas run qwen3-235B in GPTQ int4. Used mostly via APIs. Also keep the qwen 0.6B embedding model loaded. All with 128k context. 30B starts at 160t/s and 235B around 60t/s.
jgenius07@reddit
24gb vram on an amd rx7900xtx Daily-ing a Gemma3:27b
EmPips@reddit (OP)
What quant and what t/s are you getting? I'm using dual 6800's right now and notice a pretty sharp drop in speed when splitting across two GPU's (llama-cpp rocm)
jgenius07@reddit
I'm consistently getting 20t/s. It's the 4-bit quantised version. I have it on a pcie5 slot but it runs at pcie4 speed.
EmPips@reddit (OP)
That's basically identical to what I'm getting with the 6800's, something doesn't seem right here. You'd expect that 2x memory-bandwidth to show up somewhere.
What options are you giving it for ctx-size? What quant are you running?
Ashefromapex@reddit
On my macbook pro with 128gb I mostly use qwen3 30b and 235b because of the speed. On my server I have a 3090 and am switching between glm4 for coding and qwen3-32b for general purpose.
TopGunFartMachine@reddit
~160GB total VRAM. Qwen3-235B-A22B. IQ4_XS quant. 128k context. ~200tps PP, ~15tps generation with minimal context, ~8tps generation at lengthy context.
No_Information9314@reddit
24GB VRAM on 2x 3060s, mainly use Qwen-30b
vulcan4d@reddit
42GB VRAM with 3x P102-100 and 1x 3060. I run Qwen3 30b-a3b with a 22k context to fill the VRAM.
makistsa@reddit
qwen3 235B q3 ~5.5t/s (starts at 5.75 and falls to 5.45) and qwen3 30B q6
128gb ddr4, 16gb vram
I am waiting for dots. It's the perfect size, if it's good.
Equivalent-Stuff-347@reddit
What's dots? Search is failing me here
makistsa@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1l4mgry/chinas_xiaohongshurednote_released_its_dotsllm/
Equivalent-Stuff-347@reddit
Thanks! Open source MoE with 128 experts, top-6 routing, and 2 shared experts sounds lit
SaratogaCx@reddit
I think those are the little animated bouncing '...'s you see when you are waiting for a response.
SplitYOLO@reddit
24GB VRAM and Qwen3 32B
opi098514@reddit
I have 132 gigs of vram across 3 machines and I daily drive... ChatGPT, GitHub Copilot, Gemini, Jules, and Claude. I'm a poser, I'm sorry. I use all my vram for my projects that use LLMs, but they aren't used for actual work.
ObscuraMirage@reddit
32gb, and from Ollama mainly Gemma3:12B (I pair it sometimes with Gemma3:4b or Qwen2.5VL 7B), with Unsloth's MistralSmall3.1:24B or Qwen3 30B for the big tasks.
Slowly moving toward llamacpp.
Zc5Gwu@reddit
I have 30gb vram across two gpus and generally run qwen3 30b at q4 and a 3b model for code completion on the second gpu.
getmevodka@reddit
i have up to 248gb vram and use either qwen3 235b a22b q4kxl with 128k context with 170-180gb in size whole.
or r1 0528 iq2xxs with 32k context with 230-240gb in size whole.
depends.
if i need speed i use qwen3 30b a3b q8kxl with 128k context - dont know the whole size of that tbh. its small and fast though lol.
tta82@reddit
128GB M2 Ultra and 3090 24GB on an i9 PC.
EasyConference4177@reddit
I got 144gb: 2x 3090 Turbos at 24gb each and 2x Quadro 8000s at 48gb each... but honestly, if you can access 24gb and Gemma 3 27b, that's all you need. I'm just an enthusiast for it and want to eventually build my own company on AI LLMs.
IkariDev@reddit
Dans PE 1.3.0. 36gb vram + 8gb vram on my server; 16gb ram + 16gb ram on my server.
marketlurker@reddit
llama3.2, but playing with llama4. I run a Dell 7780 laptop with 16GB VRAM and 128GB RAM
vegatx40@reddit
24g, gemma3:27b
plztNeo@reddit
128GB unified memory. For speed I'm leaning towards Gemma 3 27B or Qwen3 32B.
For anything chunky I tend towards Llama 3.3 70B
findingsubtext@reddit
72GB (2x 3090, 2x 3060). I run Gemma3 27B because it's fast and doesn't hold my entire workstation hostage.
maverick_soul_143747@reddit
Testing out Qwen 3 32B locally on my macbook pro