To 16GB VRAM users, plug in your old GPU
Posted by akira3weet@reddit | LocalLLaMA | View on Reddit | 95 comments
For those who want to run the latest dense ~30B models and only have 16GB of VRAM: if you have an old card with 6GB of VRAM or more, plug it in.
What matters is that everything fits in VRAM, even when it's split across two cards, and even if one of them is quite weak.
I have a 5070 Ti 16GB and an old 2060 6GB. The common wisdom is that you need two identical GPUs to maximize performance, but one day I was struck by the idea: why not give it a try?
Let's see: if you didn't buy a motherboard just for LLMs, it's very likely you have one true PCI-E x16 slot and a couple that look like x16 but are actually wired as x4, just like me. That's a perfect slot for an old card.
16GB + 6GB = 22GB, which gets close to the 24GB class of cards. If you have a better old card, lucky you!
Then you run llama-server with a config like this:
```ini
[*]
jinja = true
cache-prompt = true
n-gpu-layers = 999
no-mmap = true
mlock = false
np = 1
t = 0

[qwen/qwen3.6-27b]
model = ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf
mmproj = ./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf
reasoning = on
dev = Vulkan1,Vulkan2
c = 128000
no-mmproj-offload = true
cache-type-k = q8_0
cache-type-v = q8_0
```
A couple of specific points (a rough command-line equivalent is sketched after this list):
- dev=Vulkan1,Vulkan2 enables the two GPUs; run `llama-server.exe --list-devices` to see what you should set.
- no-mmap and mlock=false keep the model out of your system RAM.
- np=1, no-mmproj-offload (or just don't supply the mmproj model), and cache-type-k/cache-type-v q8_0 minimize the VRAM needed.
- n-gpu-layers=999 to prefer GPU offloading; this may be unnecessary, but I keep it.
- split-mode=layer splits the layers asymmetrically across the devices; "layer" is the default though, so you don't see it above.
- c=128000 could be a bit of a stretch, but it works well enough for me.
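For reference, here's roughly the same setup as a single llama-server invocation (a sketch mapped from the config above; I only included flags with obvious CLI counterparts, so double-check against `llama-server --help` on your build):
```bash
# Sketch: roughly the config above as one llama-server command line.
# Device names (Vulkan1,Vulkan2) come from `llama-server --list-devices` on this box.
llama-server \
  -m ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj ./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf \
  --jinja \
  --device Vulkan1,Vulkan2 \
  -ngl 999 \
  --split-mode layer \
  -c 128000 \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --no-mmproj-offload \
  --parallel 1
```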
BTW I also have an Intel integrated GPU, which is what the monitors are plugged into.
Some numbers: at 128k max context with 71k tokens of actual context in use, pp = 186 t/s and tg = 19 t/s, quite a usable speed.
[56288] prompt eval time = 5761.53 ms / 1076 tokens ( 5.35 ms per token, 186.76 tokens per second)
[56288] eval time = 58000.15 ms / 1114 tokens ( 52.06 ms per token, 19.21 tokens per second)
[56288] total time = 63761.69 ms / 2190 tokens
[56288] slot release: id 0 | task 654 | stop processing: n_tokens = 71703, truncated = 0
jacek2023@reddit
Yes, any VRAM will be faster than RAM. I use a 3060 as an extra bonus to my 3x3090, but I enable it only for the biggest models.
Ready_Performance_35@reddit
how do you plug 4 cards in one desktop?
UnlikelyPotato@reddit
You need a motherboard that supports it, likely riser cables (PCIe extension cables), a case or chassis that holds it all, and possibly more than one PSU. Realistically, if a motherboard has 4x PCIe slots, it will typically support 4x PCIe devices.
redditpad@reddit
I was about to write a blog post (my first ever) about a similar concept, but mixing CUDA/ROCm via RPC - same idea though. Basically my view is you can replicate a 24GB card for cheap - maybe 40-50% of the performance for 20% of the cost - and the splitting works. You do need to test exactly where the limit is; it isn't perfect.
Prize_Weird_603@reddit
Is there any hope for 5060 ti and rx 7600 ?
tmvr@reddit
Why are you using Vulkan with a 5070Ti and a 2060? Use CUDA.
akira3weet@reddit (OP)
I still don't get enough VRAM for a long context using CUDA unfortunately.
jwpbe@reddit
this is a nonsensical response
akira3weet@reddit (OP)
Nah I’m talking about the Cuda vram overhead. It’s well known.
GhostRunner01@reddit
For what it's worth, I have a single 4070super and am getting about 30-40t/s with 128k context all with cuda. I fit what I can in vram and load everything else into my system's DDR4.
I think vulkan is getting you a big performance hit.
Borkato@reddit
Source?
jwpbe@reddit
is it a 'well known' thing unique to cuda or is it just what an llm told you
initalSlide@reddit
I was wondering the same… is there a specific reason for Vulkan? Because I see none
wektor420@reddit
Llama.cpp defaults to vulkan on windows - maybe that is the reason?
pulse77@reddit
Llama.cpp does not "default" to Vulkan on Windows: on the llama.cpp release download page (https://github.com/ggml-org/llama.cpp/releases) you select either Vulkan or CUDA - there is no "default" here...
wektor420@reddit
I installed it through winget, I do not remember choosing during that install process
But yeah you can download directly from github sure
alex20_202020@reddit
Vulkan works with the nouveau driver on Linux. In my experience it's slow, but I hope it will be faster with newer Ubuntu etc.
ChocoPichu@reddit
I also use vulkan even though I have the nvidia card. Idk, the vulkan was really easy to set up, while cuda gave me a lot of trouble setting it up the last time i tried it. Is there a huge difference between cuda and vulkan?
jwpbe@reddit
yeah, an extremely wide gulf in performance, because you're not using operations literally designed for the hardware in your card
they pre-compile CUDA builds of llama-server, it couldn't be any easier
OttoRenner@reddit
It is so, so funny XD
Just two days ago I had to argue with Gemini that YES, I WANT to try and put my old 4GB GPU in the setup with a 3070 and a 3090, because we are in some kind of post-apocalyptic Fallout situation where scarcity AND demand are increasing and that Frankenstein Builds with old/unconventional cards and setups will be seen all over the place pretty soon... and here comes your post lol
The models will get better and fit on smaller cards, until even 4GB and lower cards will be valuable.
Those cards are especially useful for repetitive stuff that would otherwise take up valuable space on the big cards, or to offload your main agent and keep it somewhat active for small tasks during heavy computation...lots of options for integration.
Savantskie1@reddit
Most frontier models are heavily influenced to stop normies from trying to run local models, in the hope you'll just drop it and keep working with them. For the longest time I had to keep arguing with ChatGPT that MI50s were still viable, and it kept trying to steer me towards $1000 cards. I decided to ignore it and bought them anyway, because yeah, ROCm dropped support for them, but Vulkan doesn't care about the generation as long as the card can be forced to do compute.
OttoRenner@reddit
While I can't rule that out...
The data the models were trained on is full of years and years of people talking about hardware through the lens of gaming, where it absolutely wouldn't make sense to use old hardware. And, at least when it comes to Gemini, I can say it has no idea how bad the situation is with GPU prices. I had to screenshot some actual offers online, and Gemini was baffled by how much people are paying (and forgot about it 10 minutes later).
So... it could be BIG AI trying to fool us (absolutely likely!), or "normal" training bias combined with out-of-date world knowledge. Or both... or none, lol.
Savantskie1@reddit
Oh it’s definitely a combination of both. But even newer models still try to steer “normies” away from self hosting. So I’m still convinced that they’re just trying to keep users on their own models
Mysterious_Role_8852@reddit
I have a 3090 Ti and a 2070. The 2070 is quite a bottleneck for the 3090 Ti. With a Qwen 3.6 27B Q6 quant I get around 30 t/s when loading only on the 3090 Ti (with around 25k context) and only 20 t/s when splitting across both GPUs (82/18 split, 130k context). The prompt processing is also far slower. But it's definitely much better than offloading to CPU.
Strange_Test7665@reddit
I run a 5090/5060 machine, same Qwen 27B Q6 quant. When I need extra context or extra space for other processes on the 5090, I use both cards; I restrict it to only the 5090 when I need speed. Dual-card is about 40% slower in t/s than single, I find.
sleepy_roger@reddit
It's slower because the 5060 is slower. 2x 5090s, for example, would show a speedup with tensor parallel.
rainbyte@reddit
If it has enough PCIe bandwidth; otherwise it's better to use pipeline parallelism.
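In llama.cpp terms that's roughly the --split-mode choice; a sketch (the model path is a placeholder, and row split is only the closest llama.cpp analogue of tensor parallel):
```bash
# Pipeline-style split: whole layers assigned per GPU (the default), little PCIe traffic
llama-server -m model.gguf -ngl 999 --split-mode layer --tensor-split 3,1

# Row split: shards tensors across GPUs; leans much harder on PCIe bandwidth
llama-server -m model.gguf -ngl 999 --split-mode row
```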
Glittering-Call8746@reddit
How much more context u getting with extra 5060?
mac1e2@reddit
Qwen3.6-35B-A3B on GTX 1650 4GB / 62GB RAM: constrained systems still matter
A constrained-hardware result from April 27, 2026.
Machine
- GTX 1650 4GB
- 62GB RAM
- i7-7700
- llama.cpp
- single-slot only
Live profile
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
- --cpu-moe
- -c 65536
- -ctk q8_0 -ctv q8_0
- --mlock
- --cache-ram 32768
- --cache-reuse 256
- --reasoning-budget 256
- --parallel 1
- -ngl 99
- -fa on
- --threads 4
Measured cold on-box retrieval probes, all correct:
- 9289 prompt tokens -> 117.48s
- 11826 prompt tokens -> 152.48s
- 14449 prompt tokens -> 158.67s
Also verified:
- tool calling works
- strict JSON works (a request sketch follows this list) with:
{"chat_template_kwargs":{"enable_thinking":false}}
- decode is around 20-21 tok/s on this host
- idle llama-server RSS is ~22GB by design
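A minimal request sketch for the strict-JSON point above (port and prompt are just examples; the chat_template_kwargs field is passed through to the chat template):
```bash
# Sketch: OpenAI-compatible request with thinking disabled via chat_template_kwargs
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Return a JSON object with a single key \"status\"."}],
        "chat_template_kwargs": {"enable_thinking": false}
      }'
```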
Exact llama.cpp command line:
```bash
/home/jvm/src/llama.cpp/build-f65bc34/bin/llama-server \
  -m /home/jvm/models/llama.cpp/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 --port 8080 --jinja \
  --parallel 1 -ngl 99 --cpu-moe --threads 4 \
  --cache-reuse 256 --cache-ram 32768 --reasoning-budget 256 \
  --mlock -fa on -ctk q8_0 -ctv q8_0 -c 65536
```
Systemd unit layout:
Base unit:
```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=jvm
Group=jvm
WorkingDirectory=/home/jvm/src/llama.cpp
ExecStart=/home/jvm/src/llama.cpp/build-f65bc34/bin/llama-server -m /home/jvm/models/llama.cpp/Qwen3-4B-GGUF/Qwen3-4B-Q4_K_M.gguf --host 0.0.0.0 \
    --port 8080 --jinja -c 4096 --parallel 1 -ngl 99
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
Drop-in override:
```ini
# /etc/systemd/system/llama-server.service.d/41-tuning.conf
[Service]
LimitMEMLOCK=infinity
ExecStart=
ExecStart=/home/jvm/src/llama.cpp/build-f65bc34/bin/llama-server -m /home/jvm/models/llama.cpp/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    --host 0.0.0.0 --port 8080 --jinja -c 65536 --parallel 1 -ngl 99 --cpu-moe --threads 4 --cache-reuse 256 --cache-ram 32768 --reasoning-budget 256 \
    --mlock -fa on -ctk q8_0 -ctv q8_0
```
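The drop-in only takes effect after a reload and restart; the usual sequence (unit name as above) is:
```bash
# After adding or changing the drop-in, reload unit files and restart the service
sudo systemctl daemon-reload
sudo systemctl restart llama-server.service

# Verify the override's ExecStart is the one actually in effect
systemctl cat llama-server.service
```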
So no, this does not beat a 3090.
That is not the point.
The point is that this is a real secondary node on a 4GB GTX 1650 class box, not just a screenshot of something barely loading.
A lot of local-LLM discussion now is a strange mix of vibecoding and brute force:
- add more VRAM
- keep larger GPUs hot
- accept defaults
- confuse “it fits” with “it works”
- confuse “token/sec” with understanding the memory path
- never ask what must remain resident, what can spill, what latency trade is being made, or what correctness costs are hidden by the benchmark
That is not an indictment of new people. It is just a description of a culture that has grown used to hardware forgiving bad habits.
Some of us learned on machines that did not forgive anything.
If you spent time young with a Commodore 64 or a Sinclair ZX, you learned the right lessons very early:
- memory is not abstract
- dataflow is not abstract
- state is not abstract
- every layer has a cost
- every convenience has a price
- if you waste resources, the machine tells you immediately
- if something works, it is usually because you understood the machine rather than because the machine rescued you
That training stays with you.
So the claim here is not “old hardware beats new hardware.”
Obviously it does not.
The claim is narrower and, I think, more interesting:
constrained-systems discipline still goes further than a lot of modern GPU-rich local-LLM practice would suggest.
If people want, I can also post the reasoning around why these flags were chosen and which alternatives were worse on this exact box.
themule71@reddit
Very interesting of course. But you can't just drop "ZX Spectrum or Commodore 64"... It's like saying "I used vi or emacs". You have to state on which side you fought the War. :)
Clank75@reddit
C64, not Speccy. Amiga, not Atari. Emacs, not vi. BSD, not SVR4. Init, not Systemd. Big-endian, not little-endian.
I may have lost most of the wars, but goddamn I was right about all of them.
;-)
themule71@reddit
Lol. ZX over C64. Vi over emacs. Linux over 386BSD.
Call it "network byte order" and you can say big endian won. Anyway used the 68000, PowerPC and Sparc64 before finally going AMD64.
I resisted systemd but kinda moved to its side when it was about detractors.
seanthenry@reddit
In case you didn't know, MX Linux ships set up so you can pick sysVinit or systemd at boot.
mac1e2@reddit
i actually started programming on those 2 competing platforms at the same time (ZX and C64). i would say the c64 was easier, but how i loved my zx. =) Loved how 2 old timers instantly got the important part of that pasted mess... maybe i should come back to doing some fun stuff like the above. Most reasoning on that came from the hardware constraints of the machine. got a lot of notes.. but most came from reading docs, not posts. Seems there is no magic solution... and i really found the poster's original challenge interesting, need to dive into that.
farkinga@reddit
I want some pins for my formal attire, Gaddafi style, to show the world where I was, when. Turns out you and i were in many of the same battles!
mac1e2@reddit
i actually made part of both sides.. my neighbour had a commodore..and i was 6. =)
Sooperooser@reddit
I run Qwen 3.6-35b on a gt1030 with 2gb vram and a r7 3700x and 64gb dual channel ddr4 ram. I don't care what the haters say.
PropertyRapper@reddit
I would love to hear your logic on why you chose which flags, I’m trying to learn my hardware better so any benchmark-driven tuning would also be helpful.
Basically did you just know that because X you must Y, did you change Z because Q on benchmark, or combo?
Any tooling you use to watch resources? I just use btop as most advanced dashboard, and only recently. I’m learning :)
mac1e2@reddit
i was very constrained at that gpu. VERY! i just wanted to prove it was possible, ill leave that node running over the week. and report. Also i squeezed everything i could, without risk of having house burn down...so things gpu and cpu only happen while querying that model. My reasoning was never to use it over context of that size, degradation always occurs. Mem lock was a logical choice because of latency and i had lots to spare on that specific machine, and keeping things on ram is cheap, doesnt tear down disk. Keep in mind it should not be an agentic node full 24h, that will wear hardware down, and i dont believe its a good policy... it was just a fun challenge, to show what is really possible. so for now its just a queriable stable node with a llm model in my lan. , i also didnt do bios and other tunning i could.. just the RAM hack .. But gonna do some rtx tweaking before jumping to other machines..
interAathma@reddit
Hey, another GPU-poor here. I'm planning to upgrade my potato laptop RAM from 8GBx2 to 8+32 to make it a 40GB system. I got a GTX 1650 4GB with it. Is it worth the upgrade? It's just for hobby; I want to try out larger LLM models for my use case. Currently I can run 4B quantized models fully on GPU on my 1650 with around 10k context length, and that's enough for my use case, and 8B models with GPU offload or CPU only. My confusion is whether it's worth upgrading just the RAM to try out new models.
Ardalok@reddit
Are you sure about 8+32? My thoughts: 16+16 is faster due to dual-channel, but 8+32 lets you fit a better quant of the 35b-a3b model. You could use cpu-moe flag in llama.cpp to load only active params and context to VRAM, but a 4GB card can't hold both a large quant and a decent context size. I’m not sure the trade-off is worth it, might be better to buy 16+16, but it depends on your use case. 8+32 is still faster than single channel because of flex mode. With dual channel you would probably get 10-20 t/s and less with flex mode.
I personally get about 25-30 t/s on 12400f with 32 GB DDR5 and RTX 4060 with UD-q4_k_xl and 90% of RAM used by this + web browser and Windows 10.
So if you don't care about speed and just want to use the best local model at a readable speed, 8+32 is probably better. But if you want speed, then 16+16 is the way to go.
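A rough sanity check on those numbers (all constants are assumptions: ~3B active parameters for an A3B model, ~0.6 bytes per weight at a Q4-ish quant, ~51 GB/s for dual-channel DDR4-3200): decode speed is roughly memory bandwidth divided by the bytes of active weights read per token.
```bash
# Ceiling estimate: bandwidth (GB/s) / active weights read per token (GB)
echo "51.2 / (3.3 * 0.6)" | bc -l   # ~26 t/s upper bound; real-world usually lands well below this
```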
akira3weet@reddit (OP)
Only if you can get that ram for dirt cheap. Remember that the larger the model the slower it runs. See if you can get a better value from a used low/mid end gpu.
interAathma@reddit
Yeah, I'm looking for used RAMs, even those prices are inflated in this current market. Since its a laptop, only thing I can upgrade is ram, so I'm trying to squeeze out every bit of performance it can.
mac1e2@reddit
well i edited v-lock so i could have some usage of the extra ram i had on the box. no need to do that. if you run any linux, most steps are there. slap something like cachy and play around. keep in mind most problem on llms are still long context. i was lazy and just spammed posted.. somethings there.
Oshden@reddit
Yes please! This is one of the most helpful posts I’ve seen so far on constrained use of local LLMs
morrischu1986@reddit
Yes, could you explain why those flags were chosen please? Thanks in advance!
redditpad@reddit
Nice I was trying a similar approach using that exact same model family
Far-Awareness8746@reddit
Could i run a 3090 24gb and a 4060ti 16gb together? Would i need one of those bridges like the old days?
kil341@reddit
You mean the old gfx card that blocks the fans on the new one if I put it in the other slot on my motherboard?
Pwc9Z@reddit
"To the homeless users, just live in your old house"
nitestryker@reddit
Out of curiosity , could I do this with an egpu ?
Savantskie1@reddit
Possibly, but don’t quote me on that.
zulutune@reddit
It looks like 2x 4060 Ti is the sweet spot. I can get two of them for under 1000 euros. Spend 1000 more on the other parts and you have 32GB for less than 2k euros.
akira3weet@reddit (OP)
On paper, dual 4060 Ti, dual 5060 Ti, and dual 5070 Ti seem to offer similar value per dollar; some might be slightly higher than others. It also depends on which fits the bill.
SaltAddictedMan@reddit
I have a 1080ti and 3 1080s, and i got them all working. But I never thought to plug the monitor into the cpu lol
Ephemere@reddit
In general, I absolutely agree with the advice above.
I did however try to push this a bit further and tried to make use of the remaining PCIe x1 slot to add a 12GB 3060 to my dual 3090s. To do that I had to get a second power supply and a power supply synchronization card, plus a cable extender, because of course a 3060 isn't fitting into a PCIe x1 slot. It worked! The card showed up in the system, an additional 12GB of VRAM! And... it absolutely cratered performance. It was way slower than just leaving the remaining layers that couldn't fit in the 3090s on the CPU.
It was a fun project, though.
Imaginary_Belt4976@reddit
Cool, didn't know llama.cpp could do this. Don't suppose there's a way to easily bridge across two machines?
Ell2509@reddit
I use layer split to share larger models over a 9700ai and a w6800, 2 gens apart in architecture terms. Works though. That said, it was tricky.
If your GPUs are different BRANDS, then the best thing to do is have the OS/any programs use the weakest GPU, and load model weights/KV cache onto the strongest card only. That way, your best card is 100% available for inference.
Glittering-Call8746@reddit
Care to share ur os setup and llama.cpp command line ?
Ell2509@reddit
I like this for models that fit on a single 32gb gpu:
```bash
~/llama.cpp/build/bin/llama-server \
  -m \
  --alias \
  --main-gpu 1 \
  -ngl 999 \
  --parallel 1 \
  --no-warmup \
  --ctx-size 32768 \
  --cache-ram 0 \
  -fa auto \
  --host 0.0.0.0 --port 3000
```
arjuna66671@reddit
I got a 5060 16gb TI and an old 3060 12gb rotting in its box... But I need a riser cable to put it in. I'll try it xD.
general_sirhc@reddit
3060
Rotting in its box
Some of us just upgraded to the 3060 😱
_-_David@reddit
I had a 5060ti 16gb rotting in its box that I just added.. 😬
general_sirhc@reddit
That's my current dream card.
But I realistically want it so the AI stuff I fiddle with fits in memory and it's just not worth it because I rarely game and I only have a 1080p monitor for the games I do play.
_-_David@reddit
I feel that. I went from 1080 to 5060, then it was too new and none of the drivers were stable. I threw it back in the box. But I had the money for a 5090 splurge after an investment paid off unexpectedly well. At the end of the day I am also just fiddling around with AI, don't game, and constantly think about what I could buy instead if I sold them both lol
arjuna66671@reddit
We were lucky not only buying our new PCs when the hardware prices were at their lowest, but also the 5060 upgrade was cheap xD. Two months later and it wouldn't have happened 😅
falcongsr@reddit
I got a 5070Ti and I haven't sold my 3060 12GB yet. Hmmmm
initalSlide@reddit
A 3060 should still give quite good performance!
redpandafire@reddit
I want to do this but the thing stopping me is power and space. I run a 16gb 4080S, and have a 12GB 3080ti, but I don’t see a slot where the 3080 can fit without grinding the 4080. Plus what power even runs these two cards lol I’m gonna have a stove top.
miltonthecat@reddit
For what it’s worth, I put my 4070 back into circulation with my 5070ti as an eGPU over Oculink (PCIE 4.0 x4) for a total of 28 GB VRAM. Still seeing 25% performance gains having all layers of Qwen split across the two GPUs vs. most layers on the 5070ti and some on the CPU.
Cable management has improved a bit since this pic.
markovianmind@reddit
do u have a separate power supply for that? also besides power what cable u need
Yanix88@reddit
There are also eGPUs with integrated power supplies, like the Aoostar AG02. The eGPU is usually connected to your PC using either Oculink (some mini PCs have an Oculink port directly available; if not, it can be added to any PC using a PCIe or M.2 slot), which is the fastest possible connection for an eGPU, or using USB4/Thunderbolt, which is still fast but a little slower than Oculink.
miltonthecat@reddit
I should have bought that one. Looks much cleaner than mine.
miltonthecat@reddit
Yes, I grabbed a cheap Rosewill PSU, an eGPU dock and a PCIEx16 to Oculink breakout card.
markovianmind@reddit
thanks. I have a mini ITX mobo (MSI Edge 650i) in an SFF case and I want to use the second NVMe slot for an NVMe-to-PCIe cable to attach a second external GPU when required; is that doable as well? Like I am thinking of buying a cheap PSU like yourself. I have the 9070 XT in the case and can buy a 9060 XT 16GB for the second external GPU. Would the LLM token inference speed be bottlenecked by the 9060 XT in this case, especially due to the x4 slot?
miltonthecat@reddit
It’s hard for me to say. I’ve been reading this subreddit for maybe two years now and still feel like there's so much I don't understand. My guess would be that, yes, the X4 connection will bottleneck inference, but that bottleneck will be much smaller than having to offload some model layers to the CPU.
rgldx@reddit
I wish I had that new card, since I'm already using your 'old' one as my primary card :^)
cbterry@reddit
I tried this on an HP i9 and the machine wouldn't even POST. A 5060 in the first slot and a 3060 or 3090 in the 2nd slot. 3090+3060 worked, 3090 in the 2nd slot due to size. I have another machine now so it's NBD but wonder if anyone knows why it wouldn't boot?
saxtondk@reddit
What is the power supply in the HP rated at?
cbterry@reddit
It's a 1000w PSU. I have not been able to find anything about what could cause this lol
Repulsive_Coffee_675@reddit
AMD users don't have to worry about that. Modern laptops also come with 32GB of RAM/VRAM (shared). Old GPUs with 16GB are also predominantly AMD.
Local_Phenomenon@reddit
On a Monday!
alex20_202020@reddit
So what? Might be even faster on one card, you have not provided a comparison!
akira3weet@reddit (OP)
Single-digit t/s on a single card at the same context length.
Vapourium@reddit
You should switch over to CUDA and then run a couple of llama-bench runs at different context depths
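Something like this, assuming a CUDA build (model path and sizes are placeholders; check llama-bench --help for the exact flags in your build):
```bash
# Prompt-processing and generation speed at a few prompt sizes, all layers on GPU
llama-bench -m ./Qwen3.6-27B-Q4_K_M.gguf -ngl 999 -p 512,4096,16384 -n 128
```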
initalSlide@reddit
+1 this is the way
misanthrophiccunt@reddit
This is awesome, thank you, especially the commands you described at the bottom.
mac1e2@reddit
hope it helps.. from an old engineer. =)
akira3weet@reddit (OP)
It's a great read. But dude, the format is so bad I had to ask an agent to reformat it for me :)
Didn't know about the --cpu-moe flag before. Learned something new. It's the most important thing enabling the efficiency, I suppose? The other tunings are excellent too.
Funny thing is, if you are like me with 32GB and running Windows, there is barely any room for an LLM to run in RAM, maybe 12GB or so. And RAM today is so much worse value than a GPU...
mac1e2@reddit
going to next machine.. just posted enough to be able to be reproduced. Got challenged by a friend in making that work, told me it was impossible. Lol never finetuned or touched lamma or models before this weekend. Jumping to a another old machine with rtx4060 now.. eventually ill reach my newer computers and mac . =). just posted a working model on something that most of this reddit would say was impossible. and its a node on my local lan. ohh, i won the bet, btw =) 4gb v-ram.. lol
taking_bullet@reddit
I'm about to do a similar thing - adding a Radeon RX 9070 to a GeForce RTX 5070 Ti. Just waiting for the package with a riser.
Damogran6@reddit
I did the same. Upgraded the power supply, though.
Finanzamt_kommt@reddit
You could just power-limit your second GPU; it won't make a real difference for LLMs in llama.cpp, and you don't have to worry about it drawing like 300W.
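On an NVIDIA card that's a one-liner (GPU index and wattage are just examples; the allowed range depends on the card):
```bash
# Cap the second GPU (index 1) at 150 W; resets on reboot
sudo nvidia-smi -i 1 -pl 150
```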
Adventurous-Gold6413@reddit
I only got a laptop :(
mac1e2@reddit
Qwen3.6-35B-A3B on GTX 1650 4GB / 62GB RAM: tuned for the ugly path, not the easy one
A constrained-hardware result from April 27, 2026.
Machine
- GTX 1650 4GB
- 62GB RAM
- i7-7700
- llama.cpp
- single-slot only
Live profile
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
- --cpu-moe
- -c 65536
- -ctk q8_0 -ctv q8_0
- --mlock
- --cache-ram 32768
- --cache-reuse 256
- --reasoning-budget 256
- --parallel 1
- -ngl 99
- -fa on
- --threads 4
What that means in plain terms:
- keep the dense/shared path and KV inside a 4GB VRAM budget
- let MoE spill where it has to
- spend RAM deliberately instead of pretending RAM is sacred
- optimize for one useful request at a time, not fake concurrency
Idle shape:
- \~22GB RSS on llama-server
- low CPU/GPU when idle
- GPU spikes hard during prefill / active compute
Measured cold on-box retrieval probes, all correct:
- 9289 prompt tokens -> 117.48s
- 11826 prompt tokens -> 152.48s
- 14449 prompt tokens -> 158.67s
Also verified:
- tool calling works
- strict JSON works with:
{"chat_template_kwargs":{"enable_thinking":false}}
- decode is around 20-21 tok/s on this host
So this is not “fast” in 3090/4090 terms.
That is not the point.
The point is that this is a real secondary node on a 4GB GTX 1650 class box, not just a screenshot of something barely loading.
If you learned early on that:
- memory is finite
- state matters
- dataflow matters
- every abstraction has a cost
- and hardware limits are design inputs, not excuses
then this kind of tuning makes sense.
If you learned on machines where waste was invisible, it probably doesn’t.
I’m not claiming this beats newer hardware.
I’m claiming “24GB or don’t bother” is often less a law of nature than a statement about how little tuning most people are willing to do.
If useful, I can post the exact command line / systemd profile.