Converted my unused laptop into a family server for gpt-oss 20B
Posted by Vaddieg@reddit | LocalLLaMA | 94 comments
I spent a few hours setting everything up and asked my wife (a frequent ChatGPT user) to help with testing. We're very satisfied so far.
Key specs:
Generation: 46-40 t/s
Context: 20K
Idle power: 2W (≈17.5 kWh/year, around 5 EUR annually)
Generation power: 38W
Hardware:
2021 M1 Pro MacBook Pro, 16GB
45W GaN charger
Power meter
Challenges faced:
Extremely tight model+context fit into 16GB RAM
Avoiding laptop battery degradation in 24/7 plugged mode
Preventing sleep and autoupdates
Accessing the service from everywhere
Tools used:
Battery Toolkit
llama.cpp server
DynDNS
Terminal + SSH (logging into the GUI isn't an option due to the RAM shortage)
Thoughts on gpt-oss:
Very fast and laconic thinking, good instruction following, precise answers in most cases. But sometimes it spits out very strange factual errors I've never seen even in old 8B models; it might be a sign of intentional weight corruption or "fine-tuning" of their commercial o3 with some garbage data.
Havoc_Rider@reddit
What frontend are you using to access the functionality of the model?
Also, Tailscale Funnel can be used to access the service over the public internet.
Vaddieg@reddit (OP)
Open-weight LLMs contain public domain knowledge; there's nothing to secure or encrypt.
Extra-Virus9958@reddit
Don't speak if you don't know.
Vaddieg@reddit (OP)
I speak because I know)
I won't be buying an SSL certificate for a simple home server, nor adding a user DB or access management.
Extra-Virus9958@reddit
Tailscale is free, Cloudflare Tunnel is free, etc.
Vaddieg@reddit (OP)
Overhead. I don't turn on a VPN on my phone just to access Wikipedia or ChatGPT.
Extra-Virus9958@reddit
This isn’t about privacy, it’s about not directly exposing your network ports which creates massive security vulnerabilities.
A top-tier provider like Tailscale or Cloudflare exposes your service through a tunnel, making direct intrusion impossible.
An intrusion could mean:
It’s trivial today to scan entire IP ranges looking for exposed LLM providers. If tomorrow you add tool support or MCP (Model Context Protocol), you’re giving direct access to your system or enabling attack pivoting.
llama.cpp is NOT a hardened web server designed for public exposure. There are countless attack vectors:
It’s your life and your security, but I genuinely don’t understand why you’d refuse FREE secure solutions like Cloudflare Tunnel that even handle authentication via Cloudflare Access.
You’re basically running a development tool as a public-facing service. That’s like using notepad.exe as a production web server. The fact that you haven’t been compromised YET doesn’t mean you’re secure - it just means you haven’t been targeted yet.
Vaddieg@reddit (OP)
Looks like an AI-generated VPN promo. I know how networks work, and what can be accessed from outside and what can't. NO THANK YOU
Secure_Reflection409@reddit
The rest of your home network? Your iPhones, your voice-enabled TV, Alexa, etc.
At least put the LLM on its own VLAN.
Vaddieg@reddit (OP)
The rest of my local network isn't exposed in any way. You would need an RCE exploit for llama.cpp plus an LPE exploit for macOS to reach my network.
Havoc_Rider@reddit
Tailscale is a service and Funnel is one of its sub-services; it can help you access your locally running model from anywhere on the public internet. No need for an SSL cert or a local database. Again, I don't know what frontend you are using, but if you access the local model from a web browser over your local network using device-ip:port, then you can turn on Funnel for that port. You will be given a web address that can be accessed from any device on the internet.
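For what it's worth, a rough sketch of the Funnel route, assuming the llama.cpp server sits on its default port 8080 (the exact CLI form can differ between Tailscale versions, so check tailscale funnel --help):
# join the tailnet once, then expose the local port to the public internet
tailscale up
tailscale funnel 8080
Tailscale then prints a public https address on your tailnet's ts.net domain that proxies to the local server, with TLS handled by Tailscale.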
mobileJay77@reddit
Keep us posted on your use cases!
Vaddieg@reddit (OP)
They are very typical, but instead of chatgpt.com I type my-sweet-home.dyn-dns.net in the browser (address is fake). There is no request limit and the context is almost the same as with a GPT Plus subscription.
Beestinge@reddit
What do you or your family typically ask it? I found it very bad.
Vaddieg@reddit (OP)
My son uses it more like Wikipedia to learn more about the world, my wife plans nutrition and traveling. I summarize texts that contain sensitive/private data and ask practical questions in the programming area.
But gpt-oss is surely weaker than commercial ChatGPT
Beneficial_Tap_6359@reddit
Please do not let your kid trust this as a source of knowledge. Small models especially are absolutely terrible and will hallucinate nonstop without it being obvious to a kid at all!
giantsparklerobot@reddit
Actual Wikipedia is right there! Don't teach your son to trust a sycophantic bullshit token generator. That's a painfully bad idea.
Vaddieg@reddit (OP)
He's capable of using both, lol. What I find to be the absolute evil for children is YouTube, because of its irrelevant, stupid but eye-catching autoplay recommendations.
epyctime@reddit
I hope you've got auth in front of this?
Vaddieg@reddit (OP)
Certainly not. My LLM server has as much private data to steal as Wikipedia.
epyctime@reddit
It's more about the resource consumption 😂 but you do you king
Vaddieg@reddit (OP)
Authentication over plain HTTP is useless anyway, and SSL is overkill. The llama.cpp server supports an API key, but I haven't bothered setting it up.
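For reference, a minimal sketch of that API-key option, reusing the launch command the OP posts further down (the key string is made up; port 8080 is llama-server's default):
llama-server -m models/openai_gpt-oss-20b-MXFP4.gguf -c 20480 --host 0.0.0.0 --api-key "long-random-string"
# clients then have to pass the key as a bearer token, e.g.:
curl http://localhost:8080/v1/models -H "Authorization: Bearer long-random-string"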
epyctime@reddit
These are all silly opinions and you should ask your LLM about it
Vaddieg@reddit (OP)
No I shouldn't. Those shitty LLMs got trained on data that includes my upvoted Stack Overflow replies and GitHub code snippets too.
epyctime@reddit
Even a shitty LLM will regurgitate the opposite of your points. Not even having an API key is legit crazy. I'm port scanning for u right now (jk)
Vaddieg@reddit (OP)
i wish you great success in your efforts)
epyctime@reddit
Perhaps your false confidence is because you think nobody can find your ip/port? https://www.shodan.io/search?query=llama.cpp check this out
Vaddieg@reddit (OP)
nice try buddy
epyctime@reddit
Nice try what brother I'm legit looking out for you 😂😂😂 this is for ollama but it's the exact same thing https://blogs.cisco.com/security/detecting-exposed-llm-servers-shodan-case-study-on-ollama
Vaddieg@reddit (OP)
OK, I will set up an API key and request logging if I find out that someone is using my unique and super-fast 38W server with an OSS model for free. 🤣
epyctime@reddit
I just don't get why you don't put an API key in 🤣 Also, why are you using llama.cpp server over llama-cli if you're only using SSH? That would avoid the entire problem lol
Vaddieg@reddit (OP)
I converted a useless piece of hardware into something my family uses daily, and it costs nearly nothing to run. A Mac mini would have been a much better choice for a headless home server, but I don't have one and I'm not planning on buying one.
Vaddieg@reddit (OP)
It's not. Ollama is a useless wrapper. Looks like your port scanner is broken, though.
epyctime@reddit
I know dude, but open ports are open ports, did you miss my entire point?
What does this even mean, brother... are you confused or something? Shodan scans the internet constantly; the chance of your endpoint being detected is 100%. Not 99%: 100%. It will be discovered.
Vaddieg@reddit (OP)
An open port is just an open port. It gives you zero information about the target system or the service behind it, especially with a non-standard port number.
OK, my service will be discovered. What next?
For a random pundit on the internet there are no obvious benefits over chatgpt.com or deepseek.com. DDoS for fun? Sophisticated targeted attacks against an unknown host?
cms2307@reddit
How do you feel about oss-20b for replacing ChatGPT plus? Personally I use ChatGPT in a similar way to Google so I haven’t tried to make the jump to local only yet.
Vaddieg@reddit (OP)
Local LLMs have already replaced Google Translate for me. And roles/purposes are trivial to customize via the system prompt.
RRO-19@reddit
This is cool. What's the performance like compared to cloud APIs? Curious about the practical tradeoffs for family use - speed vs privacy vs cost.
Vaddieg@reddit (OP)
It's considerably slower but much more useful than free ChatGPT. It's quite private unless I'm accessing my server from public Wi-Fi networks, and it costs you nothing if you already have capable hardware.
PeanutButterApricotS@reddit
I set up llama.cpp but swapped to LM Studio for MLX because every other option didn't work.
I have an M1 Max Studio, so it goes through Open WebUI running in Docker. I prefer it as it has login security and web search. Why don't you use it? Is it because it won't run with the laptop's limited resources?
Vaddieg@reddit (OP)
It's a minimalistic home setup that consumes 2W at the wall and runs on an entry-level M1 Pro with 16GB. I haven't considered Open WebUI because it's so bloated that it would need a dedicated server to run.
PeanutButterApricotS@reddit
Gotcha, I figured, just was curious. It is a resource hog, but it does have its benefits. Glad you found something that works for you.
Sol_Ido@reddit
Unused hardware: 2021 M1 Pro MacBook Pro, 16GB
rjames24000@reddit
Sick, do you have a tutorial or guide for how you set it up? Does it all just run bare metal?
After experimenting, would you change anything to improve responses?
Vaddieg@reddit (OP)
I had a goal and tried solving problems as they emerged. No tutorials or plan.
Here are some random picks:
All quants of gpt-oss are nearly the same file size, so I picked the one marked "recommended" from bartowski. I plan to download something of a similar size from unsloth and compare.
I've given 14 of the 16GB of system RAM to the GPU; macOS can probably survive even on 1GB without a GUI session, but I haven't tried.
The "nohup" command keeps llama-server running after the SSH session disconnects (see the sketch after this list).
Battery Toolkit holds the battery charge and keeps the Mac awake with the lid closed, but it also made my 0.5W idle-power goal unreachable because I can't use sleep + wake on Wi-Fi.
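A minimal sketch of that nohup step, reusing the llama-server command posted further down in the thread (the log file name is just an example):
nohup llama-server -m models/openai_gpt-oss-20b-MXFP4.gguf -c 20480 --host 0.0.0.0 > llama.log 2>&1 &
# nohup plus the trailing & keep the server alive after the SSH session closes
tail -f llama.log   # watch the model load and the server start listening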
jesus359_@reddit
Can I suggest Tailscale to air-tighten that server? Look up the recent announcement about unsafe LLM servers, and another story about how someone was using a redditor's LLM for private use for months and they didn't even know.
Vaddieg@reddit (OP)
There's nothing unsafe about hosting public domain data, and I don't care much if someone else uses it. At 38W it's basically free.
pwrtoppl@reddit
You can get 100 devices and 3 users on Tailscale for free (you could probably use a shared login to expand). jesus359_ is right, get Tailscale. I publicly exposed a Flask port thinking no one would ever care or find it in rural CO; it took only hours before I had unhandled traffic moving around. Thankfully I had taken down my MCP tooling (filesystem and browser with cached creds), or things would have been worse for me.
It takes minutes and can save you so much sanity, and you get remote LAN access, which is a huge plus for network resource access.
jesus359_@reddit
You know, I'm at that point in life where it's like: you're right, you're safe. Then I chuckle when I see a Reddit post saying someone has been in my network for months and I didn't even know it.
It's 2025, not the early 2000s anymore. Please learn about BASIC internet security at least. Might as well not have a password on your Wi-Fi and let us know your credit card numbers too, please.
johnnyXcrane@reddit
How big of a context size can you run with an acceptable speed?
Vaddieg@reddit (OP)
It's capped by the limited memory. 20K is more than the 8K in free ChatGPT, but less than the 32K in ChatGPT Plus.
johnnyXcrane@reddit
Yeah, that's enough for that kind of model IMO anyway.
Chance_Value_Not@reddit
Run Linux on it instead. A shortcut for access anywhere is Tailscale.
Extra-Virus9958@reddit
Why do it? Running Linux on top of that would be a drastic loss of performance
Chance_Value_Not@reddit
I need a citation for that performance claim, but when I read "automatic OS updates" I assumed Windows! 🤦‍♂️ macOS should be fine indeed.
Chance_Value_Not@reddit
I.e. you'd run Asahi Linux directly on bare metal, though admittedly I've not checked GPU support, which is important…
Extra-Virus9958@reddit
Asahi will just degrade performance
Bolt_995@reddit
How did you create a local LLM server on that laptop and how are you accessing it from your phone?
Vaddieg@reddit (OP)
It has a public internet address issued by a DynDNS service (I use a free one). My home router does port forwarding to make sure the laptop receives the LLM requests.
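As a sanity check for this kind of setup, something like the following should work from any outside network, assuming the router forwards to llama-server's default port 8080 (the hostname is the fake one from the post):
# llama-server exposes a /health endpoint; a small JSON reply means DynDNS + forwarding work
curl http://my-sweet-home.dyn-dns.net:8080/health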
redule26@reddit
nice thanks for sharing!
rorowhat@reddit
I did the same, but I removed the battery and doubled the RAM to 32GB for a few bucks. Can't do that on a Mac though, unfortunately.
Vaddieg@reddit (OP)
Can't get 2W idle on any other laptop.
OcelotMadness@reddit
The Snapdragon X elite can do that and LM Studio supports it. Probably not the most cost effective though.
StellanWay@reddit
Snapdragon 8 Elite too, can use the same OpenCL backend.
Final_Wheel_7486@reddit
I would like to introduce you to my shitty 2014 Intel Core i5 7200U laptop with conservative power governor (draws 1.9 W at idle and is shit at everything else)
rorowhat@reddit
I get about 4W but that's nothing. It will cost me $5.6 a year
sexytimeforwife@reddit
It's very cool that you're doing this... I'd love to set up something like what you've done for us... but... is it actually any good??
In my test of gpt-oss in LM Studio... it appears to be completely lobotomized.
I feel like I'm using it wrong. What sorts of things do you guys use it for?
shittyfellow@reddit
I've seen this a bunch, but I actually have the opposite experience. I get good results with the 120b model as long as I don't trigger its super sensitive censorship.
I do get better results with DeepSeek-R1-0528-UD-IQ1_S or GLM-4.5-Air-UD-Q4_K_XL but the 120b gpt-oss has been more than serviceable.
Zealousideal_Nail288@reddit
Pretty sure they are all talking about the smaller 20B model here, not the 120B.
robberviet@reddit
Show us the parameters please. And to anyone posting like this: please, I am very interested in llama.cpp parameters. Setting up the infra is easy, and easy to Google too, but the params are rarely shared.
Vaddieg@reddit (OP)
All default except for context size; I'm surprised how well it works. I want to try the --mlock argument to improve time-to-first-token performance.
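If it helps, a sketch of what that would look like on top of the launch command from the end of the thread (--mlock asks the OS to keep the model weights pinned in RAM so the first token after an idle period doesn't wait on pages being re-read; on a 16GB machine it's worth checking that the system tolerates it):
llama-server -m models/openai_gpt-oss-20b-MXFP4.gguf -c 20480 --host 0.0.0.0 --mlock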
Handiness7915@reddit
Ah... at first, seeing "unused laptop", I assumed it was a very old laptop, until I saw M1 Pro 16GB. WTF, I'm still using an M1 16GB as my main laptop.
ScienceEconomy2441@reddit
What inference engine did you use to run it? How are you sending requests to it? Are you using a third party tool or directly hitting the v1/chat/completions endpoint?
Vaddieg@reddit (OP)
I use the embedded web UI of the llama.cpp server. It's not polished, but it's very lightweight and functional.
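For reference, the same llama-server also answers OpenAI-style requests alongside the web UI, so the endpoint mentioned above can be hit directly; a minimal sketch, assuming the default port 8080 and a made-up system prompt:
# the system message is how roles/purposes get customized per request
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"You are a concise family assistant."},{"role":"user","content":"Plan a simple three-day trip to Lisbon."}]}'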
ScienceEconomy2441@reddit
Oh interesting, I didn't know llama.cpp had that.
I have a hunch that gpt-oss-20b is a great base model onto which they threw instruction/tooling capabilities at the end.
I'm trying to build a framework to see if that's true. Not sure if you have any experience with getting the model to complete statements vs. follow instructions / call tools.
My thoughts are purely skeptical; I'm trying to build a framework to find out whether I'm right or wrong.
g19fanatic@reddit
Use sigoden/aichat together with sigoden/llm-functions as a framework. Super easy to get up and running with any model/backend. It even has a frontend that is quite serviceable.
Only_Comfortable_224@reddit
Did you set up web search somehow? I think the LLM's knowledge is usually not enough.
Equivalent_Cut_5845@reddit
Do you have web search, though?
HeWhoRoams@reddit
I've got an oldish laptop with a GPU, but not a Mac. I'm curious if there's an image or OS that would be optimal to run to accomplish this? Feels like overkill to install a full OS if I just want the hardware to be doing this. Really cool idea and now I'm inspired to put this old laptop to use.
Consumerbot37427@reddit
Do you have flash attention enabled? If not, there's a good speed boost to be attained.
~~Battery Toolkit is perfect for your use case. Just set it to stay at 50%.~~ I see you already figured that out!
I hope to do something similar. At the moment, I have Home Assistant available from anywhere via Cloudflare Tunnel. It should be fairly simple to do the same on macOS. This is possible even without port forwarding on the router.
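For anyone wanting to try that route with llama.cpp, a minimal sketch using Cloudflare's throwaway quick tunnels (assumes the default port 8080; a named tunnel plus Cloudflare Access is the more permanent variant mentioned earlier in the thread):
# outbound-only connection, no port forwarding needed; prints a random *.trycloudflare.com URL
cloudflared tunnel --url http://localhost:8080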
cristoper@reddit
Note that recent versions of llama.cpp will try to automatically turn on flash attention:
https://github.com/ggml-org/llama.cpp/commit/e81b8e4b7f5ab870836fad26d154a7507b341b36
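On builds older than that commit it can still be forced explicitly; a sketch (the flag's form has changed over time, so verify against llama-server --help for your build):
# older builds use a plain toggle:
llama-server -m models/openai_gpt-oss-20b-MXFP4.gguf -c 20480 --host 0.0.0.0 --flash-attn
# newer builds take a value instead, e.g. --flash-attn on (default is auto)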
Vozer_bros@reddit
Yo, that means if I have an M1 Pro with a broken monitor + 64GB RAM => turning it into a homelab LLM host is a good idea.
MarathonHampster@reddit
Very fast even hosting on that machine? Seems worth it to set up a "family" GPT server and own all your conversations.
beedunc@reddit
Interesting, thanks!
dadgam3r@reddit
Ooooh, makes sense.
dadgam3r@reddit
40 t/s on an M1 16GB? How did you do that? My M1 struggles to generate 8 t/s using Ollama... or LM Studio. How did you manage that?
superbv9@reddit
OP has an M1 Pro
zzrscbi@reddit
Just bought a Mac mini M4 16GB for some local usage of 8B and 12B LLMs. How did you manage to load the whole 20B model with 16GB?
BlueSwordM@reddit
By default, gpt-oss models are natively quantized down to MXFP4, around 4-bit quantization.
That takes the model's RAM footprint at load time (without context) from about 20GB in 8-bit to around 12GB.
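Rough back-of-envelope, assuming about 21B total parameters: 21e9 × 8 bits ≈ 21GB for an 8-bit copy, versus 21e9 × ~4.25 bits (MXFP4 payload plus block scales, with a few layers kept in higher precision) ≈ 11-12GB, which matches the sizes above.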
Vaddieg@reddit (OP)
Increased the VRAM limit to 14GB:
sudo sysctl iogpu.wired_limit_mb=14336
Gave up the GUI session and connected over SSH to start the llama server:
llama-server -m models/openai_gpt-oss-20b-MXFP4.gguf -c 20480 --host 0.0.0.0
jarec707@reddit
You could also run AnythingLLM on your client computers to access the server. 3sparkschat will access the server from iOS.
Vaddieg@reddit (OP)
I just use the browser to access my server's web frontend. The most trivial thing.
Professional-Bear857@reddit
Did you quantize your KV cache to Q8 to give yourself more room for context? Also, maybe try updating llama-server if you're getting strange behaviour.
Vaddieg@reddit (OP)
I run it with the default KV cache sizes for now. I think it would be possible to squeeze in 22-24K by playing with iogpu.wired_limit_mb.
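A hedged sketch of the Q8 KV-cache idea from the question above, layered on the launch command posted earlier (flag spellings as in current llama.cpp; the larger -c value is only an illustration of what the freed memory might allow, and quantizing the V cache generally needs flash attention active):
llama-server -m models/openai_gpt-oss-20b-MXFP4.gguf -c 24576 --host 0.0.0.0 --cache-type-k q8_0 --cache-type-v q8_0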