what’s actually stopping an insider from leaking model weights?
Posted by itsArmanJr@reddit | LocalLLaMA | 95 comments
this is a dumb question. what are the actual technical barriers stopping an engineer at a place like openai or anthropic from just exporting flagship weights and leaking them? yes NDAs exist, but since llms are more self-contained and portable than traditional enterprise software, to me it seems like exfiltrating them would be relatively easier compared to other closed-source stacks. why hasn't this happened more? (i think the original llama was actually leaked)
rpkarma@reddit
They're pretty big, and big corps track everything you do on your PC. I can't even plug in USB drives or anything into my laptop without IT Security knowing.
arstarsta@reddit
It depends. I worked at big tech and was allowed to install Linux completely unmanaged with root to do AI projects.
Firm-Fix-5946@reddit
but what kind of access did you have to other machines and other data from that instance? could you ssh into some storage system and get copies of weights?
arstarsta@reddit
It wasn't an LLM company. But I got the data needed for training the models. Basically just a service account key and then fetch from BigQuery.
awesomeunboxer@reddit
Well why didn't you leak it then! 😡
Due-Memory-6957@reddit
Don't be mad, maybe he's the guy that leaked Mistral 70b back then
robertpro01@reddit
Also, I don't think they have the closed weights on their work computers, it's not like code, where you need access to edit it.
SnackerSnick@reddit
I think this is the answer. I was an engineer at Google, MS, and Amazon and I never heard that plugging a thumb drive into my work computer was a problem (unless it was a thumb drive you picked up in the parking lot...)
balder1993@reddit
I work at a bank and it’s strictly forbidden and disabled somehow. Even copy pasting anything in the browser takes many dozens of seconds to “check for anything funny”.
SidneyFong@reddit
Banks are much more locked down than the typical Big Tech company.
Imagine Microsoft locking down the computers of their Windows developers and telling them they can't use their thumb drives due to corporate policy, and don't even think about installing a beta version of Windows (What do you mean you can't do your work without Visual Studio? :P ). I mean, realistically, that's not going to happen.
The typical banker uses computers to move data around in an excel spreadsheet or something.
Tairc@reddit
I work for MSFT, on sensitive projects. I once formatted a company secure USB drive and used it on my corp PC. Less than 30 minutes later I had corpo security on the phone asking why I had used an unauthorized USB device. Apparently there was some sort of low level file/thing done when they commissioned each drive to track them.
Not true for every system, but for sensitive ones? They have ways. Especially MSFT - if you know the domain policy settings well enough, they’re all in there.
Mundane_Ad8936@reddit
"They have ways"
Yes, they use the features that have been built into the OS for the past 18 years.. it's a checkbox in Windows Group Policy
Tairc@reddit
That’s … literally what I said. I’m not sure what you’re getting at.
Mundane_Ad8936@reddit
This is how we know you've never worked for a big tech firm..
It's virtual machines and containers, sandboxed just like everyone else.
SnackerSnick@reddit
The back end, sure. But most engineers have work laptops and/or desktops, and they have copies of the code for at least the systems they're working on locally, and it's trivial to pull the code for other systems that aren't super-secret.
At Google and Amazon, it's mostly one repo that every engineer has access to. At MS it was more fragmented.
CorpusculantCortex@reddit
I work for a non financial software company, USB drives are locked down for us unless company provided. Nothing we work on is particularly high stakes apart from obvious tech ip. It absolutely happens and is a common feature in ms managed systems. What you are saying is just inaccurate. If it is not necessary it is both an attack vector and a leak risk and any decent infosec team will lock it down.
ANR2ME@reddit
I used to work at an e-commerce company, and they also kept track (using a security agent that had to be installed) of any files taken off the office PC (or laptop), including files uploaded to the internet, and blocked the transfer if I didn't have the rights to take the files out.
rpkarma@reddit
It’s not a problem, but they know you did it. That’s the point; they watch for data exfiltration.
SnackerSnick@reddit
I guess I'm just surprised that in my ten years between the three I never heard anyone bring it up. Someone clever could get the info anyway.
rpkarma@reddit
I dunno what to tell you, it’s extremely common and a basic feature of MDM shrug
SnackerSnick@reddit
I wouldn't be surprised if they tried to monitor, but it's essentially impossible to keep someone who knows what they're doing from copying information they have access to.
If nothing else I "tar cf -" and capture the monitor output.
itsArmanJr@reddit (OP)
i get how strict IT can be, but adding too many security layers and manual access requests inevitably kills developer velocity. if you over-engineer the friction, you sacrifice the team's ability to experiment and ship.
maybe the real question is how do these high-growth labs maintain ironclad security without creating single-point-of-failure loopholes, all while keeping that breakneck startup pace
mtmttuan@reddit
A 1B open LLM in F16 is about 2-3GB on disk (2 bytes per parameter plus overhead), so a 1T LLM would be about 2-3TB. If you can transfer that much data to an untrusted site, then your company's IT sec team should be fired.
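For anyone who wants to check the arithmetic above, here is a minimal sketch. It assumes raw weights only (no file metadata, optimizer state, or sharding overhead), and the helper names `weights_size_bytes` and `human` are made up for illustration.

```python
# Rough size of raw model weights on disk, ignoring metadata and overhead.
# BYTES_PER_PARAM and both helpers are hypothetical names for this sketch.
BYTES_PER_PARAM = {"f32": 4, "f16": 2, "bf16": 2, "int8": 1}


def weights_size_bytes(n_params: int, dtype: str = "f16") -> int:
    """Return the number of bytes needed to store n_params values of dtype."""
    return n_params * BYTES_PER_PARAM[dtype]


def human(n_bytes: float) -> str:
    """Format a byte count using decimal (1000-based) units."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if n_bytes < 1000 or unit == "PB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


if __name__ == "__main__":
    for label, n in (("1B", 10**9), ("1T", 10**12)):
        print(label, "params in f16:", human(weights_size_bytes(n)))
```

So the lower end of the estimate is just 2 bytes per parameter; anything above that is container overhead or extra tensors.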
jazir55@reddit
https://rollingout.com/2026/04/08/chinas-supercomputers-lost-10-petabytes/
Hopefully they have better security than China's Supercomputing Center
d1722825@reddit
About 100TB of data was stolen from Sony Pictures without anyone noticing it in 2014.
mtmttuan@reddit
Well yeah their security team should be fired if they have that kind of team. And the fact that data leaks happened before doesn't mean it will happen again.
ksh_osaka@reddit
Yeah, but just the size isn't a good metric to measure. If you run a tax office and someone transfers the size of your entire database that's fishy, yes.
But if you run a multibillion dollar company in the movie business or are doing stuff with LLM research, I think it's pretty normal that even average users will jiggle around a couple of terabytes from time to time...
mindwip@reddit
Yeah, it's not always the security team's fault. A lot of the time it's the C-suite's fault. Why do we have to spend money to protect data when we've never had a data breach?
No no we need these legacy xp and nt and they need internet access for bank transfers.
Yes payroll file to payroll company over ftp is fine it's point to point!
We have 1 security guy and we passed our security audit, I am sure our 10,000 systems are protected?
Siems cost how much? Just use open source siem.
Stop scanning the network for vulnerabilities, you're crashing old network and printer devices that can't handle default password scanning.
We don't need to pay for ERP support, the application is running fine now. No updates since 2015, no issues.
50-3@reddit
Locking model weights behind PAM won’t affect developer velocity in the slightest. Nobody is manually changing or reading them. While the weights are the secret sauce, they're largely not human readable, and processes exist to iterate on them without needing personal access to the files.
Torodaddy@reddit
They manage the network ingress and egress points using packet inspection, dns blocking/tracking, protocol filtering, encryption, just layers on layers so inside the walls you can move around data but nothing leaves the network
National_Meeting_749@reddit
Because model weights aren't something most of their developers need access to, to do anything. Running inference is the valuable thing, not the actual weights.
The few, and we're talking maybe 10+ people that need it daily, that do have access are under NDAs so thick, and paid so extremely well, that the amount of country fleeing you would have to do to not have your life ruined isn't worth it when just... Not doing that will have you living better than kings could have imagined a hundred years ago.
itsArmanJr@reddit (OP)
that was a very poetic (and true) way to put it sir
cmdr-William-Riker@reddit
The bigger reason is probably stable income. If the company I worked for made anything worth leaking I could definitely leak it without their knowledge pretty easily. If I wasn't discovered, then I'd have to deal with the corporate fallout and drama and increased security which would be a major headache at the very least, and if I was discovered, then I'd be out of a job at the very least, possibly in jail or broke pretty quickly
AstroZombie138@reddit
+ never let anyone have all of the data unless you have to, and use things like fully homomorphic encryption
Few_Water_1457@reddit
NDA?
Betadoggo_@reddit
The risks are just too high for anyone to bother. If you get caught you lose a high paying job, get blacklisted from the industry, and probably get sued over damages. In the case of llama 1, it was released quite publicly to researchers, then some of those researchers shared it elsewhere, so it wasn't really a leak.
az226@reddit
Mythos checkpoint would be a gift to humanity.
Virtamancer@reddit
Mythos is unironically just non-lobotomized Opus 4.7 or the actually big model that they distill Opus from.
Dudensen@reddit
You've been one-shotted by Mythos marketing
Monkey_1505@reddit
I'm not so sure. It's apparently bad at instruction following. It's probably a mess/overrated, and that's the real reason they aren't releasing it.
Spepsium@reddit
It could genuinely just be too expensive to run within reason.
Monkey_1505@reddit
I think that's probably true. But their paper reports the model will refuse to do some tasks unless carefully prompted, not out of any safety reason, but because it finds some tasks uninteresting. This suggests to me that their model is both useful and kind of batshit as an actual product for use.
portmanteaudition@reddit
Almost assuredly criminal charges too! Leaking trade secrets etc. is often a very bad idea, especially when the firm knows with certainty you did it.
PhlarnogularMaqulezi@reddit
I've always wondered, for the past 25+ years, how every single development build of Windows manages to find its way onto the Internet. I remember downloading multiple alphas/betas of Whistler (XP), Longhorn (Vista), and pre-RTM 8 when I was a kid/young adult.
Someone pointed out how corporate IT has lots of tracking software, so that was likely not as good back then.
Maybe it was someone at one of their OEM partners, but I definitely recall some builds having "Shhh, let's not leak our hard work" in the corner on the desktop.
edankwan@reddit
I like Open Source. But I don't think it is a good mindset. It is like people who only watch downloaded pirated movies coming out and saying that people in the movie industry should leak the masters of the films. Dude... It will ruin people's livelihood.
segmond@reddit
why don't you leak your family's private info on the internet?
perhaps even that annoying family member. why don't you put their name, social security number, credit card number and other PII data online? it would be relatively easy for you to do so.
Pleasant-Shallot-707@reddit
Not wanting to lose their job and be blackballed? Not wanting to be arrested and charged with a computer fraud and abuse violation?
a_beautiful_rhind@reddit
Too big and probably in some proprietary format. It would mainly help adversaries and competitors. Only way I see it happening is if someone rage quits.
Monkey_1505@reddit
What's the real benefit of doing that with current models?
They'll all be out of date in a year. Heck, in a year, there will be better open source.
Mundane_Ad8936@reddit
Guess the OP doesn't understand that corporate networks are locked down and your activity is tracked. Even if you have access there's no way to transfer it out.
SmashShock@reddit
Many large files on a filesystem that logs every action.
fuck_cis_shit@reddit
most lab employees have no direct access to model weights, for one thing. only a few folks directly involved in training them, and you can bet everything they do is carefully monitored
the original llama wasn't actually "leaked" in the sense of "Meta didn't want it released". it was already out in the wild to hundreds of academic users
az226@reddit
And something like Mythos is 6TB.
_bones__@reddit
SanDisk has an 8TB micro SD card. They could do a full search of your belongings, but you could hide that under your foreskin.
Hmm, maybe that's why the closed source models are US based... /s
az226@reddit
That’s assuming the OS isn’t locked down on file transfers to external drives. Big if.
MadGenderScientist@reddit
also meta lowkey wanted it leaked
portmanteaudition@reddit
Probably giga-encrypted even if you do!
Expensive-Paint-9490@reddit
Because it is impossible to do it anonymously, if the IT department knows its job. When Miqu was leaked, Mistral was able to track the leaker in no time.
Leaking proprietary tech would be a serious crime in the USA. I think.
__JockY__@reddit
I’d invert your question: what could possibly motivate someone to take such a reckless and self-destructive act?
Not much, I reckon.
DismalIngenuity4604@reddit
Police and jail.
Laoweek@reddit
People tend to not want to go to prison??
Ferret_Faama@reddit
I think people outside of these companies greatly underestimate that these are highly sought after and well paid positions. There are few people willing to throw it all away for virtually no benefit other than internet cred. I'm not saying they don't exist at all, but it is very rare.
YourVelourFog@reddit
Even less so for AI. I can think of a bunch of different fields where the correct information is worth a lot to others. High frequency trading models, secret formulas for various brands, the internal specifications for defense products (radars, planes, warships, etc).
The list goes on and on. The weights to some AI company's LLM are pretty far down on the list for me.
CrowdGoesWildWoooo@reddit
The people having access to these weights are currently on 8 or even 9 figure salaries. It would be pretty darn stupid to just throw it away.
howardhus@reddit
Jimmy In n Out
YourVelourFog@reddit
Why don't engineers release code from high profile companies that make loads of money? Why doesn't a HFT engineer release their internal code or sell it to another company for money?
Because people just want a good job and to be comfortable. Why get fired and spend years in prison for other people who you don't know?
Square-Hornet-937@reddit
Probably the same reason why most people don’t just go steal random things from work.
HumbleRhino@reddit
Didn't they suicide a guy already connected to ai
FreQRiDeR@reddit
Well for one Mythos is a 10T parameter model. As in TEN TRILLION! That’s not gunna fit on a floppy!
TechnoByte_@reddit
Mythos is a bunch of marketing hype
https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier
DelKarasique@reddit
Lose your job and go to prison just to release weights for a model that will become obsolete in a year?
t4a8945@reddit
I'm more surprised we don't even have basic stats leaked. Architecture / model size / quants used. That's basically harmless info.
yensteel@reddit
One way to ensure that data is "hard to leak" is to use VMware Horizon on a corporate laptop. The laptop has a rotating password that one must change every period.
No USB or other peripherals are allowed. Only HID.
The laptop must be scanned by authorized software every period.
The laptop only serves to be a tunnel to a VM that the company controls. Every action and access is logged.
There are at least three problems with this approach, and I won't disclose them here because I really don't want to cause any issues. VM tunneling solutions are only precautionary measures, not foolproof.
Ferret_Faama@reddit
You're not wrong, but this type of system is not often used for developers because it's such a pain in the ass. It's usually not technical measures that actually stop it from happening, but legal ones.
Wubbywub@reddit
firstly by having their own companies pay them enough to not want to go against it? (and company vestings that are terminated if they break contract)
twnznz@reddit
Probably a few things, e.g.
- no open Internet for inference systems or systems storing models,
- inspection gateways between those systems and Internet,
- jumphosts between staff systems and inference systems with monitoring,
- no storage path between staff systems and inference systems (scp/sftp disabled etc),
- EMS and monitoring of staff endpoints,
- data exfiltration detection on Internet paths,
- AI on top of all that with good SIEM and SOAR to automatically lock out suspicious activity
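The "data exfiltration detection" item in the list above could be sketched roughly like this. It is only an illustration of a per-user baseline check, assuming you already have daily egress totals per user; the function name `is_anomalous` and the z-score threshold are hypothetical, and real SIEM tooling is far more involved.

```python
from statistics import mean, pstdev


def is_anomalous(history_gb: list[float], today_gb: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag today's egress if it sits far above this user's own baseline.

    history_gb: previous daily egress totals for one user, in GB.
    today_gb:   today's egress total for the same user, in GB.
    """
    mu = mean(history_gb)
    sigma = pstdev(history_gb)
    if sigma == 0:
        # Perfectly flat history: any increase at all stands out.
        return today_gb > mu
    return (today_gb - mu) / sigma > z_threshold


if __name__ == "__main__":
    usual = [1.0, 1.2, 0.8, 1.1, 0.9]
    print(is_anomalous(usual, 1.1))     # an ordinary day
    print(is_anomalous(usual, 2000.0))  # roughly a multi-TB model walking out
```

Comparing each user against their own history, rather than one global size limit, is what lets a check like this cope with teams that legitimately move terabytes around.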
infinitelylarge@reddit
The big labs do use take home laptops, but for almost all employees, those laptops do not have access to the portion of the datacenter network where weights are stored. For those few who do have access to that portion of the network, the bandwidth on their VPN connections to the datacenter is tracked. Downloading a 1T parameter model is not hard to see (or automatically detect) in a network bandwidth graph. So only a tiny number of well vetted people can even reach the weights, and those people know that their network traffic is being watched and they would immediately be caught if they downloaded a huge model.
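To put a number on why a download like that stands out in a bandwidth graph, here is a back-of-the-envelope sketch. The 2 TB size and the link speeds are assumptions for illustration, not figures from any lab, and `transfer_hours` is a made-up helper.

```python
def transfer_hours(size_bytes: float, link_mbps: float) -> float:
    """Hours needed to move size_bytes over a link sustained at link_mbps.

    link_mbps is in megabits per second, so divide by 8 to get bytes.
    """
    bytes_per_second = link_mbps * 1_000_000 / 8
    return size_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    two_tb = 2e12  # assumed size of a ~1T parameter model in f16
    for mbps in (100, 1000):
        hours = transfer_hours(two_tb, mbps)
        print(f"{mbps} Mbit/s: about {hours:.1f} hours of saturated link")
```

Even on a fully saturated gigabit connection that is several hours of sustained traffic from one user, which is hard to hide from anyone who is looking.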
AnotherBrock@reddit
stay in a company that pays you >100k a year, you can potentially get stock options, company goes public and they're now worth millions
there's probably a lot of incentive not to leak the weights
Late-Abies-25@reddit
dumbass post delete ts
Minato_the_legend@reddit
What are they going to do? Send a 1 trillion parameter model over whatsapp?
Zyj@reddit
I believe the weights get stolen sometimes, the people who do it just don’t publicize them, they sell them instead.
LeRobber@reddit
I think that DID happen. IIRC there was a mistral leak years ago?
Anthonyg5005@reddit
Yeah, from what I read I think it wasn't anyone at mistral. It was someone from a different company they gave the model to for private inference
jax_cooper@reddit
the whole Local LLM thing started this way with meta Llama 1
Anthonyg5005@reddit
They'd probably easily get caught through logs and get in trouble. Llama wasn't leaked, it was public but required you to request access. The reuploads to hf are what made it get more traction
Fine_League311@reddit
A child's question?
GatePorters@reddit
The Danversittle. They release it on anyone who leaks. It can smell deception. The Greeks summoned it to find Judas. This is also how they found Snowden, but the UN passed a vote to veto the Danversittle’s extradition because then Putin would have retaliated by invading UKR.
Now that they are at war with UKR already, there is no legislative body that can stop the Danversittle. And that makes a huge difference.
Also, I’m pretty sure they would face HEAVY criminal charges even if the Danversittle ends up being another hoax.
Enough_Big4191@reddit
not a dumb question, but it’s a lot harder than it sounds in practice. weights at that scale aren’t a single file u can just download, they’re sharded, access controlled, and usually behind pretty strict infra boundaries. also most places assume insider risk, so there’s heavy logging, access scoping, and anomaly detection around large transfers. someone could try, but it’s high friction and very detectable, not a quiet copy to a usb situation.
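Since the weights are sharded, one simple signal is how much of a model any single account has touched. A minimal sketch, assuming an access log of `(user, shard_id)` pairs; the log format and the `shard_coverage` name are hypothetical, not how any particular lab does it.

```python
from collections import defaultdict


def shard_coverage(access_log: list[tuple[str, int]],
                   total_shards: int) -> dict[str, float]:
    """Return, per user, the fraction of distinct shards they have read."""
    seen: dict[str, set[int]] = defaultdict(set)
    for user, shard_id in access_log:
        seen[user].add(shard_id)  # repeated reads of one shard count once
    return {user: len(shards) / total_shards for user, shards in seen.items()}


if __name__ == "__main__":
    log = [("alice", 0), ("alice", 1), ("alice", 1), ("bob", 0)]
    for user, frac in shard_coverage(log, total_shards=4).items():
        print(f"{user} has read {frac:.0%} of the shards")
```

A training job reading every shard is expected; a human account creeping toward 100% coverage is the kind of thing access scoping and anomaly detection are meant to catch.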
StewPorkRice@reddit
bro why...
they are all waiting to become multi millionaires..
EbbNorth7735@reddit
Everything would be cloud based. They likely can't download model weights. Google and Amazon host the models but again those would be under tight and strict access controls.
Torodaddy@reddit
Got any idea how large those models are? We're talking likely a TB at least, you can't exfiltrate that out of a corporate network without alarms going off
fallingdowndizzyvr@reddit
Jail.
It wasn't a real leak. Meta allowed people to have it, you just had to register. But someone released it into the wild. Nothing has really changed. Meta still had people register and provide an email with every version they released. Then others would reupload it to "leak" it so that people could download it without IDing themselves to Meta.
Serprotease@reddit
If we trust the few, not very trustworthy bits of information we have, these models are in the 1-3T range. So a few TB each. It’s quite unlikely that they are just randomly sitting on someone's laptop, and you would need a hefty SSD + enclosure just to copy them off. Over the internet, I do hope they have systems to catch TBs of data moving outside the company.
But even on top of that, I’m pretty sure that not one llama.cpp or vllm maintainer will merge any PR to even run those weights. Something kinda similar happened quite some time ago in the SD space with automatic1111. He was more or less blacklisted by the other actors.
Lastly, that’s a lot of risk for weights that will be obsolete within less than a year. Does anyone still use sonnet 3.5, gpt4 mini or Qwen2.5 72b (aside from clueless vibe coded apps)?
Dismal-Effect-1914@reddit
If it were me i'd wait until we actually had an ASI level model then leak it and peace.
o5mfiHTNsH748KVq@reddit
The big models just stay in the cloud and aren’t necessarily trained or downloaded locally.
Small models are basically protected by trust and the threat of legal action, just like at any other company.