Are people actually comfortable putting sensitive documents into AI tools?
Posted by Ok_Assistant_1833@reddit | LocalLLaMA | 61 comments
I’ve been thinking about this quite a bit recently.
In enterprise environments, there’s a strong emphasis on things like:
- data governance
- access control
- auditability
- compliance
There are entire systems built to make sure sensitive information is handled carefully.
But outside of those environments, we seem to do the exact opposite.
It’s become pretty normal to paste things like:
- financial documents
- client information
- internal notes
- personal data
…into AI tools that we don’t really control.
This feels like a contradiction.
AI systems today are optimized for:
- speed
- convenience
- ease of use
—not necessarily for control, verifiability, or ownership of data.
I’m curious how others here think about this:
- Do you treat AI tools as “safe enough” for sensitive information?
- Or do you avoid using them for anything confidential?
Where do you personally draw the line?
takoulseum@reddit
Great discussion! I think the key insight is that local LLMs give us control, but we still need to think carefully about access policies and context management. It's not just about where the data is processed, but how it's governed.
belkezo@reddit
We actually tackled this at work by classifying everything before it ever gets near an AI pipeline: the tool flags what's PII or PHI upfront, and that stuff just doesn't get fed in at all. It's not a perfect system, but it cut down the "oops, someone pasted client data" incidents pretty dramatically once people could see what was actually sensitive vs. what they assumed was fine.
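A minimal sketch of that classify-then-gate idea, assuming a simple regex approach (the patterns are illustrative only; production setups use NER-based tools such as Microsoft Presidio or trained classifiers):

```python
import re

# Toy PII detectors. Real deployments need far broader coverage
# (names, addresses, medical record numbers, etc.).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def flag_pii(text: str) -> dict[str, list[str]]:
    """Return every PII match found, keyed by category."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

def gate(text: str) -> str:
    """Refuse to pass text downstream if anything sensitive was flagged."""
    hits = flag_pii(text)
    if hits:
        raise ValueError(f"blocked: contains {sorted(hits)}")
    return text

print(flag_pii("Reach me at alice@example.com or 555-867-5309"))
```

The point of flagging before the pipeline (rather than inside it) is that the sensitive text never reaches the model, the prompt log, or any third-party API.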
lutian@reddit
aws solves this (hipaa, soc2 etc.), though their ec2 is freakin expensive
i run a legaltech on hetzner, mostly for demo purposes. when a client is ready, we go all-in on aws with their own web domain etc. for the same resources, ec2 is freakin' 10x more expensive than hetzner, but it iz waht it iz. we're also using their private claude integration on bedrock, and an embedder, so we're gated from all angles
Ok_Assistant_1833@reddit (OP)
Thank you! That’s helpful!
Miriel_z@reddit
An online AI once informed me that I made a big mistake by forgetting to remove some keys from debugged code, and that I should change them ASAP. Online and private are just on opposite sides. Only local, without telemetry, for sensitive information.
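That leaked-keys scenario is exactly what secret scanners catch before code (or a paste of it) leaves your machine. A toy sketch of the idea; real tools like gitleaks or trufflehog ship far larger rule sets:

```python
import re

# Illustrative patterns only: one vendor-specific format (AWS access key
# IDs) and one generic "key = '...'" assignment.
SECRET_PATTERNS = [
    ("aws_access_key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("generic_key",    re.compile(r"(?i)(api|secret)_?key\s*=\s*['\"][^'\"]{8,}['\"]")),
]

def scan(source: str) -> list[str]:
    """Return the names of every secret pattern that matches the source."""
    return [name for name, pat in SECRET_PATTERNS if pat.search(source)]

code = 'API_KEY = "sk-live-0123456789abcdef"'
print(scan(code))  # flags the hardcoded key before it goes anywhere
```

Running something like this as a pre-commit hook (or before pasting into any chat) is cheap insurance, local or not.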
Ok_Assistant_1833@reddit (OP)
Me too once.
putrasherni@reddit
local AI is as safe as putting files on a local disk
Ok_Assistant_1833@reddit (OP)
:) thanks! new to it, learning!
ttkciar@reddit
Not here! A big point of local inference is that we really control our LLM infrastructure!
Valuable-Question706@reddit
Only those who actually audit their setups, control them, and properly secure them.
This does not include those who run `curl whatever.example.com | sh`, those who download stuff from unknown folks and just run it with admin permissions without a sandbox, etc. But it's all sooooo boring, why even bother at all? /s
ttkciar@reddit
People who make bad decisions will reap the consequences of those decisions.
One of the reasons this subreddit exists is to help people make better decisions about local LLM technology.
Maybe to facilitate that we should publish some sort of local LLM infra best practices document?
Ok_Assistant_1833@reddit (OP)
People are also learning. Sometimes, by making "bad" choices first!
Ok_Assistant_1833@reddit (OP)
That’s actually one of the strongest arguments for local setups.
Full control is a big shift compared to cloud tools.
I guess the open question for me is:
→ how many people realistically operate at that level of control vs using prebuilt tools with defaults?
ttkciar@reddit
If there is a human behind that LLM-generated comment, please realize that your bot is posting to LocalLLaMA, where most of the regulars have been operating at that level of control for years. It's literally how the subreddit started, and what it's still (supposed to be) about.
Ok_Assistant_1833@reddit (OP)
This is not posted by a bot!
jacek2023@reddit
Welcome to LocalLLaMA
Ok_Assistant_1833@reddit (OP)
:) thanks! new to it, learning!
LienniTa@reddit
you are literally in locallama
Ok_Assistant_1833@reddit (OP)
:)
Awwtifishal@reddit
You're in r/LocalLLaMA. I use local LLMs in my own machines with open source software that has no telemetry or anything. I control all of it, and my data goes nowhere. I can tell it all my deepest secrets, and no one else outside can possibly know in any way, shape or form.
Ok_Assistant_1833@reddit (OP)
Nice!
razrcallahan@reddit
Enterprise teams generally split into two camps on this: self-host everything (local models as the "data never leaves" answer) or try to govern cloud usage with proxy/DLP. Neither is clean.
Self-hosting solves data sovereignty but doesn't address prompt injection or who's allowed to query what data. You can run Llama locally and still have an employee using it to exfiltrate HR records if there are no access controls on what documents it can see.
Cloud governance is easier to manage centrally but you're dependent on the provider's DPA actually holding up. For genuinely sensitive stuff (patient records, legal documents, IP) I'd lean local, but with an actual policy layer on top, not just "it's on-prem so we're fine."
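The "actual policy layer on top" point can be made concrete: check who is asking before retrieval ever decides which documents the model sees. A rough sketch with hypothetical names (this is one possible shape of a document-level access check, not any specific product's API):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str
    allowed_roles: frozenset[str]  # roles permitted to see this document

def retrieve_for_user(query: str, user_roles: set[str], corpus: list[Doc]) -> list[Doc]:
    """Only documents the caller's roles permit ever reach the prompt."""
    visible = [d for d in corpus if user_roles & d.allowed_roles]
    # ...rank `visible` against `query` with your embedder of choice...
    return visible

corpus = [
    Doc("hr-001",  "salary bands",  frozenset({"hr"})),
    Doc("eng-042", "deploy runbook", frozenset({"hr", "eng"})),
]
print([d.doc_id for d in retrieve_for_user("runbook", {"eng"}, corpus)])
```

The key design choice: the filter runs *before* ranking, so an engineer querying a locally hosted Llama never gets HR records into context in the first place, which addresses the exfiltration case above.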
Ok_Assistant_1833@reddit (OP)
Thanks for sharing your thoughts! The ‘local vs cloud’ debate often feels like it’s solving the wrong layer of the problem. The policy/control layer you mentioned seems to be the missing piece.
How are teams actually implementing that today (if at all)? Is it mostly manual controls, or are you seeing any structured approaches working in practice?
Deep_Ad1959@reddit
the thing people forget is that most of this sensitive data already lives on your machine unencrypted. your browser stores autofill profiles with full names, addresses, phone numbers, payment cards, saved passwords, plus your entire browsing history and bookmarks. all of it sits in local sqlite databases that any process with your user permissions can read. the AI question is kind of secondary to the fact that this data is already there, completely exposed to any app you install.
Buildthehomelab@reddit
AI and safe is an oxymoron
Ok_Assistant_1833@reddit (OP)
I can see why it feels that way.
Maybe the question isn’t whether it’s “safe” in absolute terms, but:
→ how much control and visibility you have over what it’s doing
That seems to be where most of the differences show up.
Tatalebuj@reddit
All comments are now data for the AI, so posting opinions now seems like a potentially bad idea. That is not to suggest that my own opinion of AI, or its big brother AGI, would have any bearing on this comment, and it should not be used to infer any causality or linkage between them.
Ok_Assistant_1833@reddit (OP)
That’s an interesting angle.
I guess that’s the broader question; at what point does participating in open systems start to feel like contributing data vs just having a conversation?
Feels like that boundary is getting blurrier.
Simon-RedditAccount@reddit
All sensitive operations should be performed on a dedicated, permanently offline, airgapped machine. Period.
For less sensitive operations, proper OpSec practices should do the trick. Run only a small number of audited tools. Don't allow internet connectivity for them; instead, update them manually with a package/update manager, and download new models with `curl` instead of a cute local UI. Or just run stuff in a VM/container without outside connectivity; allow talking to your reverse proxy only. Don't rush updates immediately. Scan new versions for malware. Keep connectivity logs, and collect them outside of your main machine (a prosumer or DIY router may help). Don't run every hot new tool you encounter, or run it in a properly sandboxed environment. Etc., etc. Basically, everything that applies to DevOps/homelabbing (or just sane computing) applies here as well.
> AI systems today are optimized for: ... speed ... convenience
Security always comes at a cost to convenience, no matter whether you're dealing with AI or basic stuff like `password123` vs `correct-horse-battery-staple`.
> you are literally in locallama (u/LienniTa, u/jacek2023 and others)
A supply chain attack (or plain malware) is a very valid concern. Local AI only means that your data is not subject to immediate ~~mass surveillance~~ AI training by a large company. It does not rule out risks of downloading malware under the guise of 'new hot tool' (or a compromised update of a legit tool).
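One concrete piece of the "scan new versions / don't blindly trust downloads" advice is hash verification: check a downloaded model file against a digest published out-of-band before loading it. A minimal sketch; the path and expected hash are whatever the publisher provides:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB model files don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> None:
    """Raise if the file on disk doesn't match the published digest."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise RuntimeError(f"hash mismatch for {path}: got {actual}")
```

This doesn't protect against a compromised publisher (the digest and the file come from the same party), but it does catch tampered mirrors and corrupted transfers, which covers the "compromised update of a legit tool" case when digests are published on a separate channel.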
Ok_Assistant_1833@reddit (OP)
This is a great point, and I like how you’re framing it as an extension of existing OpSec practices.
The supply chain angle is especially interesting.
Local AI reduces one class of risk (external data exposure) but introduces others, like supply-chain and local-malware risk.
So in a way, the trust model shifts from:
→ “Do I trust the cloud provider?”
to
→ “Do I trust everything running locally?”
Which is arguably harder for most people.
Curious, how do you personally balance that tradeoff without going full airgapped for everything?
LienniTa@reddit
thats unrelated to llm trust
ai_guy_nerd@reddit
I draw a hard line at anything that leaves my infrastructure. The contradiction you're pointing out is real: we've outsourced data safety to speed and convenience.
But there's an actual alternative now. Local LLMs + open-source agents running on your own hardware solve this, and they're genuinely usable. No cloud upload, no compliance theater, actual control. Ollama, Llama 2/3, even smaller models like Mistral: most of them handle PDFs, documents, and reasoning well enough for real work.
The tradeoff is speed. A 7B model running locally on a GPU or even CPU is slower than hitting an API. But for sensitive stuff, you probably want that friction anyway. It gives you time to think.
The enterprise pattern (data governance, auditability, compliance) should honestly be the default. We just treated it as "too expensive" and internalized the risk instead.
Ok_Assistant_1833@reddit (OP)
This resonates a lot, especially the point about “actual control” vs what you called compliance theater.
I think where I’m still trying to understand the boundary is:
Local setups clearly solve the data leaving your system problem.
But they don't automatically solve the governance side: who can query what, and what the model is allowed to see.
So it feels like:
→ local gives you control
→ but you still have to design how that control is applied
Curious how you think about that layer—do you mostly rely on discipline/setup, or are there tools you trust for it?
Lissanro@reddit
But I have full control of my AI tools: I run everything locally and can run any model I need directly on my rig, up to Kimi K2.5 or GLM 5.1. In fact, on many projects I work on, I am not even allowed to send anything to a third party, and I wouldn't want to send my personal information either, so fully local frameworks and models are the only choice for me.
That said, there is another reason: reliability. I can always count on using the models I chose, and no one can take them away or add more guardrails that interfere with my usage.
As for security, in addition to all the standard measures, it is important to have certain verification layers, or to verify manually, especially for things that can be open to the internet. LLMs can sometimes hardcode things they were not supposed to, or may leak unintended data (for example, allowing fetching of any records from a DB, which seems to work fine but can allow access to information beyond what the specific user should have).
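That DB failure mode can be shown with a toy sqlite example; table and column names here are made up for illustration, but the shape is what LLM-generated endpoints often get wrong:

```python
import sqlite3

def fetch_record_unsafe(db: sqlite3.Connection, record_id: int):
    # Leaky: "works fine" in testing, but any authenticated user can read
    # any record just by guessing IDs.
    return db.execute(
        "SELECT body FROM records WHERE id = ?", (record_id,)
    ).fetchone()

def fetch_record(db: sqlite3.Connection, record_id: int, user_id: int):
    # Scoped: the ownership check lives in the query itself, so a wrong
    # ID simply returns nothing.
    return db.execute(
        "SELECT body FROM records WHERE id = ? AND owner_id = ?",
        (record_id, user_id),
    ).fetchone()
```

Both versions pass a naive "can I fetch my own record?" test, which is why a verification layer (or manual review) matters: the bug only shows when you test with *someone else's* ID.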
hugobart@reddit
what hardware?
Lissanro@reddit
I have shared details about my rig here, and here I shared my performance for various models.
Ok_Assistant_1833@reddit (OP)
This is a great breakdown, especially the point about permission boundaries and context leakage.
The part that stood out to me is the point about unintended data leakage.
That feels like a fundamentally new challenge with LLM systems vs traditional apps.
You’re not just controlling access, you’re controlling what the model sees.
Curious—how are you thinking about enforcing that in practice?
Strict isolation? Filtering layers?
Frosty-Cup-8916@reddit
What's the risk with Local LLM?
No-Refrigerator-1672@reddit
There is a risk of leaking the data if somebody gains access to your machine and your AI keeps history. This is a real concern if your web UI is exposed over the network.
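The cheapest mitigation for the exposed-web-UI case is binding to the loopback interface so nothing else on the LAN can reach it. A sketch using the stdlib server as a stand-in; most local LLM frontends (llama.cpp's server, for example, via `--host`) expose a similar bind setting:

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

def make_server(port: int = 8080) -> HTTPServer:
    # 127.0.0.1, not 0.0.0.0: only processes on this machine can connect,
    # so the chat frontend is unreachable from the rest of the network.
    return HTTPServer(("127.0.0.1", port), SimpleHTTPRequestHandler)

server = make_server(0)  # port 0: let the OS pick a free port
print(server.server_address)
server.server_close()
```

If you genuinely need LAN access, putting the UI behind an authenticating reverse proxy beats exposing it directly.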
Frosty-Cup-8916@reddit
My ISP controls my network firewall, and they are pretty strict. It would be pretty hard for an outsider to gain access to my network, but yeah, it's still a risk if they are in WiFi range, I guess.
No-Refrigerator-1672@reddit
Are you sure you don't have a compromised smart doorbell camera, smart LED lamp, etc. on your local network? If you use a laptop, do you make sure to shut down all of your local web services before leaving the house? After all, are you sure that your ISP is competent enough not to mismanage the firewall, and that your router's firmware is always updated? If you're a gamer, before hosting a multiplayer game with your friends, do you make sure that your game server does not contain any 0-days? These are just a few examples of how your local network can get compromised while you maintain an illusion of security.
Frosty-Cup-8916@reddit
My risk tolerance is pretty high.
Nothing is fool proof, especially when it comes to zero-days no one knows about.
I am worried about my IoT though, my TP link doorbell and video cameras have access to the Internet even without having to port forward so that is a real vector of concern for me.
Simon-RedditAccount@reddit
Also surprised by your downvotes. A supply chain attack is a very valid issue. As I say in my other comment, local AI only means that your data is not subject to immediate ~~mass surveillance~~ AI training by a large company. It does not rule out risks of downloading malware under the guise of 'new hot tool' (or a compromised update of a legit tool).
And - if you're processing anything remotely sensitive - you should just follow proper deployment+isolation practices.
No-Refrigerator-1672@reddit
They probably believe that their app instances can only be accessed if a person has physical access to their PC, and thus think that my message does not apply. Typical internet ignorance, nothing new.
Zeikos@reddit
That has nothing to do with LLMs, that's a generic security issue.
Ok_Assistant_1833@reddit (OP)
That’s fair, and I actually agree it overlaps heavily with general system security.
The difference I’m trying to think through is:
AI systems increase the surface area of interaction with data.
Instead of static access (files, apps), you now have a model sitting between the user and the data.
So the underlying risk is similar, but the interaction model is different.
Zeikos@reddit
What are you talking about?
APIs aren't a new thing.
Databases aren't either.
Yes, you have prompt injection, but that's irrelevant; you shouldn't have it accessible from the internet, just like you shouldn't expose ssh with default configs.
Which falls under classic security.
Simon-RedditAccount@reddit
Don't forget supply chain attacks. An 'AI tool' that ships daily, by a team that's not obsessed with security is more likely to be compromised rather than something like MS Office package.
sine120@reddit
If someone has access to the filesystem where my chats are, they have access to the docs. It's the same machine.
portmanteaudition@reddit
If you ever pay for something online, have any files just sitting on your system, log into banks, etc. it's all at risk on your local network.
john0201@reddit
“If they gain access” could be said about anything. I don’t follow.
Infninfn@reddit
That's the same risk as any other frontend on the network and your PC with all the browser history and auth tokens sitting in it.
Ok_Assistant_1833@reddit (OP)
Good question, and I think this is where it gets nuanced.
Local LLMs reduce external exposure risk significantly, but they don’t eliminate risk entirely; they shift it.
For example, locally stored chat history becomes its own attack surface.
So I see it less as “safe vs unsafe” and more as:
→ where does the trust boundary move?
huzbum@reddit
What are sensitive documents? I don't have anything worth anything to anyone but me, *shrug*.
Do you have anything worth Anthropic or OpenAI risking the reputational damage of peeking at it and doing anything with it?
Do you use email? Self hosted, or is it Gmail, outlook, etc?
Do you use an apple or android phone?
I’ve put my own medical documents into Gemini, and I’m fine with whoever looking at them, maybe they could give me a second opinion on my superior end plate deformity.
That being said, I think using a local LLM is safe, just as safe as using excel or whatever. We are basically putting our privacy in the hands of mega corps every day. The only insurance we really have is the reputational damage they would suffer for leaking it.
Ok_Assistant_1833@reddit (OP)
That’s interesting. And I think this is where the definition of “sensitive” varies a lot by context.
For some people it's financial documents or client data; for others, it may not feel relevant at all.
I guess the question becomes:
→ does the perception of “nothing sensitive” change based on role / use case?
Mollan8686@reddit
Tradeoffs, as always in life. People give Google access to email, location, history, naked pics, messages, health docs, etc… and we are wondering about AI systems?
Ok_Assistant_1833@reddit (OP)
Yeah, that’s a really important point.
A lot of this comes down to behavior vs awareness.
People already trade privacy for convenience in many areas.
What I'm curious about is whether AI changes that equation.
It feels like the same tradeoff, but at a different level of depth.
jikilan_@reddit
Only those who set up and host the local LLM infrastructure themselves can confidently say that.
And yes, I do trust the paid version of O365 Copilot and share all the P&C stuff. MS will get sued if anything is leaked.
cunasmoker69420@reddit
my sweet summer child
portmanteaudition@reddit
LOL
Greedy-Lynx-9706@reddit
*RAG*