Jupyter notebooks touching production data are application code from a security standpoint

Posted by UnhappyPay2752@reddit | Python | 15 comments

Started auditing how our data team works and the security picture was worse than expected. Notebooks querying production databases directly, credentials hardcoded in cells because environment variable setup felt like friction, code that's been copied between notebooks so many times the original author is impossible to trace.

None of it goes through any review process that the engineering team's code goes through. No SAST, no security-minded PR review, no scanning of any kind. The assumption seems to be that notebooks are exploratory and therefore informal, but at some point exploratory code started running against production data with production access and that distinction stopped meaning anything.

These notebooks often have broader data access than the application code because the people writing them needed to move fast and used their own credentials. That access never got revisited.
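As a rough illustration of the kind of scanning that is missing here: an .ipynb file is JSON, so a first-pass check for hardcoded credentials needs only the standard library. This is a minimal sketch, not a replacement for a real secret scanner. The function name `find_hardcoded_secrets` and the two regex patterns are assumptions made for illustration, and they will miss plenty of cases.

```python
import json
import re

# Hypothetical, deliberately small rule set; real scanners ship far more patterns.
SECRET_PATTERNS = [
    # Assignments such as: db_password = "...", api_key: '...'
    re.compile(r"""(?i)(password|passwd|pwd|secret|token|api_?key)\w*\s*[=:]\s*["'][^"']+["']"""),
    # Connection URLs with inline user:password@host
    re.compile(r"[a-zA-Z][a-zA-Z0-9+.\-]*://[^\s:/@]+:[^\s@/]+@"),
]


def find_hardcoded_secrets(notebook_text):
    """Return (cell_index, line) pairs for code-cell lines that look like secrets."""
    notebook = json.loads(notebook_text)
    findings = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        for line in source.splitlines():
            if any(pattern.search(line) for pattern in SECRET_PATTERNS):
                findings.append((index, line.strip()))
    return findings
```

Run over every notebook in a shared drive or repo, something like this at least gives the audit a list of cells to look at first.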

SV-97@reddit

> code that's been copied between notebooks so many times the original author is impossible to trace

Maybe look at marimo and push them towards that. Aside from being substantially nicer to use (and deploy) and less bug-prone, it tackles this issue in multiple ways: it's actually git-versionable (without having unreadable commits), you can just import marimo notebooks like normal Python modules (because they are), and you can use the standard credential management mechanisms you'd also use for any other project.

marr75@reddit

All good points. In my experience, there's often a skills/confidence gap between people who use Python with and without notebooks. This gap results in a high volume of copy-pasted code, coding by coincidence, and rough source control practice.

Justbehind@reddit

You don't want analysts querying your production data? What's the purpose of storing the data then?

Cynyr36@reddit

So set up a non-prod database that's, say, 4 to 6 hours behind prod. Have the notebooks connect to that for exploration. Then write a policy for how to switch to prod in a way that doesn't require 15 approvals and a 2-week waiting period. My client needs the data "now".

New-Molasses446@reddit

Finding creds hardcoded in notebooks isn't a notebooks problem. It just means the path of least resistance for production data access runs through a tool with no security controls.

UnhappyPay2752@reddit (OP)

Agreed. No one made a deliberate decision to store credentials in notebooks; it just kept being the fastest option.
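One way to make the safe option nearly as fast as pasting a password into a cell is a small shared helper that reads credentials from environment variables and fails loudly when they are missing. This is a minimal sketch: the helper name `get_db_credentials`, the `ANALYTICS_DB` prefix, and the variable names are hypothetical, not anything the team in the post actually uses.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required environment variable is not set."""


def get_db_credentials(prefix="ANALYTICS_DB", environ=None):
    """Read database credentials from environment variables.

    The variable names (<prefix>_USER, <prefix>_PASSWORD, <prefix>_HOST) are
    hypothetical; use whatever naming your team agrees on.
    """
    env = os.environ if environ is None else environ
    names = {field: f"{prefix}_{field.upper()}" for field in ("user", "password", "host")}
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in names.values() if not env.get(name)]
    if missing:
        raise MissingCredentialError(
            "Set these environment variables first: " + ", ".join(missing)
        )
    return {field: env[name] for field, name in names.items()}
```

In a notebook cell this becomes `creds = get_db_credentials()`, which is shorter than typing the password and leaves nothing sensitive in the saved file.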

dparks71@reddit

At my org I pointed out that a lack of policy would lead to this and they needed to write one so we knew whether to use secrets.json or .env files. Honestly to me, an accidentally committed .env file is significantly than hard-coded credentials in a non-version-controlled notebook.

They didn't put a policy in writing and tried to ban Python as a knee-jerk reaction after asking me if it was PowerShell.

oliver_extracts@reddit

the credentials thing is the part that actually gets people. a data analyst with admin-level db creds running ad hoc queries has more blast radius than most of your application code, and those creds usually live in .ipynb files that get committed to git or shared over slack without a second thought.

the review gap is real too. the assumption that notebooks are just exploration breaks down the moment they touch prod, but the process never catches up. ive seen teams where the notebook code is doing more data mutation than anyone realized because it was never treated like it needed a schema change review or anything resembling change management. the access patterns alone are worth auditing separately from the code.
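A crude way to start auditing how much mutation lives in notebooks is to scan code cells for SQL keywords that write or change schema. The sketch below is a heuristic only: it assumes the SQL appears as literal text in the cell, the function name `find_mutating_statements` is made up for illustration, and it will not see queries built dynamically or issued through an ORM.

```python
import json
import re

# Heuristic patterns for statements that modify data or schema.
MUTATING_SQL = re.compile(
    r"\b(?:(INSERT)\s+INTO|(UPDATE)\s+[\w.]+\s+SET|(DELETE)\s+FROM"
    r"|(DROP|ALTER|TRUNCATE)\s+TABLE)\b",
    re.IGNORECASE,
)


def find_mutating_statements(notebook_text):
    """Return (cell_index, keyword) pairs for mutating SQL found in code cells."""
    notebook = json.loads(notebook_text)
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        for match in MUTATING_SQL.finditer(source):
            # Exactly one alternative captured a keyword; pick that group.
            keyword = next(group for group in match.groups() if group)
            hits.append((index, keyword.upper()))
    return hits
```

Anything this flags is a candidate for the same change management a migration or an application write path would get.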

Eulerious@reddit

Why is this flagged as discussion? This is a rant, and yes, your systems and your access policies suck.

andy4015@reddit

At my very large banking company the finance department has decided that all staff should start using Python... Even those who have been using Excel for a decade and still haven't figured out the basics. But don't worry! They can just use Copilot and copy/paste whatever code comes out of that to interact with prod.

They currently share data between teams in PowerPoint slides. But don't worry, they've set up AI agents to scan the slides for conversational analytics.

AI is giving incompetent people way too much power

Historical_Trust_217@reddit

Production databases should require service account credentials with row-level or schema-level access scoping, not personal credentials, which limits blast radius regardless of how many notebooks exist or who wrote them.
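As a sketch of what schema-level scoping for a service account could look like, the helper below generates the SQL for a read-only role limited to one schema. It assumes PostgreSQL; the function name `read_only_role_sql` and the example role and schema names are hypothetical. Row-level scoping would need row security policies on top of this and is not covered here, and the role's password or other authentication would be set separately.

```python
import re

# Accept only plain lowercase identifiers so the names can be interpolated safely.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def read_only_role_sql(role, schema):
    """Build PostgreSQL statements for a login role that can only SELECT in one schema."""
    for name in (role, schema):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        f"CREATE ROLE {role} LOGIN;",
        f"GRANT USAGE ON SCHEMA {schema} TO {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {role};",
        # Covers tables created later in this schema by the role running this statement.
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} GRANT SELECT ON TABLES TO {role};",
    ]


if __name__ == "__main__":
    for statement in read_only_role_sql("notebook_reader", "reporting"):
        print(statement)
```

With notebooks connecting as a role like this, a copied cell containing a stray write statement fails with a permission error instead of changing production data.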

CleanOrganization155@reddit

If it touches production data on a schedule it is a production service. Govern it like one.

UnhappyPay2752@reddit (OP)

Exploratory code you run once is a different risk profile from the same notebook running on a cron against production data with broad access credentials.

aloobhujiyaay@reddit

Honestly, this is also why isolated execution environments matter so much for data workflows. I’ve seen teams start using tools like Runable specifically to reduce the "works on my notebook with production creds" problem and make experimentation environments more controlled and auditable without slowing iteration too much.

H3rbert_K0rnfeld@reddit

Welcome to the SPAM canary. We also can gefilte fish.