CrowdStrike outage made me realize we're doing cloud security backwards
Posted by Tiny_Habit5745@reddit | sysadmin | View on Reddit | 125 comments
The recent CrowdStrike outage broke everyone's brain, but it got me thinking about cloud security. We have perfect CI/CD scans: every container checked, every policy reviewed, security dashboards all green.
Last week a dev accidentally deployed a test service that hammered prod with 10k queries/sec. Perfect security posture... no CVEs, minimal privileges. But at runtime it destroyed our infrastructure.
CSPM saw everything as compliant. But nobody watched what it actually did when running. We obsess over static posture while being blind to runtime behavior. Found containers making sketchy API calls, services with excessive network access, processes doing random stuff. All "secure" according to scans, all dangerous in reality. Reality happens at runtime.....
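To make "watch runtime behavior" a bit more concrete, here's a rough sketch (Python, with a placeholder metric source, not our actual stack) of the kind of check that compares what a service is doing right now against its own recent baseline instead of against its scan results:

```python
# Minimal sketch: flag services whose request rate spikes far above their own
# rolling baseline. get the rate from whatever metrics backend you actually
# use (Prometheus, CloudWatch, etc.); the demo numbers below are synthetic.
import statistics
from collections import defaultdict, deque

WINDOW = 60          # samples of history kept per service
SPIKE_FACTOR = 5.0   # alert when the current rate is 5x the median baseline
MIN_SAMPLES = 10     # don't alert until we have some history

history = defaultdict(lambda: deque(maxlen=WINDOW))

def check_rate(service: str, current_rate: float) -> bool:
    """Return True (and print an alert) if current_rate looks anomalous."""
    baseline = history[service]
    anomalous = False
    if len(baseline) >= MIN_SAMPLES:
        median = statistics.median(baseline)
        if median > 0 and current_rate > median * SPIKE_FACTOR:
            print(f"ALERT: {service} at {current_rate:.0f} req/s "
                  f"vs. baseline median {median:.0f} req/s")
            anomalous = True
    baseline.append(current_rate)
    return anomalous

# Demo with synthetic numbers: steady traffic, then a runaway test service.
for rate in [110, 95, 105, 100, 98, 102, 97, 101, 99, 103]:
    check_rate("checkout-api", rate)
check_rate("checkout-api", 10_000)   # fires the alert
```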
Fatality@reddit
Ad?
Kardinal@reddit
Zero post history. Makes me wonder.
jamesaepp@reddit
I've seen this a ton lately. Suspicious post, click on the user .... 0 history. Lots of karma though for accounts that are only a few months or a year old.
Catsrules@reddit
Guessing it is because Reddit started allowing users to hide their post history.
https://www.reddit.com/r/reddit/comments/1l2hl4l/curate_your_reddit_profile_content_with_new/?share_id=at-rlQxTzHi9bs2z5nOqp
Most of the time I am onboard with more privacy, but not for Reddit. Reddit is for anonymity not privacy. This just makes it harder to identify bots.
Kardinal@reddit
Reddit is a place for anonymity not privacy.
Good way to put it.
Hiding post history is basically hiding context. We should be accountable for our words at least to some degree.
BatemansChainsaw@reddit
It aids in anonymity by preventing people from collecting the bits of wayward data and makes it harder to figure out who someone is.
Take each comment for what it is at face value instead eh?
Catsrules@reddit
That is a fair point. I personally would recommend just getting separate accounts if that is something you are worried about. There is a huge advantage to keeping Reddit fully open; hiding things sucks.
Most of the time that is fine, but there are occasions where I would like to know more about an account. For example if they are recommending something. Or if a post might seem a little weird and it is/was nice to have the option to just verify the account is a real person, maybe English isn't their first language and they are just a little off from a native speaker.
jamesaepp@reddit
I think we as a society need to adopt a half your age + 7 years rule for this.
Ok_Frame8183@reddit
some of us are deleting our posts. not feeding the AI machine for free.
Catsrules@reddit
Well you will be counted as a suspicious account in my book.
I would bet AI bots grab posts fairly quickly. If you want to stop AI from grabbing your post don't post at all.
Ok_Frame8183@reddit
fuck it big homie, we gotta try lol.
mac412@reddit
Active Directory
awesomenessjared@reddit
And/or an AI Spam bot...
webguynd@reddit
“Security” to me has always meant “availability” as well as typical protection from malicious actors stuff. Prod going down is just as much of a security incident as compromise IMO and yes I agree, most orgs don’t do enough monitoring and instead rely on tools for detection which as you’ve found is flawed.
Monitoring should be step 0 not the last step. You have to have visibility into everything or there’s no possible way to have good security.
My hot take is most “security tools” are largely just grifts. They are expensive and made simply to check boxes on audits without actually providing much real-world benefit. Like you said, they are focused on posture and not much else. IMO this gives orgs an illusion of security, which is more dangerous.
HeKis4@reddit
I've been taught that confidentiality is only one of the three pillars of "security", the other two being availability and integrity. We very often focus only on the first.
J-Cake@reddit
This hurts because neither are security related
HeKis4@reddit
Are they? A DDoS attack or ransomware is definitely a security threat, yet they are "only" attacks on availability and integrity respectively.
Elfalpha@reddit
No, I'm pretty sure they are. If you think in physical terms, sabotage is just as much of a security threat and responsibility as theft is. Compromising data availability and integrity is sabotage.
https://www.fortinet.com/resources/cyberglossary/cia-triad
J-Cake@reddit
I mean sure, but in my mind the best kind of security is a failsafe security system. You shouldn't need to rely on a service being up for it to be safe, y'know?
jaank80@reddit
This hurts because it shows total ignorance of information security. The CIA triad is the basis of every good infosec program.
randommm1353@reddit
War flashbacks to the CIA triad
PristineLab1675@reddit
A NAS filling up and going down, bringing down shared storage for users and servers, is definitely NOT a security issue. Availability of systems is a pillar of security, but a huge amount of availability does not involve security at all.
Here’s another one - your colo provider goes dark, power outage. Is that a security issue? Not at all.
Now, when your internet facing apps are met with a ddos, now you have a security focused availability issue.
You sound like you are around people doing security and don’t fully understand what they do or how they do it. Some tools are check boxes. Some tools are incredibly effective. Look at cribl and abnormal as two examples of dedicated security products that actually pack a punch. Then, alternatively, you have stuff like rapid7 finding vulnerabilities in a narrow range, when a tool like wiz does the same thing in a broader scope and better.
jwrig@reddit
It could be a security issue depending on what caused the files to be written. Availability has been a pillar of security forever.
PristineLab1675@reddit
What I’m hearing is that the business's IT disaster recovery and business continuity plans fall under the security team, is that right? The security team should lead the colo, the security team should be in charge of backups, the security team is responsible for any and all availability issues, is that right?
Do you have a single example of a successful cybersecurity incident where the malicious actor wrote a bunch of data semi-locally to cause an outage? Why would an attacker fill up a shared drive? That doesn’t make any sense. If you can show me a single scrap of evidence that there is malicious benefit to writing a ton of data to a shared drive, I would be willing to concede the point, but I don’t think you can.
Availability is absolutely a security responsibility. But from a top-down business perspective, availability is much broader than security, and the huge majority of that burden is on IT support.
jhaar@reddit
I'm being pedantic, but I think it can be argued "availability" falls under Ops rather than Security. I put it in the same camp as disk-space monitoring. Do you have 20K workstations? Do you wonder why 300 are missing months of patches? Did you know they all have less than 1G of free disk space? Users don't notice/care until what they are doing fails, but modern patching/AV/EDR needs tonnes of free disk space to update successfully. Similarly, CPU, memory, and network bandwidth are all resource allocation issues, and all have fatal consequences when ignored. Yes, DDoS is a thing, but that's about the only availability thing that crosses over to the security side of the divide. Ironically, all cloud API services have built-in maximum calls/sec/customer limits so the service provider can control their costs, i.e. they deliberately build availability failure into their products and blame the customer 😄
RoundFood@reddit
I don't even think this should be a hot take. So many practices in security are intended to protect against things that never happen, or to cover some compliance requirement where people recognize that there's a potential for foul play even though in the real world it basically never happens. Truth be told, enforced passkeys and a good XDR/EDR and you're ahead of 98% of organizations. Like, has there ever been a real case of someone walking into an office building and plugging their laptop into an ethernet port to hack the organization? Has this ever happened a single time in any developed country? Serious question, I'd love to read about any cases, I haven't come across any.
My hot take actually contradicts your main point and the triad. I don't think availability is actually security... unless there's a malicious actor involved. That A in CIA is too broad, something going down to me is operations, it's infrastructure, it's hardware, it can be any number of things, it's not necessarily security... unless there's malice, then it's security.
kangy3@reddit
You gotta think deeper though. Hardware going down is completely related to the security of the business, just in a deeper sense.
RoundFood@reddit
Sure, someone's laptop failing affects availability, that's obvious. But is that something for the security team to look into? I appreciate that with a strict reading of CIA it's part of security, but practically it's not really the same discipline.
kangy3@reddit
Yeah.... I wasn't thinking that exactly. More like other critical infrastructure. We learned the hard way once, when a core switch died, that previous staff had mixed up the cables so the redundancy wasn't there. Brought everything down.
zephalephadingong@reddit
In smaller orgs I think that makes sense. In larger ones availability should be almost entirely disconnected from security. Sometimes those two needs will clash, and the differing viewpoints should be able to fight it out with each side being committed to their own goals IMO
wrt-wtf-@reddit
If you’re using Crowdstrike to assist in assessment, these issues can be seen.
Having said that, profiling an application as it runs seems to be missing from your solution. Checking for external calls to systems not agreed in the security assessment (and network flows) should have sounded alarms prior to production deployment.
I’ve had this in my solutions for a very long time as it was a major issue that kept popping up in the 90’s as devs started to put phone-home and stat counters in their libraries. This behaviour continues today in many libraries and has nothing to do with the functioning of the software that uses them. In modern terms the concern is toward supply chain attacks and these types of calls setting up a vector for attack with seemingly innocent traffic talking to a CnC and having extra fun and games taking off…
Why isn’t this being validated? Dunno. Should it be? Absolutely, especially if you’re in a devops world that isn’t taking into account supply chain attacks from public repositories, potentially compromised commercial libraries, or commercial Trojans supported by state actors.
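Even something as dumb as diffing observed egress against what the assessment actually approved goes a long way. Rough sketch of the idea (Python; the services, destinations, and allowlist here are made-up placeholders, and in practice the flows would come from your flow logs or proxy):

```python
# Rough sketch: compare observed egress destinations against the destinations
# approved in the security assessment, and report anything talking to hosts
# nobody signed off on (phone-home libraries, stat counters, potential C2).
APPROVED_EGRESS = {
    "payments-svc": {"api.stripe.com", "vault.internal"},
    "report-svc":   {"s3.amazonaws.com"},
}

# Hypothetical observed flows: (service, destination host)
observed_flows = [
    ("payments-svc", "api.stripe.com"),
    ("payments-svc", "stats.some-sdk-vendor.io"),   # phone-home nobody approved
    ("report-svc",   "s3.amazonaws.com"),
]

def unapproved_flows(flows, approved):
    findings = []
    for service, dest in flows:
        if dest not in approved.get(service, set()):
            findings.append((service, dest))
    return findings

for service, dest in unapproved_flows(observed_flows, APPROVED_EGRESS):
    print(f"REVIEW: {service} -> {dest} is not in the approved egress list")
```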
enigmaunbound@reddit
You aren't wrong. Most security frameworks include monitoring as the last step. That is tough to do. Ideally you would have established KPIs for various activities. Then baseline those KPIs to identify things like your API Hammer Time. Does your dev process have a deploy and test concept?
WackoMcGoose@reddit
...I giggled at the phrase "API Hammer Time".
"Looks like the hammer... just found a nail!"
BdoubleO100 hit the server too hard
"Dang it!"
Symbolis@reddit
Wasn't expecting to catch a BDubs reference here.
fillgeez@reddit
Wonder what kinda overlap there is between sysadmins and hermitcraft or just some of the old minecrafters in general.
I know modded minecraft (especially stuff like computercraft) is a pipeline to sysadmins for sure. Monitoring, scripts to automatically process things, crashing your network and having to go hunt down rogue machines.
Arudinne@reddit
Running any sort of game server is a gateway, especially once you start modding it.
Sadly a lot of games don't do custom servers anymore.
Turdsindakitchensink@reddit
That’s what’s got me started. Running local LAN parties. Awesome times.
Arudinne@reddit
A LAN party is where I first got into Minecraft. Way back during Alpha 1.1.2 in 2010. I remember boats could crash the server and doors often didn't work properly. Fun times.
Advanced_Vehicle_636@reddit
Indeed. I'm hoping we see the return of the modding community with Arma 4 though. Arma was a great series of games.
SMS-T1@reddit
Don't forget Elder Scrolls 6.
Whenever that comes out, I expect loads of modding activity.
Drywesi@reddit
Watch it have always-online so Copilot can help you select approved mods to install.
Morkai@reddit
I never played any of the ARMA games but Reforger looks like a good time. I've watched a few videos of this British guy "Will From Work" and it looks like a ball. Mix of boots on the ground run and gun, or cruisy logistic package delivery stuff, or base building stuff.
zeroibis@reddit
Yea and a lot of games died early because of it. Tribes: Ascend comes to mind as a game that really would have been improved if you could have self hosted.
ChatonIsHere@reddit
Oh don't get me started, I swear I spend half the day dealing with VLANs only to go home and have to VLAN my AE2 ME system in Minecraft
510Threaded@reddit
dont forget greg
fillgeez@reddit
True, both evoke the feeling at times that I should quit everything and become a goat farmer
enigmaunbound@reddit
I do enjoy abusing the English language.
pppjurac@reddit
For real expertise in swearing, consult Balkan people, esp. Serbians and Montenegrins
WackoMcGoose@reddit
True, but Polish is more fun to swear in. "Co do pierdolony, kurwysyn?!"
enigmaunbound@reddit
I'm just speaking the Dog Language over here.
alficles@reddit
That's ruff, buddy.
oceanave84@reddit
Zamknij sie, kurwa!
😀
510Threaded@reddit
I read that in his voice, and remember Tango's stream of that
notarealaccount223@reddit
We are a small shop still catching up to CI/CD. We manually watch some stats the day of a deployment (3-6 times a month for the big app).
More than once we've seen a change in average response time that was well below warning thresholds, but a change nonetheless. Nine times out of ten it was a poorly written query that we would otherwise never have found until 10 weeks later, when the feature was fully utilized and it started causing major problems.
enigmaunbound@reddit
What other metrics would you be able to watch? Most of my experience is in systems and ops monitoring. It's interesting seeing the issue from a code deployment perspective.
notarealaccount223@reddit
CPU utilization and queue depth. Memory utilization. Disk iops and latency.
We look for changes in the baseline after a code change.
My worry with CI/CD is that those subtle changes go unnoticed until it is too late to find them.
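For what it's worth, even a dumb script that diffs those metrics before and after a deploy catches a lot of it. Something like this sketch (Python; the numbers and the 20% threshold are placeholders, not our real values):

```python
# Rough sketch: compare post-deploy metric averages against the pre-deploy
# baseline and flag anything that drifted more than a threshold, even if it
# is still well under alerting thresholds.
DRIFT_THRESHOLD = 0.20   # flag changes bigger than 20%

# Placeholder numbers - in practice pull these from your monitoring system
# for a window before and a window after the deployment.
baseline = {"cpu_pct": 35.0, "queue_depth": 2.1, "mem_pct": 61.0,
            "disk_iops": 1400.0, "disk_latency_ms": 3.2, "resp_time_ms": 180.0}
post_deploy = {"cpu_pct": 37.0, "queue_depth": 2.3, "mem_pct": 62.0,
               "disk_iops": 1450.0, "disk_latency_ms": 3.4, "resp_time_ms": 245.0}

for metric, before in baseline.items():
    after = post_deploy[metric]
    drift = (after - before) / before
    if abs(drift) > DRIFT_THRESHOLD:
        print(f"CHECK: {metric} moved {drift:+.0%} after deploy "
              f"({before} -> {after})")
```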
devicie@reddit
Runtime visibility is the missing piece. Static scans catch known vulnerabilities but miss behavioral anomalies that indicate real threats. We've seen device management where compliance dashboards show green while actual behavior suggests compromise or policy drift. The challenge is distinguishing legitimate anomalies from genuine threats. Systems that analyze runtime behavior and automatically respond seem to be where the industry is heading. What tools are you using for runtime monitoring beyond traditional CSPM?
JustSomeGuyFromIT@reddit
Worst nightmare virus / bot / hack would be something put on PCs that sleeps or doesn't do much except spread over time, then strikes months later for maximum damage.
Like a mass attacker killing or hurting a few people, waiting for a crowd to form, and then striking. Sorry for my weird example, but those tactics do exist and are used.
black_caeser@reddit
Well, the key word here is “compliant”. Most people nowadays confuse “compliance” with “security”, most notably a lot of IT (“cyber”) security “experts”. Real security is hard and requires a certain mindset, starting with defensive design. But that would cost time, money and make a lot of stuff not viable while not getting any of these sweet, sweet check boxes checked.
Btw, all your findings of sketchy API calls, excessive resource usage, and processes doing random stuff mean that mindset is already lacking among the developers. Finding such stuff with other tools in a live environment is basically your last line of defense; all other layers have been broken already. But it’s a lot cheaper and easier — not just for management — to buy those tools than to actually write and operate software in a secure manner. Especially if the business sees the risks covered by insurance as long as you are — you guessed it — compliant.
bostonsre@reddit
Rate limiting API requests isn't the job of security tools; you would use a different tool for that. Security software is bloated, slow, and unreliable as it is; I wouldn't want it going anywhere near my requests to try to throttle them.
LipglossMystery@reddit
Fr the “last line of defense” thing hit hard. if runtime tools are catching it, that means all the “secure by design” layers already failed. scary when you think about it.
knightofargh@reddit
Turns out security is hard. Like really hard. 99% of devs suck at it because of the difficulty so instead they craft elaborate ways to get around controls so everything looks compliant.
I’m always a fan of making application owners demonstrate their “compensating control” for a finding both works and actually exists. I’d say 1/3 of the time they decide it’s easier to just code to meet the control than keep up their kayfabe around compensating controls.
But what would I know? I’m just a JSON-monkey these days.
Intrepid_Chard_3535@reddit
Compliance is the floor, not the ceiling. The CrowdStrike crash exposed who actually had continuity processes and who was just waving around vendor logos. The ones who prepped properly degraded but stayed online. The rest collapsed and blamed CrowdStrike.
And let’s be real: this wasn’t some exotic, nation-state cyberattack. It was basically a glorified blue screen. Yet airports, shops, and government offices all went dark. That’s not a CrowdStrike issue, that’s pure incompetence and zero planning for running without your crutches.
The real problem is that boardrooms only fund the optics. They want green dashboards and flashy compliance toys, because that looks good in a quarterly report. Runtime detection, continuity planning, kill switches, chaos testing, none of that gets money until it’s already too late.
So we end up with companies that look perfect on paper, “compliant” in every way, but the moment something runs outside of the happy path the whole system falls apart. And then they act shocked when ransomware crews stroll right in. We don’t lack tools. We lack the will to fund real security instead of compliance theatre.
It’s embarrassing for the field, honestly
Regen89@reddit
Glorified bluescreen, what the fuck are you talking about?
It was a force push of code from a 3rd-party vendor that caused unrecoverable bluescreens until there were recovery instructions, and for the immediate future it required physically touching every endpoint that was affected.
That is not a glorified bluescreen, that is tantamount to one of the worst cyberattacks you could face, and it was completely out of individual orgs' hands.
Some people literally got LUCKY only a % of their infrastructure went down, some were not so LUCKY.
Do you know how much money it would cost to be "ready" to turn around something like that if it didn't turn out to be recoverable? Even orgs with tens of thousands of endpoints would be lucky to have even a floating 5-10% of laptop stock to do replacements. The man hours for rebuilds would be astronomical. This is not even touching servers / entire services being offline, many of which are revenue generating. God forbid you had to rebuild AD.
Businesses will always go all hands on deck for as long as possible to recover and get as much of their core business back up and running as fast as possible, but it's literally impossible to spin up your entire infrastructure on a dime for nearly any large enterprise out there. A good org will have playbooks and defined recovery plans for as much as possible, but the turnaround is not 24 hours or whatever you think is acceptable to recover from "going dark".
Insanely naive and inexperienced take.
knightofargh@reddit
Maersk could give you a pretty damn good idea what lucky and recovering from a cyberattack looks like.
jdsalaro@reddit
What the fuck is this comment?
No sources, no details, all fluff
Why comment at all?
Arnthy@reddit
They’re referring to this:
https://www.wired.com/story/notpetya-cyberattack-ukraine-russia-code-crashed-the-world/
That was a rough few weeks/months of my career.
jdsalaro@reddit
Thanks, not so nice memories 😅
Keep fighting the good fight ✊🏼
Adept-Midnight9185@reddit
Oh what, aren't you sitting on hot spares with wholly separate and air gapped infrastructure for everything your business uses? No? Scrub. /s
andrewsmd87@reddit
In multiple cloud environments with "redundant DNS"
richpo21@reddit
After the CrowdStrike outage there was a push by our support desk to DR the AVD environment. I’m still waiting for the answer on who’s paying for it.
Intrepid_Chard_3535@reddit
Thanks for this. You must work at Delta.
Regen89@reddit
Nope, just someone who was online at 2am when I first noticed my org was impacted and actually lived and breathed through the ordeal.
mixduptransistor@reddit
This is why I think Delta should lose their case against Crowdstrike over the outage. There are three dozen different things, totally unrelated to Crowdstrike or their security software and not security incidents at all, that would still trigger a meltdown in the same way, and Delta *still* would have been unprepared
Yes, Crowdstrike knocked over the first domino but it simply laid bare the lack of a true DR or BC plan. Regardless of the cause, you should not be caught flat footed by a major incident, period.
Intrepid_Chard_3535@reddit
There was an airline here with the exact same issues. Their check-in desktops all died. In a sane setup you’d have spares, swap them out, and be back online. Same with fashion houses and their cash registers, basic redundancy, nothing fancy. But shareholders are more important. They literally calculate two days of downtime against what they’d spend annually on proper redundancy. Easier to take it on the chin, blame someone else, and move on
EnragedMoose@reddit
Shit like this reminds me this sub is filled with people who have never seen anything of scale.
Intrepid_Chard_3535@reddit
There is no need to be an asshole about a simple example.
jdsalaro@reddit
Said the guy being an absolute asshole elsewhere
Sources + info or GTFO
Intrepid_Chard_3535@reddit
Thanks for making my point
bageloid@reddit
And what about the 40,000 Windows servers taken offline?
Of course Delta's IT practices are shit and they should be able to recover quicker, but let's not pretend it's as simple as swapping out some desktops.
Intrepid_Chard_3535@reddit
That was just an example. I'm not going to explain to people how to get your critical production workloads instantly recovered. Most people I have talked to don't understand and can't have normal conversations about it on Reddit. Therefore if you want to know, research it yourself.
bageloid@reddit
You can instantly recover 40000 windows servers?
_My_Angry_Account_@reddit
If they're all VMs and you have a good backup solution, you should be able to restore servers en masse without too much downtime. When your backup appliances and production servers have 100G links with all-flash storage, restores and backups can be rather fast even with large amounts of data.
You can get about 2PB of all-flash storage that can handle around 25-30GB/s of load for about $400K. It only takes up about 2-3U of rack space. The most expensive part is the drives: 122TB U.2 drives are around $15K each, and when you need 24 for the server plus 2-3 cold spares, it really adds up. I'd also recommend keeping cold spare servers if you can afford it, for faster recovery in case of hardware failures. The servers are usually cheaper than the drives.
Using 2x 100G connections, you should be able to restore about 1TB/minute if you have such a storage solution available.
Then again, getting the C-suite to spend money on infrastructure is like nailing jello to a tree, so not many places have such a solution available even if they can afford it.
bageloid@reddit
Say each of those 40,000 servers is 100 gigs in size.
At 1TB/minute that's 2.7 days
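Back-of-the-envelope, assuming the 1TB/minute figure from above:

```python
# Rough restore-time check: 40,000 servers at ~100 GB each, restored at
# ~1 TB/minute (the throughput figure quoted above).
servers = 40_000
size_gb_each = 100
restore_tb_per_min = 1.0

total_tb = servers * size_gb_each / 1000        # 4,000 TB to restore
minutes = total_tb / restore_tb_per_min          # 4,000 minutes
print(f"{total_tb:,.0f} TB -> {minutes/60:.0f} hours, ~{minutes/60/24:.1f} days")
# prints: 4,000 TB -> 67 hours, ~2.8 days
```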
Brekkjern@reddit
If you hire enough StarCraft pros, then you can cut down the time spent clicking at least. Their clicks-per-minute rate is insane!
bageloid@reddit
The precedent of allowing vendors to knock out all of your computers with no repercussions is... not great.
wrosecrans@reddit
The precedent of letting a vendor knock out all your computers because you figured lawsuits are a more practical insurance policy than investing in your own robustness is... not better.
kuahara@reddit
I'm with you on this. If an org that got their shit back online in 1 day was suing, fine. Whatever. None of my business. But Delta? Screw that. They had loads more issues and stayed down longer than everyone.
I wrote the same fix MS published and shared it a day ahead of MS. It worked great. We were resolved the same weekend. Maybe had some stuff to do Monday, I can't remember. Either way, Delta's posture was just plain awful and they're trying to pin it on a scapegoat. That should NOT be rewarded.
andrewsmd87@reddit
I love this. For me compliance is crap I have to do but don't want to. We are always way above what they ask for and the way auditors "vet" things is laughable
Sekuroon@reddit
So I'm curious, because this stuff isn't really my area but I got dragged into it along with the rest of our IT team. Please correct me if I'm wrong about anything here, but when a vendor whose product has kernel-level access to every machine you own pushes out a botched update that puts all your devices into BSOD loops they can't recover from, how do you recover from that when all your remote tools need the OS up to connect?
I consider us fairly lucky and it took us the entire day to get most of our user endpoints back up because each one required walking a remote user through navigating the windows recovery screen, putting in a 48 digit bitlocker recovery key and navigating to the corrupted file and deleting it via command line which was not fun to drag the user through. If you got lucky and your user wasn't a moron you could get them through it in about 7-10 minutes per device.
I certainly don't mind learning that we could have done better but do you have some tool or system that would have prevented it? I can't imagine there isn't a way to have done so but I just wonder how costly it would have been or what it relies on.
Intrepid_Chard_3535@reddit
That's not even that bad, tbh. If your disaster plan lists which people need to be up and running fast, the company has minimal impact the way you are explaining it. We have a certain number of laptops readily available that are already Intune-ready. The others just get "reimaged". Here you are only limited by your internet speed. Intune is ideal for this.
Sekuroon@reddit
We did get lucky. Many of our servers weren't affected, so our core environment stayed up and it was mostly just endpoints that got hit hard. Much of our environment is also laptops, many of which were asleep while the corrupt update was live, and it had been pulled by the time some of our users even got to work. We also discovered the manual fix method after just a couple of early risers called in, so at that point it was just a matter of service desk volume, but it still freaking sucked.
All of our endpoint devices are Intune'd and Autopiloted, so I guess Intune resets could have worked for much of it. Intune can't talk to the computer from the recovery screen, so we still would have had to walk users through a process. I don't think that process would be that hard, and I think users have the rights to reset their own machines, but I'm not certain of the exact steps from the recovery screen. I think an Intune reset would have been worse than walking them through the recovery screen fix, because with the fix there's nothing to download, install, or wait on to load content. You just get right back to where you left off.
I was just curious if there was a better method or something that could have prevented more of it. While yes, it wasn't that bad for us compared to many others, we still had a hell of a time.
Regular-Nebula6386@reddit
How can you plan for something like that, when our tools go rogue? I suggested to the IT Director that we introduce some resilience in our stack and move some of the Domain Controllers and critical infrastructure to SentinelOne, and I was laughed at and was the butt of the jokes for a while. But if CrowdStrike strikes again, we will be back to where we were when the outage happened.
Intrepid_Chard_3535@reddit
You can introduce staging. Like Defender, you can roll out these changes to non-critical systems first, then important, then critical. Look up how CrowdStrike deals with this, as they promised to design something similar after the testing failure.
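The shape of it is simple whatever tooling you use. Something like this sketch (Python; deploy_to and ring_is_healthy are placeholders for your actual deployment tooling and monitoring, and the ring names and soak time are made up):

```python
# Sketch of staged (ring-based) rollout gating: push to the least critical
# ring first, soak, check health, and halt the rollout the moment a ring
# looks unhealthy so later rings never receive the bad update.
import time

RINGS = ["canary-lab", "non-critical", "important", "critical"]
SOAK_MINUTES = 60   # how long a ring must stay healthy before promoting

def deploy_to(ring: str) -> None:
    print(f"deploying update to ring: {ring}")   # placeholder for real tooling

def ring_is_healthy(ring: str) -> bool:
    return True                                  # placeholder health check

def staged_rollout() -> bool:
    for ring in RINGS:
        deploy_to(ring)
        time.sleep(SOAK_MINUTES * 60)            # soak period
        if not ring_is_healthy(ring):
            print(f"halting rollout: {ring} unhealthy, later rings untouched")
            return False
    print("rollout completed across all rings")
    return True

# staged_rollout()  # wire this into your pipeline / update policy
```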
Black_Patriot@reddit
Sure, that's a feature now, after they broke everything, but you're still relying on CrowdStrike not to break stuff with a poor release. It's about evaluating failure modes.
In an ideal world you'd split your systems so you had a mix of vendors for your critical servers such that no single vendor could take them down (the same as having half your VMs on Azure and half on AWS, for instance). Of course that's a very expensive and complex solution for an event that is hopefully rare, so would probably only be tempting for the most important of infrastructure where uptime is critical.
Personally I'm relatively confident that CrowdStrike won't make a mistake like that again (they'd almost certainly collapse if it happened twice), but it is worth evaluating that kind of risk with every system.
Intrepid_Chard_3535@reddit
You contradict yourself by talking about an ideal situation. He was asking for a tip, I gave it to him.
nitallica@reddit
^ This. All of this!
Ok-Bill3318@reddit
Good luck getting devs to engineer in security 🤣
amphetkid@reddit
The other outcome is that the tools say 'this is not acceptable' and cause a bigger hole that is 'acceptable'.
We were told not to use SSH for GitLab because it would mean SSH was exposed to the outside world, but HTTPS is fine even though it provides weaker security for code repos...
_northernlights_@reddit
> Last week a dev accidentally deployed a test service
I see that as the root cause, more than the use of cloud in the first place. Are there devs just pushing things straight to prod?
baty0man_@reddit
With no peer review?
eVeRyThInG iS jUsT gReEn.
What a shitpost..
faajzor@reddit
Checking SBOM is just one part of it..
mycall@reddit
Have you considered using something like coderabbit.ai to do code reviews that might catch that test service's issue?
Otherwise, your APIs should have throttling; hammering prod is a listener configuration error, not a test service error.
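Throttling doesn't have to be fancy either; a token bucket in front of the listener is enough to stop a runaway test service from reaching the backend. Minimal sketch (Python, generic, not tied to any particular framework; the rates are example numbers):

```python
# Minimal token-bucket throttle: each client gets `rate` requests/sec with a
# burst allowance of `capacity`. Anything over that is rejected instead of
# being allowed to hammer the backend.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. 100 req/s steady state with bursts up to 200: a client firing 10,000
# requests in a tight loop gets almost all of them rejected at the edge.
bucket = TokenBucket(rate=100, capacity=200)
rejected = sum(0 if bucket.allow() else 1 for _ in range(10_000))
print(f"rejected {rejected} of 10000 requests")
```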
stupv@reddit
It's because cybersecurity posture is more focused on making sure 'bad things' don't happen, and less focused on 'good things' producing bad outcomes.
As with anything, it's easier to find bad guys than it is good guys who do bad things by accident
Sasataf12@reddit
You're talking about 2 different things.
Resiliency/reliability != security
The issues that I see are:
stacksmasher@reddit
Who cares. CrowdStrike is so much better than everything else it’s worth the risk. I know for a fact it blocks attacks daily.
cbtboss@reddit
This is why our pull requests to prod require two approvers at approved maintenance windows so someone can't just go rogue intentionally or accidentally.
1RedOne@reddit
Security monitoring isn’t the only thing. You should also have synthetics and be monitoring resource consumption
I work on a global web service deployed to 68 separate geos and we monitor things like that, consumption rates, response times, even cache availability
I will say it took tons of outages to get here
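If you've never set synthetics up, the simplest version is just an external probe hitting your endpoints on a schedule and recording status and latency. Rough sketch (Python stdlib only; the URL and SLO are placeholders, and in practice you'd run it from several regions and ship results to your monitoring system):

```python
# Rough sketch of a synthetic probe: hit an endpoint from outside, record
# latency and status, and flag breaches of a latency SLO.
import time
import urllib.request

ENDPOINT = "https://example.com/healthz"   # placeholder URL
LATENCY_SLO_MS = 500

def probe(url: str) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except Exception as exc:                # DNS failure, timeout, 5xx, ...
        print(f"PROBE FAILED: {url}: {exc}")
        return
    latency_ms = (time.monotonic() - start) * 1000
    level = "BREACH" if latency_ms > LATENCY_SLO_MS else "ok"
    print(f"{level}: {url} -> {status} in {latency_ms:.0f} ms")

probe(ENDPOINT)
```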
factchecker01@reddit
Was there another CS outage?
andrewsmd87@reddit
Can you provide details on how they are able to hit prod, like was it a public endpoint or something? Genuinely curious
alexandreracine@reddit
Yep, the thing is, I don't think there's a monitoring solution that will look at your whole structure, have you manually confirm that it is normal, and then act as master of the resources so it can flag a system and slow it down if something similar happens.
xenarthran_salesman@reddit
Recent, like, a year ago ?
dorradorrabirr@reddit
It's like that slowpoke meme
Motor-Pollution-947@reddit
You hit on something that keeps me up at night honestly. The crowdstrike thing was a perfect example of how our "secure by design" mindset falls apart when reality hits.
I've been building security tools for years and this exact problem is why I pivoted from traditional vuln scanning. You can have perfect SAST results, clean container images, all the compliance checkboxes ticked... but then at runtime your app starts doing completely unexpected things with data or making calls you never anticipated.
The gap between static analysis and runtime behavior is massive. Your example with the test service is spot on - it was probably "secure" from every traditional metric but caused real damage because nobody was watching what it actually did when it ran.
We built our hounddog.ai code scanner to catch some of this for data flows specifically because developers would write code that looked fine in review but at runtime would leak PII to logging services or send sensitive data to unexpected APIs. The transformations and variable passing made it impossible to catch without actually tracing execution paths.
gex80@reddit
recent? There was another one?
wrosecrans@reddit
A lot of people are gonna be almost outraged at the idea of "you should know what the computer programs are doing on the computer" because it's sort of regressive to think in terms of computer programs or individual computers rather than in terms of super high level orchestration concepts.
But at the end of the day, everything we do is with a computer program running on a computer. And yeah, knowing what it's doing goes a long way toward keeping from having it do stupid and dangerous stuff for no reason, and recognizing when it starts doing something different.
SikhGamer@reddit
I don't know why you expected CSPM/CS to catch a bug like this?
shimoheihei2@reddit
It's hardly a new problem. Executives listen to salespeople selling them perfect solutions, managers focus on spreadsheets showing that specific KPIs and metrics were met, and none of them have any technical knowledge or spend any time thinking "does this all actually make any sense?"
VexingRaven@reddit
Uh, do we? Pretty sure if you're not watching runtime you're about a decade behind the curve.
Dependent_House7077@reddit
haha, cloud what?
for every clever and all-covering security policy the universe will invent a more clever idiot.
there is always a piece of your infra that you implicitly trust and that will bite you back, no matter how well you prepare for it. next time someone will come up with yet another cpu exploit, find a flaw in the networking layer, server firmware, etc.
security is a process, and that process involves rolling with the punches sometimes. as long as you get up reasonably quickly.
vadavea@reddit
"security theater" is a thing, especially in big, highly-stovepiped organizations. They get to a place where environment is too complex to actually understand and mitigate risk, so they throw tools and bureaucracy at the problem instead.
DehydratedButTired@reddit
Who approved the change the dev pushed? That is where it should have been caught and where the blame lies. A CSPM is just going to catch the known things. Most environments are too complicated for any CSPM to catch everything.
JonMiller724@reddit
Is CI/CD being used to deploy to Staging, QA, and Production?
Developers should never be able to push to production.
They should be able to merge into the staging branch via a Pull Request that requires at least 2 people to review. After manual review and automated tests clear, they can Pull Request to QA/UAT; after QA/UAT sign-off, the developer can then Pull Request to production, with CI/CD handling the deployment after code review.
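Branch protection in GitHub/GitLab enforces most of this for you, but if you also want the pipeline itself to refuse a production deploy without the two approvals, the gate is only a few lines. Sketch (Python against the GitHub REST reviews endpoint; the org, repo, and environment variables are placeholders for whatever your pipeline exposes):

```python
# Sketch of a CI gate: refuse the production deploy step unless the pull
# request has at least two approving reviews. Uses the GitHub REST API
# (GET /repos/{owner}/{repo}/pulls/{number}/reviews); owner, repo, PR number
# and token handling are placeholders for your pipeline's own variables.
import json
import os
import sys
import urllib.request

OWNER, REPO = "example-org", "example-app"        # placeholders
PR_NUMBER = int(os.environ.get("PR_NUMBER", "1"))
TOKEN = os.environ["GITHUB_TOKEN"]
REQUIRED_APPROVALS = 2

url = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}/reviews"
req = urllib.request.Request(url, headers={
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
})
with urllib.request.urlopen(req) as resp:
    reviews = json.load(resp)

# Reviews come back in chronological order; keep only the latest state per
# reviewer, then count how many are currently APPROVED.
latest = {}
for review in reviews:
    latest[review["user"]["login"]] = review["state"]
approvals = sum(1 for state in latest.values() if state == "APPROVED")

if approvals < REQUIRED_APPROVALS:
    sys.exit(f"blocking deploy: only {approvals}/{REQUIRED_APPROVALS} approvals")
print(f"deploy allowed: {approvals} approvals")
```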
Generico300@reddit
If you just set up a bunch of tests and then use the pass/fail status of those tests to determine the status of anything, you are placing 100% faith in the idea that your tests are not only correct, but complete. They never are.
Never forget that metrics are statistics, and statistics are bullshit.
PrettyFlyForITguy@reddit
Security software, in general, has never been very good at stopping well-crafted threats. Case in point: I can install an RMM giving me full control and do a number of things that disable and wipe the system... never so much as triggering an alert.
It's really the basic stuff that does 98% of the work. Proper privileges, limiting rights, use of application whitelisting, etc.
Anything that detects threats, whether it be at the network level or application level has always done a poor job. It can block 95/100 threats, but it only takes one failure to cause issues. Trying to detect threats is a losing battle because, as you found out, it has to actually know the action is malicious. It is too error prone to rely on.
Prevention measures are always more effective.