Just exited a meeting with Crowdstrike. You can remediate all of your endpoints from the cloud.
Posted by kuahara@reddit | sysadmin | 590 comments
If you're thinking, "That's impossible. How?", this was also the first question I asked and they gave a reasonable answer.
To be effective, CrowdStrike services are loaded very early in the boot process and they communicate directly with CrowdStrike. This communication is used to tell CrowdStrike to quarantine windows\system32\drivers\crowdstrike\c-00000291*
To do this, you must opt in (silly, I know since you didn't have to opt into getting wrecked) by submitting a request via the support portal, providing your CID(s), and requesting to be included in cloud remediation.
They stated that sometimes the boot process does complete too quickly for the client to get the update and a 2nd or 3rd try is needed, but it is working for nearly all the users. At the time of the meeting, they'd remediated more than 500,000 endpoints.
It was advised to use a wired connection instead of wifi, as wifi-connected users have the most frequent trouble.
This also works with all your home/remote users as all they need is an internet connection. It won't matter that they are not VPN'd into your networks first.
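For anyone curious what the quarantine boils down to on disk, it's the same channel files admins were deleting by hand. A rough, illustrative sketch of that file operation only (the real fix is issued by the sensor from CrowdStrike's cloud; the path follows their public guidance, the script itself is hypothetical):

```python
# Hedged sketch: the local file operation the cloud fix amounts to, per the
# description above. CrowdStrike's actual remediation is issued by the sensor
# from their cloud; this only illustrates quarantining the bad channel files,
# run as local admin from a working boot (e.g. Safe Mode).
import glob
import os
import shutil

CS_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
QUARANTINE_DIR = r"C:\Quarantine\CrowdStrike"  # hypothetical destination

def quarantine_channel_291():
    os.makedirs(QUARANTINE_DIR, exist_ok=True)
    moved = []
    for path in glob.glob(os.path.join(CS_DIR, "C-00000291*")):
        dest = os.path.join(QUARANTINE_DIR, os.path.basename(path))
        shutil.move(path, dest)  # move rather than delete, so it can be restored
        moved.append(dest)
    return moved

if __name__ == "__main__":
    for f in quarantine_channel_291():
        print("quarantined:", f)
```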
thortgot@reddit
Any reasoning on why this is opt in?
kuahara@reddit (OP)
They said for legal reasons...I tried not to laugh.
If someone shoots me and then provides unauthorized aid, the unauthorized aid is not what I'll be suing for.
edgeofenlightenment@reddit
My theory is that if their customers' systems came back up without notice, 98% of the customers would be thrilled, and 2% would find that their systems came up in the wrong order, or came up in an unsupported configuration or without staff in the right places for audit-compliant monitoring, and those customers would try to pin any resulting issues on Crowdstrike as a breach of the contracts that detail very precisely how Crowdstrike software is to be updated in their environments.
-_G__-@reddit
Heavily government regulated (multiple jurisdictions) customer environment here. Without going into details, you're on the right track with the 2% notion.
RogerThornhill79@reddit
One would also assume their response to customers was also heavily government regulated. It's not a bug, it's a feature.
b_digital@reddit
can you cite a law or are you just doing the libertarian neckbeard thing?
RogerThornhill79@reddit
looking at the illegitimate government pushing for nuclear war as we speak and Biden's not even in charge of his own colostomy bag... Libertarian... says the jackboot fascist.
jjwhitaker@reddit
There's very little upside and a lot of potential lawsuit-like cons to another automatic rollout. If anything it would clearly violate any remaining trust, even if it's announced before being deployed by default.
CRWD is playing it close to the chest here and allowing people a 1 hour or so fix, if you have your ducks in a row.
At a minimum a company can request their workstations and critical infra be included and get core services running faster vs manual fixes against possibly thousands of servers.
Plus we're admins. If you told me I could script a solution against every endpoint by filing a ticket with a spreadsheet or list attached then do reboots I'd be doing that first too.
fireuzer@reddit
Perhaps, but that being enabled for the account doesn't necessitate an automatic restart. It would simply dictate the behavior of the subsequent reboot.
Automatic_Ad1336@reddit
It doesn't need any extra legalese. It's written consent vs. no opportunity to opt out. Very different levels of acceptance by the customer.
catwiesel@reddit
I bet to opt in you have to wave any and all rights to sue them, ask them for money, end the contract sooner, heck, you won't even talk bad about them or ask them to apologise; in fact, you admit that it's your own damn fault, and that you will give them your first born and second born should they ask it of you.
yeah, right, it's for legal reasons. all of them good for them, and none of them good for the impacted customer.
IANAL, and I did not check. but that's what my cynic heart is feeling until I get solid proof otherwise.
ALSO... repairing a bsod-ing machine via remote update. that's... I guess maybe not entirely impossible, but that's a very big claim to make. I hope it works out, but I am sceptical unless it's shown working en masse.
ShepRat@reddit
They can put whatever they want in this disclaimer, but I doubt they'd bother cause the lawyers know it'd be invalidated in the first 2 seconds in front of a judge.
DarthPneumono@reddit
Everyone still puts 'void if removed' stickers on too. It stops some people, and that's worth it for a few lines of text.
ShepRat@reddit
Depends on the Jurisdiction I guess. In many places they can leave themselves open to fines and/or legal action by misleading customers about their rights.
skankboy@reddit
-Wave
SnipesySpecial@reddit
So ransomware?
Nuggetdicks@reddit
He said they did 500K in 1 hour.
magistratemagic@reddit
nice reading comprehension, /u/Nuggetdicks
catwiesel@reddit
very surprising
kuahara@reddit (OP)
No, I did not say that. I said that as of the time of the meeting, they had already used it to remediate 500k computers (spread across multiple agencies who had opted in).
I also said that wait time to opt in is about 1 hour or less.
I've never indicated that anyone can remediate 500k machines in 1 hour.
UncleGrimm@reddit
I can confirm this is working at our org. Not on 100% of systems but it just got us down to a few hundred left, had over 50k initially
Fresh_Dog4602@reddit
So how does this system of theirs work, then? Because this is a sort of remote kill-switch or whatever it is they do. So it was always there to begin with?
UncleGrimm@reddit
My understanding is- it’s basically issuing a threat command from their cloud to quarantine the file. They couldn’t roll this out immediately because the BSOD almost always won the race condition, so over the weekend, they reconfigured and relocated a bunch of their servers to make it more likely that the BSOD loses the race condition.
PlannedObsolescence_@reddit
Can you explain further how you came to that understanding? Did you get info from someone internally at Crowdstrike?
crankyinfosec@reddit
Ya, this doesn't make sense; it's purely up to agent logic to pull the threat command to quarantine the channel file, and then it's off to the race conditions! They should just be able to issue the command via APIs to all affected endpoints. There shouldn't be any "reconfiguring and relocating of servers." This sounds like more FUD about why this wasn't done Friday. My guess is they finally figured out this was possible by looking at what actions happened at what time and realized this may beat the crash.
UncleGrimm@reddit
If that were true (that the race condition only ever happens after the agent pull), then this solution shouldn’t require multiple reboots. But it can
crankyinfosec@reddit
Given my experience in the AV industry, there are likely two threads or processes spawned and concurrently working.
The remediation function is likely waiting for the network, which can take a variable amount of time to fully initialize. And depending on how network availability is detected, there may be a variable amount of time before it reaches out to the CS servers to fetch the list of threats to remediate. And then there is the remediation itself, which takes time and is IO dependent (given most machines are on SSDs / NVMe devices this should be the least of the issues).
While all that is happening, the kernel driver is likely being loaded, and depending on the loading order of others that preempt it, this may take longer or shorter; then it has to read all the def files off disk before it gets to the bugged one. This would all lead to the inherent race condition and how system dependent it may be, and why there may be situations where one option hits near 100% of the time.
But them 'reconfiguring and relocating servers' makes no sense since this would be driven by agent logic.
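A toy model of that race, with completely made-up timings, just to show why wired connections and repeat reboots change the odds:

```python
# Toy model of the race described above, with entirely hypothetical timings.
# Two activities start at boot: the remediation path (wait for network, fetch
# the quarantine command, remove the bad channel file) and the driver path
# (load, read channel files, hit the bugged one and crash). Whichever finishes
# first "wins" that boot.
import random

def race_once():
    network_up = random.uniform(2.0, 15.0)         # Wi-Fi tends to be slower here
    fetch_and_quarantine = network_up + random.uniform(0.5, 3.0)
    driver_reads_bad_file = random.uniform(3.0, 6.0)
    return fetch_and_quarantine < driver_reads_bad_file  # True = fix beat the BSOD

trials = 10_000
wins = sum(race_once() for _ in range(trials))
print(f"fix beat the BSOD in {wins / trials:.1%} of simulated boots")
```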
advanceyourself@reddit
My thoughts on the server side is that they are not waiting for agent logic outside of it showing that it's "Online". I'm guessing the agent function for loading upgrades/updates is later in the loading sequence. They are probably forcibly pushing changes once the client is opted in and connection speed/latency would certainly make a difference in that case. They may also be repurposing resources given the impact. The infrastructure for traditional update/upgrade infrastructure probably wasn't sufficient.
The black hat side of my brain thought about how devastating this function would be in the wrong hands. Let's hope their interns are better than Solarwinds.
UncleGrimm@reddit
I don’t think we’re in disagreement then? It’s winning the race because they reduced the time it takes for that to complete.
It certainly does when, while trying to deliver the automated solution, they are likely experiencing huge log query volumes from in-progress manual remediations.
PlannedObsolescence_@reddit
There definitely could be an element of truth to the 'reconfigure servers' thing, I haven't been impacted by the CS issue so haven't actually been hands on with a computer. But if the race condition between the BSOD and the agent calling home for commands could be 'won' more often by just a few milliseconds quicker of a response - or if the agent was already talking to the servers, just they were not prioritising sending the (update agent or quarantine file) command instantly, then I can definitely understand how changing the way that communication works could help things. But really I have no idea if any changes happened related to that.
From what OP and others in this thread said, the 'fix' you get opted into is for them to send the command to quarantine specific parts of the agent to force it to repair itself.
UncleGrimm@reddit
Correct, that’s my understanding of the process.
That's also why MS published "reboot up to 15 times" as a potential fix pretty early on. That's not a magic number used by Windows or anything; the agent had a potential (but very slim) chance to win the race and pull down the update.
By sending the quarantine command from the cloud server, the network wait is probably reduced significantly versus trying to pull the fixed content file.
Latency makes a lot of sense to me here.
Fresh_Dog4602@reddit
oh man... just realized... all those companies with radius authentication probably going ffffffuuuuuuuu as this would delay the networking process (if it even can complete at that point unless you use stuff like MAB or something)
UncleGrimm@reddit
I work on the cloud team at a Big Tech that helped coordinate some of the response for this since CS is one of our partners (you can probably guess who).
We were directly informed of an automated solution undergoing experiments over the weekend, and the race-condition was something they were honing in on. Whether that’s definitively how this final version works, I don’t know that for sure, but it definitely tracks.
PlannedObsolescence_@reddit
The parent comment that /u/Fresh_Dog4602 replied to is now deleted, it originally said:
nartak@reddit
Probably a billing killswitch for customers that don't pay.
Either way sounds like a MitM attack waiting to happen.
catwiesel@reddit
amazing
KaitRaven@reddit
This is presumably outside their normal procedures.
If they're going to make any atypical changes on your system, then yes it makes sense to get your approval first
SimonGn@reddit
As opposed to putting their customers in a boot loop being part of their Normal Operating Procedures?
DOUBLEBARRELASSFUCK@reddit
I haven't read all of the write ups on this yet, but I believe that may have been unintentional.
SimonGn@reddit
Intent does not matter. They messed up without approval but need approval to undo their mistake? Makes no sense
DOUBLEBARRELASSFUCK@reddit
They messed up without approval because you can't possibly ask for approval before fucking up. If they knew they were going to fuck up, they wouldn't have asked for permission, they would have not fucked up.
Imagine you paid someone to tile your bathroom. They come in, use the wrong color tile, then leave. They realize after the fact that they've used the wrong tile. Do you expect them to crawl in through a window in the middle of the night to fix it, or ask you when and if they should come and fix it?
SimonGn@reddit
This is like a tiler doing their job of tiling; partway through they fuck up with the wrong tile, and instead of picking up the wrong tile and laying the correct one they say "oops, I put down the wrong tile, you have to fix it." Then when you fix it, or are partway through, they say "actually I can fix it but I need your permission," even though they were standing there with full access to jump in at any time.
KaitRaven@reddit
The effect was abnormal, but the channel update process was SOP.
DrMartinVonNostrand@reddit
Situation Normal: All Fucked Up
TammyK@reddit
As of this morning it no longer appears to be opt-in. This fix was automatically deployed to all of our devices as of this morning with no notification. I do understand why they pushed it, but forcing another change without proper notification after what just happened is kind of insane.
BeilFarmstrong@reddit
I wonder if it temporarily puts the computer in a more vulnerable state (even if only for a few minutes). So they're covering their butts for that.
ThatDistantStar@reddit
Highly likely. Hell, the Windows firewall might not even be up that early in the boot process
fireuzer@reddit
Even if that's the case, the computer isn't more vulnerable because the CIDs were shared. It's been equally vulnerable ever since the software was installed because of how they wrote the software.
tacotacotacorock@reddit
If that's happening you have a boot sector virus. Which I could see crowd strike mimicking but in a helpful way not maliciously.
ambient_whooshing@reddit
Finally a meaningful reason for the macOS kext to system extension changes.
DOUBLEBARRELASSFUCK@reddit
No, it's not highly likely. If the network comes up for a period of time before the firewall, that's a Microsoft issue, and it's a massive oversight. That would be an attack vector even without CrowdStrike.
tacotacotacorock@reddit
Seems like they're taking advantage of a classic boot sector virus infiltration and basically making their software act similarly, but in your favor. I have not dived very deep into this, but that's exactly what it sounds like to me. The computer is no more vulnerable than it would be to a boot sector virus in the first place, other than that CrowdStrike should prevent those things.
KaitRaven@reddit
This is taking advantage of existing functionality. It's not like they could push out a patch to the sensor agent in this situation.
loopi3@reddit
People have gotten sued for providing life saving first aid by the recipient of said aid. So…
pauliewobbles@reddit
The cynic in me wonders: if you opt in, then later attempt to pursue costs and damages, will your opting in to this remediation be used as a defence to absolve them of any wrongdoing?
"Yes, your system failure was due to a technical error, but as clearly shown it was rectified in a timely manner following your written indication to opt-in.
And No, any delay in providing a fix after the incident originally happened is entirely down to whatever date/time you chose to opt-in, since no-one can force anyone to opt-in to a readily available remediation as a matter of priority."
peoplepersonmanguy@reddit
Even if the opt-in waives rights, there's no way it would stand up, as the damage was done prior to the agreement.
DOUBLEBARRELASSFUCK@reddit
That's not really relevant. You can waive rights after the fact. The issue would be duress. "You signed away your rights to sue while your entire infrastructure was down and your business was in danger." That probably wouldn't hold up.
BondedTVirus@reddit
Almost like... Ransomware. 🫠
reegz@reddit
It’s more because it’s an attack vector into machines. Wait a few months, the papers will come out. This has been available to some customers prior to last Friday.
Vangoon79@reddit
I didn't 'opt in' for damages. Why do I have to 'opt in' for repairs?
I wonder how long before this company burns to the ground.
100GbE@reddit
Very constructive, thanks for sharing.
newaccountzuerich@reddit
The opinion that Crowdstrike should die as a company is entirely valid, and one that I entirely subscribe to.
When a company refuses to heed the warning signals that a previous outage clearly exposed (June 27th iirc), doesn't change their processes, and then commits three cardinal sins of administration (Untested code to Prod; push to all endpoints simultaneously; push on a Friday), then the company needs to not be in business, and those running it need to lose their jobs for incompetence and malfeasance.
The one you replied to has a truth of it.
UncleGrimm@reddit
For having such a strong opinion, have you even been following this story?
It wasn’t a code-change, it wasn’t untested, and a week ago you would’ve been laughed out of a room for suggesting that 0day signatures of active threats be slow-rolled while the threats are currently active. The consensus was pretty strong that that’s bad practice to give customers control over.
Seems like their CICD process corrupted the file after it had passed the QA steps. No grand conspiracy here. Every Cloud company in existence has either torpedo’d worldwide DNS for several hours, or customer data gets corrupted in an outage and you figure out your backups don’t work… Just seems like the same growing pains we’ve all been through to me.
At-M@reddit
well, other people think different
no clue how good "the mirror" is as a source, but I can't find the other article i was looking for
PC_3@reddit
I just ran into legal in the kitchen. He believes it's because if you want the fast fix, you will waive your rights to sue them for the down time.
UncleGrimm@reddit
It’s silly but imo it’s a good idea on their part. I work for a Big Tech and the word is that they’ve been working closely with Microsoft to triple-validate all the proposed fixes.
BruschiOnTap@reddit
Probably for the same reason that got them into this mess in the first place.
caffeinatedhamster@reddit
I had a call with them this morning about this exact same process and the reason for the opt-in is because they are in a code freeze right now (engineer didn't say how long that would last) due to the shitshow on Friday. Because of that code freeze, customers have to opt-in to allow their team to deploy the change to your CID.
broknbottle@reddit
Lol what a crock of shit. It’s not like some external entity forced them to do a code freeze. Must be nice to push out a shit update, immediately declare a code freeze and then use the excuse, sorry we’d love to auto opt-in but we’re in a code freeze at the moment…
flatvaaskaas@reddit
Yeah, but on the other hand, if they keep pushing updates while their update caused this chaos... that would also be frowned upon. People don't trust CrowdStrike right now with update rollouts, so pausing them would make sense.
Fresh_Dog4602@reddit
well because .... at that point you are giving ring 0 of your operating system access to their servers via the network stack... lol is that even possible... wtf....
mindracer@reddit
For this to work it means their software already communicates with their servers at boot time, opt in or not
Fresh_Dog4602@reddit
yup
TrueStoriesIpromise@reddit
That's what you're buying with Crowdstrike or SentinelOne or any other cloud-based antivirus solution.
Fresh_Dog4602@reddit
sentinelone doesn't go that deep into your system like crowdstrike does
YummyBearHemorrhoids@reddit
Every EDR software worth their weight in dog shit does kernel level operations. Otherwise any type of malware that gets kernel access could hide indefinitely from the EDR software.
broknbottle@reddit
CrowdCrap is definitely not running as a kext on macOS with Apple silicon. Apple told all the worthless snake oil vendors to get the fuck out and forced their junk ware back to user space where it belongs
thortgot@reddit
SentinelOne (and every other EDR) has kernel drivers. Article Detail (n-able.com)
cowbutt6@reddit
The difference is that SentinelOne's equivalent of CrowdStrike's channel updates, Live Updates, is a) opt-in, and b) implemented in user space only.
thortgot@reddit
I'm not familiar with their architecture, so I'll assume you are right. But there are still edge conditions that could occur.
Their ELAM driver (same as CS's) does pull definitions from dynamic files that are not WHQL driver certified.
cowbutt6@reddit
If that's the case with SentinelOne, then my understanding is that those dynamic files are part of the sensor distribution, and don't change unless you upgrade/downgrade the sensor to a different version. Which you should have tested first, of course.
thortgot@reddit
I don't believe that is correct. Their architecture is quite similar to CS with a split between sensor (agent) and definition (channel) with real time intelligence
My point is that you still have "uncertified" driver activity occurring in the kernel at a bootstart level.
If they allow for definition update rings then it would mitigate much of the risk but I haven't used the platform in quite a while.
UncleGrimm@reddit
S1 sits at the kernel level as well. Otherwise these solutions would be pretty useless against kernel-level malware.
They do have a version that doesn’t, but if you’ve ever red team’d it, I don’t know why you’d pay all that money and run it that way, becomes pretty easy to bypass and S1 is already iffy on preventing lateral movement to begin with
Fresh_Dog4602@reddit
Ah indeed. I've only seen their version on some OT systems. That might explain it.
UncleGrimm@reddit
Most likely yeah. Most businesses aren’t getting 0days burned on them (EDR solutions cripple them insanely quickly compared to 10 years ago), but if you’re an F100 or public sector, you kinda need to run this stuff at the kernel because malware could also be running there. They get targeted with all the fancy stuff + foreign actor malware.
thortgot@reddit
That's literally how their product works.
Fresh_Dog4602@reddit
Yes but no. Having kernel access didn't necessarily mean they already had it up along with the network stack, or even made use of it at that point. Because that means they could've fixed this already since Friday if it was that easy.
thortgot@reddit
That's the way the "15 reboot" method was functioning which users were reporting was working. A bit of luck of the draw/incremental progress.
I don't imagine it was easy to optimize the stack to increase the odds.
Fresh_Dog4602@reddit
i"ve seen the "15 reboot" method pass by. I've seen many ppl saying it doesn't work. But mileage may vary i guess
thortgot@reddit
Depends on how quickly the driver is crashing versus how long your network stack takes to connect.
I had one company that it worked pretty well for but not for several others I was helping.
KaitRaven@reddit
Crowdstrike already had that
jmbpiano@reddit
If they didn't already have that, this remediation wouldn't work, opted-in or not.
Seastep@reddit
Bit of an IT luddite but from my interpretation of that, connecting a system to a cloud provided service ON BOOT sounds like some space age shit I never thought I'd see.
qejfjfiemd@reddit
Super useful now we’ve finished manually fixing them all
nyhtml@reddit
My boss treated me to a steak dinner for fixing 2000 computers by Monday morning. I got woken up the Friday morning (we were closed) and at first, I thought it was Monday and that I had overslept only to realize what happened listening to the news.
I'm still waiting for that $10 UberEats voucher to arrive.
msalerno1965@reddit
IKR, what the F day is today? LOL
I gotta wonder how many box-seat tickets were handed out by sales reps... some of the "it's OK, it's fine..." comments all over social media are indicative of ... something odd. The apologists are out in force. Maybe they're just stock holders trying to keep their last $.02.
Also, remark to OP: the use of the word "silly" is ... silly. That is, if you're slap-happy silly from working non-stop since Friday.
hercelf@reddit
Yeah, I'm surprised I don't see any more comments - it was such a high impact thing because it couldn't be automatically remediated, and now it turns out there was a way after all? Even a worse look for Crowdstrike in my book...
darcon12@reddit
I mean, they are the top cybersecurity company in the world, and it takes them 4 days to figure out they can trigger a quarantine of the file and fix it remotely? Give me a break.
sol217@reddit
For real. They were already at the top of my shit list and they managed to move up the list even higher.
ItsWhomToYou@reddit
That’s the problem, getting remote users to hardwire to their personal network is virtually unheard of in today’s landscape.
Most people unless they have some endeavors into networking for personal use have no clues what an “Ethernet” cable is. It’s the equivalent using a fax machine at this point lol.
gjack905@reddit
Well, try WiFi first anyway then. All they said is it's more likely to struggle than a wired connection.
ItsWhomToYou@reddit
That’s fine I get that, and it was like 10% success rate that way in my experience, which is subjective ofc. But regardless, my point was just that NIC is disabled in safemode so the only way to remote in was to have a user hard wire lol sucked eggs.
gjack905@reddit
Oh, when you said "that's the problem" it implied that you thought Ethernet was the only prescribed way to deploy the fix
ItsWhomToYou@reddit
The boot process doesn’t give the device time to begin networking wirelessly
Dramatic_Proposal683@reddit
If accurate, that’s a huge improvement over manual intervention
HamiltonFAI@reddit
Also kind of scary they can access the systems pre OS boot?
Travelbuds710@reddit
I was worried about the same thing. Glad for a resolution, but it's a bit worrisome they have that much access and control over our OS. But a little late for me, since I personally fixed over 200 PC's, and already had to give our local admin password to remote users.
damiankw@reddit
You share a local admin password between computers?
AwesomeGuyNamedMatt@reddit
Time to look into LAPS my guy.
thruandthruproblems@reddit
LAPS is dead long live SLAPS. Also, funner to say.
Aggravating_Refuse89@reddit
LAPS is slapped if AD is bootlooped
thruandthruproblems@reddit
Hey, that's why you shouldn't have ANY AV/EDR on your DCs. Just ride life on the wild side!
Aggravating_Refuse89@reddit
You get to decide that? In my world those are not my decisions. AV on EVERYTHING, no exceptions
thruandthruproblems@reddit
Read that with an /s
Aggravating_Refuse89@reddit
But I do agree
Unable-Entrance3110@reddit
I thought the new LAPS was called "Windows LAPS"
The only reference to SLAPS that I could find was some random Github project by that name
thruandthruproblems@reddit
The S stands for serverless. Entra ID (S)LAPS is the replacement for on prem attached LAPS.
Unable-Entrance3110@reddit
First I have heard it called that. Microsoft appears to call it Windows LAPS. There is no mention of Serverless LAPS on their documentation page.
https://learn.microsoft.com/en-us/windows-server/identity/laps/laps-overview
thruandthruproblems@reddit
What server are you installing your entra ID driven solution on?
BattleEfficient2471@reddit
None, MS already installed Azure ID on their servers.
It's not serverless, you just aren't in control of the server running it.
thruandthruproblems@reddit
Which means for you it's serverless.
BattleEfficient2471@reddit
No, for me it means I am now stuck depending on servers I don't control and have no ability to secure.
For us oldsters we remember this all before, it's just renting time on mainframes all over again.
thruandthruproblems@reddit
I was there when the deep magic of 3.1 was written. I remember the magic of server 2000.
BattleEfficient2471@reddit
"Magic" for an OS that still can't delete an open file, sure.
Either way it's the wheel of computing. We will see it turn once again.
charleswj@reddit
How can he slaps?
thruandthruproblems@reddit
Lmao!
RogerThornhill79@reddit
Hoping he means desktop admin rights and not the system admin account. Fingers crossed. Please dont make it so.
getoutofthecity@reddit
He said local admin password, pretty clear to me he meant that he gave out the local Administrator credential for all the computers.
charleswj@reddit
What are those terms? Do you mean local admin vs domain admin?
RogerThornhill79@reddit
you don't give out local - unless it's to other administrators. and no, it's not domain admin level. it's a desktop admin level used to administer end user devices that require higher privs
MuchFox2383@reddit
This is certainly a post of all time
charleswj@reddit
You're describing local admin. Local admins can fix one of these broken machines. Without local admin, they can't.
IHaveTeaForDinner@reddit
It's literally a kernel level driver. You can't get much more access.
Odd-Information-3638@reddit
It's a kernel-level driver, but the reason we can fix this is that when you boot into safe mode it's not loaded. If this is able to apply a fix prior to it blue screening then it has much earlier access, which is good because it's an automated fix for affected devices, but worrying because if they fuck it up again, what damage will it do, and will we even be able to fix it?
DreamLanky1120@reddit
They have access as soon as their driver loads, so as long as their driver connects to them before loading the corrupt configuration file, all is well. I'm still surprised that not everyone in IT has grasped this; nowadays every gamer knows about this because they use kernel drivers for anticheat, which is fucking bananas.
IHaveTeaForDinner@reddit
Yeah there are many fuck ups here. Microsoft are not without blame. If a kernel level driver prevents boot, why isn't it disabled so Windows can boot into safe mode with a big warning saying such-and-such prevented a proper boot?
McFestus@reddit
How would windows know what driver is causing the issue if windows can't boot? Windows doesn't fully exist at the time the issue occurs.
Rand_alThor_@reddit
The Linux kernel handles it just fine. It crashes the same way pre-boot, but the Linux kernel handled it
ultradip@reddit
Ahem... Crowdstrike DID affect linux users, a few months ago. It just wasn't as newsworthy.
National_Summer927@reddit
Not the point being made here
National_Summer927@reddit
The Kernel panic'd, the kernel knows everything that failed
narcissisadmin@reddit
Okay, then why the fuck does Microsoft have to make it such a PITA to get into recovery mode?
IHaveTeaForDinner@reddit
Alright the kernel then, you can't tell me it would be impossible for the kernel to keep track of what crashes the system.
shleam@reddit
Crowdstrike intentionally configures its kernel hooks as a “boot-start” driver. The OS boot loader will load these essential drivers on boot-up and the kernel does not have control until after this happens.
This is for the obvious reason that you want to protect the system before any malware loads; malware loading before Falcon could make changes or install rootkits that would be able to hide from detection.
https://learn.microsoft.com/en-us/windows-hardware/drivers/install/specifying-driver-load-order
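If you want to see this on your own box, a driver's start type is visible in the registry. A hedged sketch (it assumes the sensor service is named CSAgent, matching the csagent.sys mentioned elsewhere in this thread; treat the name as an assumption):

```python
# Hedged sketch (Windows only): read a service's Start value from the registry
# to see whether it's a boot-start driver (0 = BOOT_START, loaded by the OS
# boot loader before anything else gets control).
import winreg

START_TYPES = {0: "BOOT_START", 1: "SYSTEM_START", 2: "AUTO_START",
               3: "DEMAND_START", 4: "DISABLED"}

def start_type(service: str) -> str:
    key_path = rf"SYSTEM\CurrentControlSet\Services\{service}"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _ = winreg.QueryValueEx(key, "Start")
    return START_TYPES.get(value, f"unknown ({value})")

if __name__ == "__main__":
    # "CSAgent" is assumed to be the CrowdStrike sensor service name.
    print("CSAgent start type:", start_type("CSAgent"))
```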
Unusual_Onion_983@reddit
Correct answer here.
McFestus@reddit
I mean, the kernel is kinda the core of Windows; it's what the boot sequence is loading. But the AV is going to be basically the first thing to initialize, because if other stuff could initialize first, it could be 'infected' with a virus and stop the AV from loading. So while obviously I don't know the lowest-level details of the Windows kernel's boot sequence, I would bet that the AV is one of the very first things to load in.
SomewhatHungover@reddit
It's marked as a 'boot start driver', there's a good explanation in this video, and it kind of makes sense as a well crafted malware could prevent crowdstrike from running if it could just make it crash, then the malware would be free to encrypt/steal your data.
TheDisapprovingBrit@reddit
Because kernel level literally means it can do anything. Windows can gracefully kill any userspace-level app if it starts doing weird shit, but with kernel level, you've literally told Windows it's allowed to do whatever it wants. At that point, Windows' only defence if that app starts misbehaving is to blue screen.
Also, "letting Windows boot into safe mode with a big warning saying so" is EXACTLY what it did.
ExaminationFast5012@reddit
This one was a bit different to the others: yes, it's a kernel-level driver and it needs to be WHQL certified. The issue is that CrowdStrike found a loophole where they could provide updates to the driver without having to go through WHQL every time.
Pitisukhaisbest@reddit
The bug must have been there in what was certified right? It must be some kind of input in those C-00*.sys files, which they say aren't drivers, which crashed the main csagent.sys?
WHQL clearly needs some improving.
cjpack@reddit
It was a .dat file that got mislabeled as a system file and should never even have been at the kernel level to begin with, since it's a configuration file. The problem wasn't fucking up the food but mixing up the orders, and one of those orders had shrimp and the person is allergic.
Mr_ToDo@reddit
Shockingly it looks like that's actually wrong. I was going through some of the boot start driver documentation and found that signature stuff like they have seems to be fine
https://learn.microsoft.com/en-us/windows-hardware/drivers/install/elam-driver-requirements
Sure, the whole execution-as-signature thing seems to be more than a bit of a stretch for what it's intended to do (although I'm also trusting random internet comments on what it's actually doing here), but it's still an intended mechanic of the early launch anti-malware driver stuff that Microsoft made (put in a consistent location, preferably signed, that sort of thing). Sure, when the system was put in place it was back when AV really was pretty much all signature-based, but a lot of modern ones just don't work that way (or not just that way, anyway), and that kind of leaves this in a weird place where you're putting something in place that really shouldn't be there, but Microsoft hasn't put a validation process in place to handle it any other way (the full driver validation is much too slow).
The part that I've been racking my head over is the crash recovery. Drivers, including ELAM ones like theirs, allow for last-known-good drivers to be launched, and reading through the documentation I'm not sure if that covers the signatures (I'm thinking it doesn't, and if it did it might only be for corrupt files anyway, I'm not sure).
But the point is, I think that people may be getting angry over the wrong things. In my opinion it should probably just be a driver that wasn't written well enough, maybe poor testing, and definitely the lack of deployment/staging options for definitions in addition to those two.
I was also surprised at the 128KB size limit, and assumed that would be a big problem and might be a reason the code would be lean to the point of being buggy, but checking my computer with SentinelOne the backup ELAM file is 17KB, so I guess it isn't that big a deal (makes you wonder why some of our device drivers are so freaking bloated though, eh?)
SomewhatHungover@reddit
It's marked as a 'boot start driver', there's a good explanation in this video, and it kind of makes sense as a well crafted malware could prevent crowdstrike from running if it could just make it crash, then the malware would be free to encrypt/steal your data.
IHaveTeaForDinner@reddit
Interesting! Thanks.
OptimalCynic@reddit
Exactly this!
cjpack@reddit
We need to move away from end-to-end cybersecurity needing to exist in the kernel to work, and have it be user level with kernel-level access; maybe add a quick debugging step outside the kernel to go "heyyy, this is a .dat file, not a .sys, let me correct that" before dropping it into the system files folder and bricking everyone's machines. Idk though if this is how it'd work; I just read there are some startups specifically claiming to solve this issue and VCs are funding them. If it can be as secure and effective as something like CrowdStrike but way less risk without existing at kernel level, then they will probably be worth investing in.
Coffee_Ops@reddit
Kernel level is only ring 0. Can't get into VTL1 with only that.
aheartworthbreaking@reddit
No LAPS?
Ok-Boysenberry6782@reddit
You have a single local admin password?!?!
sssRealm@reddit
To protect against all types of malware it needs to be embedded into kernel mode of the operating system. It basically gives them the keys to the kingdom. Anti-virus vendors need to be as trustworthy as operating system vendors.
HalKitzmiller@reddit
Imagine if this had been McAfee.
Dzov@reddit
Crowdstrike CEO was McAfee’s CTO.
TheEndDaysAreNow@reddit
And his programming crew at McAfee followed him over, warts and all. Remember how McAfee used to brick things?
Dzov@reddit
I’m shocked anyone would use their software.
TheEndDaysAreNow@reddit
Well, it had a new name. That should have fixed it /s
JBD_IT@reddit
Sounds like the board might be looking for a new CEO lol
Texkonc@reddit
CosmicMiru@reddit
The government uses McAfee (now Trellix) so they are trustworthy enough supposedly
Throwaway4philly1@reddit
Doesn't the govt have to use the lowest bid?
Moontoya@reddit
or kaspersky
(ZoneAlarm managed something similar in 2005 - a freebie software firewall that... after a brain file update, stopped _all_ traffic to and from the pc.
that was a fun coupla days @ 2wire)
RogerThornhill79@reddit
Tech companies are about as trustworthy as Ambulance chasing Lawyers who are now elected. Concerning.
kirashi3@reddit
I mean, if you didn't verify the code was secure before compiling from source, is there technically any way to actually trust the code? 🤔 To be clear, I'm not wearing a tinfoil hat here - just being realistic about how trust actually works in many industries, including technology.
circuit_breaker@reddit
Ken Thompson's "Reflections on Trusting Trust" paper, mmm yes
kirashi3@reddit
Hmmm idk if I trust that one... 😄
justjanne@reddit
You can't bolt protection on after the fact.
If you wanted a truly secure system, require all applications to be signed, maintain a whitelist of signed applications and enforce strict sandboxing for all of them.
Anti virus software is just checklist-driven digital homeopathy.
BattleEfficient2471@reddit
And it appears in this case both are not.
Crowdstrike just proved they weren't.
RogerThornhill79@reddit
DGC_David@reddit
The funny thing is, it did a little...
TheEndDaysAreNow@reddit
Not at all. You can fully trust them. /s
sagewah@reddit
When it comes to malware, whatever runs first, wins - you want your AV loading before the bad stuff or it doesn't stand a chance.
dualboot@reddit
It's called a rootkit =)
agape8875@reddit
Exactly this.. Windows already has built-in solutions to detect rogue code at boot. Examples: Secure Boot, Secure Launch, Kernel DMA protection, Defender ELAM and more..
DreamLanky1120@reddit
No, no, no, don't set your stuff up right. Far too risky, you pay CrowdStrike, do the one-click installer and then blame them if anything happens to your critical infrastructure.
Only to be informed that there are terms and conditions (AGBs) that clearly state that you should not use their software on any critical infrastructure :)
It's the way. You could also ask ChatGPT and do whatever it says.
incidel@reddit
THIS!
McBun2023@reddit
In order to kill the malware, you must become the malware
MaximumGrip@reddit
At this point Crowdstrike IS the malware
omfgbrb@reddit
You either die a hero or live long enough to become the villain.
-- said somebody somewhere who isn't me.
McBun2023@reddit
"Know Your Enemy"
- ~~Sun Tzu~~ John McAfee
National_Summer927@reddit
It's a kernel module, that is the "OS"
cjpack@reddit
Yah how can they access the boot drive remotely? I thought this was not possible
KaitRaven@reddit
That's the strength (and weakness) of Crowdstrike. It can look for malicious activity from the moment the system turns on.
whythehellnote@reddit
s/look for/cause
RogerThornhill79@reddit
The people that covered the tracks of Hillary Clinton and her email servers.... Nice. I'm sure we can trust them as much we can trust Raytheon.
uptimefordays@reddit
Do you just, like, not know what cyber liability coverage is? Every policy requires EDR because tin foil crown wearers who "don't believe in updates" or "don't need anti varus spyware" got and kept getting ransomware.
baked_couch_potato@reddit
jesus fuck you people never fail to show just how goddamn stupid your beliefs are
charleswj@reddit
How dumb do you have to be to not even get the conspiracy theory right?
thejimbo56@reddit
RogerThornhill79@reddit
Skullclownlol@reddit
Why would you think this is scarier than a kernel-level driver that has access to everything anyway?
Coffee_Ops@reddit
Kernel level doesn't have access to everything on Windows 11.
HamiltonFAI@reddit
The app having kernel-level access, sure, but that kernel-level access being reachable remotely without the OS is another level.
xfilesvault@reddit
No, it can’t be contacted remotely without the OS.
It tries to update the definitions BEFORE applying them. But it doesn’t wait long.
So if your network is quick to initialize, like wired internet, it will download the updated definitions.
Otherwise, it applies the existing channel update and then crashes.
It’s a race condition. Sometimes it will fix, sometimes it won’t. Bit is not because they have something else crazy loaded on your machine.
It’s just the same kernel level driver that is running the first lines of code. The first lines of code MIGHT SOMETIMES succeed at fixing the issue that causes the crash later on in the execution of the driver.
Coffee_Ops@reddit
Sounds like you don't understand the level of access you give the vendor of your EDR.
Consider Defender if it bothers you.
maggmaster@reddit
As a sys admin this is the smartest comment. All the bad actors are watching this.
Moontoya@reddit
Doesn't that also suggest it's pre-encryption?
VintageSin@reddit
Linux admins out here just looking like
Skwalou@reddit
This makes no sense, why would you be scared to give control when you specifically hired them to protect your data? It's like being scared of your bodyguard because he is following you...
CosmicSeafarer@reddit
I mean, if they can do it then adversaries can do it, so wouldn’t you want that?
ChihweiLHBird@reddit
Many antivirus products run as a kernel module, which is why this could cause a BSOD in the first place.
Pixel91@reddit
Yeah but most don't run as rootkits.
crusoe@reddit
Every Intel server has a management engine that runs Minix with full network and file system access. The dedicated port should be on its own segmented network.
AMD servers have a similar feature.
AgreeablePudding9925@reddit
It’s not PRE BOOT but during boot. They load in with the kernel hence they’re there at the beginning of things. That’s how they can do what they do - including breaking things.
MoonedToday@reddit
My thoughts too. This sounds like a vulnerability.
progenyofeniac@reddit
The systems generally get to the login screen very briefly. It’s not a huge stretch that CS would be running by that point.
lilhotdog@reddit
I mean, that’s literally what you paid them for.
AGsec@reddit
But wouldn't that be necessary in terms of total security prevention/detection?
zlatan77@reddit
This ☝️
TheIndyCity@reddit
For real. We had <400 affected and it took us 24 hours to remediate manually; I can't imagine how you do this when your impacted endpoints run into the several thousands. Huge news if so!
Wolvansd@reddit
Not in IT, but we have about 9000 end users affected being manually remediated by IT. They call us, give us an admin login and directions to delete then reboot. 13 minutes.
My neighbor, who does database stuff, has maybe 2k end users; they just sent out directions and users mostly self-remediated.
Solidus-Prime@reddit
I had our entire company of 2k users up and running within an hour of being affected, by myself. Managed IT services are getting lazy and sloppy.
xfyre101@reddit
i dont believe you did 2k units in an hour lol.. just the fact that a lot of them required multiple start ups.. callin bs on this
tell_her_a_story@reddit
I too call BS. Our IT-staffed remediation center, organized to address remote users, was resolving 300 PCs an hour at peak on Saturday, with 50+ experienced techs using OSD boot drives. That's one every 10 minutes per tech. Insert drive, F12 for the one-time boot menu, select the USB, enter BIOS password, boot into WinPE, enter admin password, wait. Select the advertised task to resolve, let it run, reboot, log in to confirm it's resolved. Takes a bit of time.
LeadershipSweet8883@reddit
If they had it automated via PXE boot or did it like an assembly line, I could see it. You don't have to do it one at a time and sit there watching for 10 minutes. Have a team log into WinPE, set the computer to the side, do the next one. Have another team pulling from the pile to kick off a reboot, goes to the next pile. Have that team check the resolution and shut it down or stick it back in the queue if it didn't work.
xfyre101@reddit
he said he single handedly did 2k computers in one hour lol
xocomaox@reddit
In a perfect setting where all computers are connected to the PXE network and you have easy access to all of them, one person could do 2,000 computers in an hour. But most people don't have this kind of setup (especially in 2024) and it's not because of laziness or sloppy work.
This is why it's hard to believe the 1 hour claim of this person. Had they made the claim without the comment about lazy and sloppy, it would actually be more believable.
tell_her_a_story@reddit
PXE boot requires infrastructure in advance, not something we use. The remote users hardware is assigned to the individual and funded by their department. Stacking them up and running an assembly line to resolve would end up with hardware not returned to the rightful owner. With the shared/generic auto login computers, the techs most definitely kicked them off one after another and went down the line minimizing idle time.
LeadershipSweet8883@reddit
I was pointing out that the other user that did 2k end stations in an hour may have been able to PXE boot them.
The ownership issue is easily solved with a P-Touch label maker or a stack of sticky notes. Not completely necessary but if you are processing thousands of laptops then the throughput boost is probably worthwhile, especially since you can allocate techs based on the current size of the queue for each station.
I saw some places had Bitlocker keys printed on barcodes and inputted using a USB scanner - you can print the commands in barcodes as well.
tell_her_a_story@reddit
Fair enough.
Solidus-Prime@reddit
Like I said - lazy and sloppy.
nantuko__shade@reddit
You must not have BitLocker-encrypted drives.
Solidus-Prime@reddit
We do actually.
I'm 99% sure MS created the KB5042421 article based on my feedback to them:
https://www.reddit.com/r/msp/comments/1e7xt6s/bootable_usb_to_fix_crowdstrike_issue_fully/
nantuko__shade@reddit
That’s a clever solution but you did not create that bootable USB, distribute it to 2k end users, and have them all fixed “within an hour of being affected”. Which btw was approximately 2AM on Friday morning
Wolvansd@reddit
It's all of our own internal IT folks doing it; no contractors.
Work in the utility industry (w/ nuclear) so yah, it's been awesome.
No-Menu6048@reddit
how did u do it so quickly?
AromaOfCoffee@reddit
I've had it take 15 minutes when the end user was a techie. The very same process is taking about an hour per person when talking through little old lady healthcare admins.
narcissisadmin@reddit
Or the hunt and peck person who doesn't get the 48 digit recovery key entered before it times out. Good times.
AromaOfCoffee@reddit
yeah like good for this guy and his ability to follow directions, but that's not most people.
jack1729@reddit
Typing a 15+ character, complex password can be challenging
AdmMonkey@reddit
That probably mean they got a 8 character local admin password that never change...
Ok_Sprinkles702@reddit
We had approximately 25,000 endpoints affected. Remediation efforts began soon after the update that borked everything went out. As of yesterday afternoon, we're down to fewer than 2,500 endpoints still affected. Huge effort by our IT group to manually remediate.
Far_Cash_2861@reddit
Manually remediate? According to George it is a 15 min fix and a reboot.....
FGeorge
tell_her_a_story@reddit
We began remediation at 2am on Friday. At that time, we were booting into safe mode, unlocking the drive via Bitlocker, logging into the PC using a local administrative account with passwords pulled from LAPS ui, deleting the file, then rebooting and logging in using domain credentials to ensure everything came back up.
Depending on how many tries it took to actually get into SafeMode, it varied from 10 to 20 minutes per machine.
By Saturday morning, we had a much more streamlined process to resolve it.
TheIndyCity@reddit
Insane effort, well done
BattleEfficient2471@reddit
Assuming VMs you write a script to mount the disks to another machine and delete the file.
We did this.
TheIndyCity@reddit
Yep that’s how we ended up finishing it off, just took a bit for the script to get the kinks worked out and unfortunately had to deploy it individually to each machine
b_digital@reddit
For VDIs, it’s pretty straightforward to do it quickly, remotely, and en masse with software such as Pure Rapid Restore or Cohesity Instant Mass Restore
HiddenShorts@reddit
400? We had over 2k servers, 17k devices, estimate 80-90% were impacted and manually fixed. We had probably over 200 people engaged at peak on Friday, with likely 150 or so on the ground fixing devices.
lolSaam@reddit
Didn't realise this was a dick measuring competition.
joshtaco@reddit
...this was literally known the morning of the outage. Why is this all of a sudden news to people? I swear, during emergencies, the research portion of IT issues just goes out the door. The only caveat is like they said, a wired connection is recommended as it's basically a race condition against the bug check.
Arkayenro@reddit
that seems like a massive security nightmare knowing that their stuff (and god knows what else) can communicate and update pre/mid boot cycle.
bobsmith1010@reddit
an automated intervention for an issue they caused.
TechManPro@reddit
They reported this to my company as well, but after several machines rebooted 50+ times, we found the manual remediation was actually faster, and more reliable, unfortunately.
NightShaman313@reddit
Does not work if using Global Protect VPN.
Jose083@reddit
They’ve placed the notice here
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
Fresh_Dog4602@reddit
myea but not really explaining what it is they do.
Jose083@reddit
Why wouldn’t you trust crowdstrike and the hidden stuff they do inside a critical directory of your system?
Let’s hope they passed QA on this one.
bmyst70@reddit
Let's hope they actually DID QA on this one. Their initial update smells like "Developer pushed crap that wasn't even sanity checked before being sent out to the world."
fishfacecakes@reddit
It was supposedly package corruption, which means they do no signing, or, the version they tested isn’t the version they signed. Either way terrible for a security company
BattleEfficient2471@reddit
So they don't QA the finished product?
fishfacecakes@reddit
Yeah it seems like no. Or they do, but then don’t sign that, which seems worse
BattleEfficient2471@reddit
If they sign it, they would need to QA it again.
You should always QA the exact same process with the same files as prod.
fishfacecakes@reddit
You QA the files you’re sending to prod. Then, you sign them to know the same files you’ve QA’d are the ones in prod, unmodified
BattleEfficient2471@reddit
If you signed them, you modified them. Assuming signature is in file and not a separate sig file.
So test again. Unless it's exactly the same bytes, test again.
fishfacecakes@reddit
I’m talking detached signature files for this very reason
BattleEfficient2471@reddit
At that point you might as well just supply hashes, I mean honestly they should always be doing that with any file.
fishfacecakes@reddit
If you’re just supplying hashes though, then any threat actor in the chain can sub in their own files and their own hashes. If the client is verifying against a known signing key, signing the files is a much more secure way of doing it.
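Something like this, for illustration (Ed25519 via the cryptography package; the key handling, file names and algorithm choice are placeholders, not anyone's actual pipeline):

```python
# Hedged sketch of a detached-signature check as described above. The client
# verifies the definition file against a signature produced by a key it already
# trusts; a swapped file or swapped hash fails the check.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_channel_file(data_path: str, sig_path: str, pubkey_raw: bytes) -> bool:
    with open(data_path, "rb") as f:
        data = f.read()
    with open(sig_path, "rb") as f:
        signature = f.read()
    try:
        Ed25519PublicKey.from_public_bytes(pubkey_raw).verify(signature, data)
        return True    # exact QA'd bytes, signed by the holder of the private key
    except InvalidSignature:
        return False   # corrupted in the pipeline or tampered with: refuse to load
```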
BattleEfficient2471@reddit
Well if the bad actor can upload files and hashes, he probably has access to the private key as well.
The stories I could tell about developers. You ever end up in Buffalo NY, you let me know.
fishfacecakes@reddit
Sounds like a plan - cheers :)
honu1985@reddit
You will be surprised how many software companies in the world operate without QA. Heck, even MS: they don't have QA and rely on devs' unit tests and just push out. They ask devs to write testable code in the first place but still...
bmyst70@reddit
Apparently not. Nor do they even do a simple MD5 checksum comparison to confirm the update definitions are valid.
You know, even ClamAV does that for its virus definitions.
Xalenn@reddit
I'm still surprised that they were able to get WHQL cert for a program that runs external, untested code at that level.
Jose083@reddit
Think it’s because it’s the definition files that are getting out and breaking stuff, but the driver itself is WHQL certified.
I guess given the nature of the product you can’t wait for WHQL turnaround on every definition file, for obvious reasons.
Still, a 10-minute QA stage would have caught the problem.
HerbOverstanding@reddit
They are simply quarantining the bad file. Sigh, if I'd had the forethought I would've just created an IoC for that hash. I imagine, though, that there's probably more to their method than simply an IoC hash blacklist.
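For the hash part, that's just a SHA-256 of the bad channel file(s), which you could then feed into whatever IoC/blocklist workflow your console has (the console side isn't shown; path per the public guidance):

```python
# Hedged sketch: compute SHA-256 of the offending channel file(s) so the hashes
# can be added to an IoC/blocklist. Only the hashing step is shown here.
import glob
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for path in glob.glob(r"C:\Windows\System32\drivers\CrowdStrike\C-00000291*"):
    print(sha256_of(path), path)
```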
Dickbluemanjew@reddit
So you mean to tell me that now you idiots are giving this company free rein to have remote access? Lol. What could go wrong.
stulifer@reddit
Desperate times
Dull-Sugar8579@reddit
My thoughts on this whole thing.
EmicationLikely@reddit
Yeah... too late, Crowdstrike. Everyone affected has already started (and likely finished) manual remediation. My comment in that meeting would have been "OK, show me." There is no way this should have reached the number of endpoints it did. Either they don't have good procedures for staged rollout of updates, or someone was allowed to go around that process. Either way, it shouldn't have gotten past internal testing or a small, early-adopter group of endpoints. Prove me wrong.
Least-Music-7398@reddit
I found a CS article validating this post is not BS. Sounds like good news for impacted customers.
Taboc741@reddit
Can you post that article?
kuahara@reddit (OP)
I asked during the meeting for a publicly accessible info page on this and they led me to their 'blog'. This was the best that was provided. The green box at the top alludes to it. I believe there's more specific information locked behind individual customer logins.
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
Nightcinder@reddit
One thing I can't stand about CRWD is the fact that all documentation is locked behind paywall
Bernie4Life420@reddit
Redhat too
BloodyIron@reddit
Redhat is locked behind a loginwall, not a paywall. You can create free accounts to get to almost all the documentation (if not all?) while spending literally no money nor any blood of the innocents.
TechGoat@reddit
Yeah, Commvault (our backup provider software) switched from public free for all to 'accounts needed' for most of their docs a few years back. When I told them it made it kind of annoying to share my findings with the members of my team that aren't directly involved with commvault and therefore don't have accounts, they apologized and said it was to cut down on scrapers
BloodyIron@reddit
lol and what problems exactly do scrapers cause? And have they not heard of robots.txt? That's silly of them to do, but I hear you. Yuck.
Rare-Page4407@reddit
a lot of spiders ignore robots.txt
nappycappy@reddit
that's bs. there is information I've looked for for their stupid IdM that is unavailable even with a basic login.
BloodyIron@reddit
Mind providing some examples pls?
nappycappy@reddit
well shit... I guess I'll have to take that bs comment back. I just signed up for the developer account from a link here and now it lets me see the ones I had been looking at in the past.
BloodyIron@reddit
Well I can't speak to the ones that gave you problems in the past. For all we know, that could have been a bug :) But here's to you for trying again! nice! :D
broknbottle@reddit
No it’s not. You just need to sign up and enable the no cost developer stuff.
Advanced_Vehicle_636@reddit
Red Hat does not require a paid subscription for any of the documentation I've read - and I've read a stupid amount of RHEL documentation over the last few years. RHEL only requires you to login. You can do that with a free dev subscription.
I got my RHEL account the same time I got my development subscription which was completely free and came with no requirements to buy RHEL. Though to be fair, we have a paid RHEL subscription now, so it'd be hard for me to tell at this point.
FWIW: I think it's marginally less stupid that they login-lock their documentation [than paywalling it], especially considering CentOS and Fedora documentation is nearly as applicable (... and free ...) as RHEL documentation is. But it's still stupid.
Also: RHEL documentation in my experience is usually extremely handy. If you don't have an account and work with RHEL or derivatives (incl. Fedora, CentOS, Rocky, Alma, and Amazon), I'd highly recommend getting a free account.
pizzalover101@reddit
I signed up for the red hat developer program (16 licenses for free) and have not found any documentation locked away behind a paywall.
https://developers.redhat.com/about
Hotshot55@reddit
You don't need an active subscription to read RedHat's articles, just have to sign in.
BondedTVirus@reddit
Depends on what you're looking for. I encountered "subscription required" just last week. 😩
thejohncarlson@reddit
SentinelOne has entered the chat.
Nightcinder@reddit
s1 locking sentinelsweeper behind support pisses me off
lordmycal@reddit
But also understandable since it could be used to remove S1, which is something adversaries have a vested interest in.
Nightcinder@reddit
You need to be in safe mode anyway; makes no difference.
Sweeper doesn't even work in my experience, I had to do it without the app
wilhelm_david@reddit
security through obscurity is no security at all
technobrendo@reddit
90% of "enterprise" software did too
DarthPneumono@reddit
RedHat's documentation is free, but requires a sign-in.
hornethacker97@reddit
Red Hat has never locked their documentation behind a paywall, and in fact they cannot be open source and also lock their documentation behind a paywall.
Rare-Page4407@reddit
/r/confidentlywrong
MrHaxx1@reddit
Why not?
ByTheBeardOfZues@reddit
Yeah I've always been able to access documentation. I have had to log in for solution articles though.
R8nbowhorse@reddit
That could not be further from the truth.
EWDnutz@reddit
Yup. I've noticed the same for a lot of platforms and it's terrible.
At least make health/status pages publicly viewable....
TechIncarnate4@reddit
Why is this an issue? The product is behind a paywall. If you pay for the product, you have access to the documentation.
cassiopei@reddit
Unless your password servers are in a boot loop due to a bluescreen.
Sure, eventually they will get the credentials and pass them around, but why make it extra hard to access the support documentation for a group of people that may be affected.
QTFsniper@reddit
The techie / knowledge seeker in me hates this, but the counterpoint I could see is "if you want to see and read how our stuff works, be a customer, pay, and support us," and I could kind of get it, even if I don't like it. I could see bad actors using it for knowledge, or just them saying buzz off, you're not our customer.
Definitely not supporting the practice, just curious what others think about the validity of that mindset.
Ok_Fortune6415@reddit
Why would I pay before seeing and reading how your stuff works? That makes no sense. Yes, let me become a paying customer based on sales buzzword vomit.
QTFsniper@reddit
Probably how they get you to set up a time-limited trial account, sit through sales calls and demos to find out more.
chkltcow@reddit
Making me sit through sales calls and demos to get even the basic information about your software is the #1 way to make me NOT be a customer. This is a terrible idea.
independent_observe@reddit
That's the IBM way
QTFsniper@reddit
Of course it's a terrible idea, never argued that point. Btw, I'm not part of any sales org or company that does tech services.
spacelama@reddit
It's the kind of thing that makes me take shit off my CV though. I prefer working with open technologies where I can actually research and fix any problems that I encounter without vendor encumbrance.
i_am_fear_itself@reddit
You asked for this because you recognized the importance of this meeting and knew before the words came out of your mouth that you were headed right back to this sub to share with those who are still burning the midnight oil what you learned.
I'm not sure there's a finer example of the spirit of this sub. Well done, lad / ladesse.
flatvaaskaas@reddit
Hmm, I only read that they have a new method with an opt-in, but no explanation of what this is or how it works?
Do you have any other information about this?
daweinah@reddit
I can confirm. My CSM jumped on a Zoom a few minutes after I asked and gave me specific language to put in a ticket with Falcon Complete. A few hours later, Cloud Remediation was enabled on my hosts.
Least-Music-7398@reddit
Does it work? 100% effective on all? Require multiple reboots or did all the BSODs need a few minutes to take the fix then a restart?
traydee09@reddit
So this must then happen after the network stack is loaded and activated, but is it WIRED specific? Will it work for users who are at home and on wifi?
Big-Slide7304@reddit
By any chance is that CS article searchable? I searched for cloud remediation and automated remediation but can't find it. Either way, I've opened a tech support ticket to get information on opting in for automated remediation / cloud remediation. I'm a little worried though, because they are so swamped and won't get to my generic ticket, since I don't know the exact steps I should be following and just opened a general ticket.
Least-Music-7398@reddit
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
MrStealYo14@reddit
have a link for that?
rose_gold_glitter@reddit
This sounds a lot like the reason Microsoft was suggesting 15 reboots - each one edges you closer to a download of the update needed to fix it.
CopperKing71@reddit
Giving AV vendors access to the kernel is what started this whole mess….
LamarLatrelle@reddit
The disease and the cure.
ArmedwWings@reddit
This is basically what we did using ConnectWise Control. Queued the command in the portal and then restarted the device. Sometimes took a couple tries but worked really well.
Also, using invoke-command locally works too.
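For anyone curious, a minimal sketch of the Invoke-Command variant (hedged: assumes WinRM is reachable and the host stays up long enough for the session; the computer name is a placeholder, and the path is the widely reported channel-file location):

    # Remove the bad channel file over a WinRM session before the next crash
    Invoke-Command -ComputerName 'PC-0123' -ScriptBlock {
        Remove-Item 'C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys' -Force
    }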
dpdpowered83@reddit
Why didn't CrowdStrike do this to begin with?
Six_O_Sick@reddit
So how is this supposed to work? Network Connectivity loads before the faulty driver, checks for updates and fixes itself?
sockdoligizer@reddit
It's still a race condition. The CrowdStrike agent loads at boot and does many things. Two of those things are checking the cloud for updates and validating all of the content modules it already has. If the agent checks the cloud, gets the update, and applies it before attempting to load the faulty module, it will get fixed. If the module wins, you keep blue screening.
To everyone saying why didn't they release this Friday: they didn't have this available Friday.
To everyone else, CrowdStrike did have this available Sunday evening. I know because my rep told me about it and I sent it to the infra teams in my organization. I don't know why people are having to meet with their reps to get answers.
Is this the same poster that got fussy over the weekend that he had to hear about crowdstrike news from some engineer on twitch? What a guy
hebuddy69@reddit
they didn't have it available on Friday, of course, only the following week when everyone had already gone out to hundreds if not thousands of machines and racks to fix them manually.
kfelovi@reddit
We worked on weekend, they didn't.
Unable-Entrance3110@reddit
Yeah, something isn't right here. This seems to me that CS is purposely making it seem as though the remediation was onerous and risky hence all the pseudo-legal opt-in BS and delays.
The functionality was already there in the client to delete any file they want (obviously) and I would be willing to bet that several employees pointed this out immediately. But CS doesn't want to advertise this fact or wants to make it seem very difficult to do so as not to open themselves up to even more scrutiny and possible liability.
BalmyGarlic@reddit
You would think they would have an IR plan in case something like this happened. I'm guessing they don't or their IR plan is to coordinate all communications through their insurance, who advised them that meeting with reps and requiring opt-in was the way to go rather than blasting out the message on all platforms and/or requiring customers to opt-out.
Infamous_Sample_6562@reddit
My client’s legal department is compiling all of the overtime they had to pay us to remediate. It’s not going to be cheap. About 14k out of 80k endpoints were affected.
dragon788@reddit
According to a lawyer who read their ToS the most your legal department might get out of them is a refund of fees paid, unless they are willing to bet a lot of money and time on a lawsuit and convince a judge in East Texas they'll get a cut of the proceeds.
Infamous_Sample_6562@reddit
They have a massive legal team.
SpetsRu@reddit
Yep, sounds about right... That fine print gets you every time..
rdhdpsy@reddit
lol fuck we manually fixed 2700 servers mostly by the azure vm repair command so not that bad, but we had a bunch of osprofile issues which the script couldn't deal with.
maxcoder88@reddit
What did you use as a script?
rdhdpsy@reddit
if you are in azure there is a specific az cli vm repair command just for this issue.
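Roughly, that flow looks like the below (hedged outline: the vm-repair extension builds a helper VM, attaches the broken OS disk, runs a fix script, then swaps the disk back; the run-id is the one Microsoft circulated for this incident and may have been revised since, and resource group, VM name, and repair credentials are placeholders):

    # Verify the current run-id against Microsoft's own guidance before using this
    az extension add --name vm-repair
    az vm repair create  -g MyRG -n BrokenVM --repair-username repairadmin --repair-password 'P@ssw0rd123!' --verbose
    az vm repair run     -g MyRG -n BrokenVM --run-id win-crowdstrike-fix-bootloop-v2 --run-on-repair --verbose
    az vm repair restore -g MyRG -n BrokenVM --verbose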
Nik_Tesla@reddit
That's really good news... but it makes me sad about all of the IT folks who absolutely killed themselves this past weekend to do it all manually. Especially those on salary that are just going to get a pat on the back and a starbucks giftcard at most.
The best thing about this is, all those devices that are with remote users or at some far away location, that they weren't able to get to yet, can be fixed. I was thinking this was going to drag on for weeks with the last 10% of devices at each company taking a long time to physically get to.
Pork_Bastard@reddit
Not everyone works for a slave-driving shithawk. My firm would've paid full overtime and fully given us major props! Luckily we weren't using CS as our EDR, but they've proven themselves many times, including a major breach 5 years ago. Don't put up with assholes.
Nik_Tesla@reddit
Not everyone, but clearly enough work for slave driving shithawks, considering this is the top of the subreddit right now
https://www.reddit.com/r/sysadmin/comments/1ea9lpr/so_who_else_is_looking_for_a_new_job_after_how/
bageloid@reddit
They just pushed this to all clients in the US-1, US-2 and EU tenants.
poorleno111@reddit
Yeah, just saw that. I think this will probably be what gets us to move on from them even more. We didn't want their fix as we don't trust them, and then they just pushed it anyway. I'm hoping our legal team comes down pretty hard on them.
VedantaSay@reddit
How do you qualify best?
markdacoda@reddit
https://www.youtube.com/watch?v=XrrryadgchI
kirashi3@reddit
I live for Shitty Internet Mashup Music Videos™ (aka soundclowns) so thank you for sharing - this made my evening.
CTeeO@reddit
From my CS Rep:
As you noted a fix was shared on our support pages where we requested Customers to opt-in. Following extensive testing, this fix was subsequently deployed across all Customers without the requirement to opt-in. In most cases this means the affected hosts only require rebooting for the remediation to be complete. An update announcing this change was pushed out at 2237 UTC on the 22nd July and can be found at the following page: https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
BattleEfficient2471@reddit
The same extensive testing done on the file released friday?
xpkranger@reddit
This is for end users? Yeah, they all lost their Ethernet dongle a year or two ago.
LNGU1203@reddit
Every IT environment is different. Yours must be all cloud vms and such. For hybrid or on-prem, how? LOL
BattleEfficient2471@reddit
If they are on prem VMs, you write a simple script.
Power off BSODing VM. Mount its disk to another machine. Have that machine delete the file.
Tada. No human interaction needed. You can do it all in powershell/powercli for vmware.
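A hedged PowerCLI sketch of that approach (VM names, guest credentials, and the drive letter the attached volume comes up as inside the helper are all assumptions; the helper needs VMware Tools running for Invoke-VMScript):

    # Power off the broken guest, attach its OS disk to a healthy helper VM, delete the file, detach, boot
    Connect-VIServer -Server 'vcenter.example.local'
    Stop-VM -VM 'BrokenVM' -Confirm:$false
    $osDisk = Get-HardDisk -VM (Get-VM 'BrokenVM') | Select-Object -First 1
    New-HardDisk -VM (Get-VM 'HelperVM') -DiskPath $osDisk.Filename | Out-Null
    # E: is an assumption for where the attached volume mounts inside the helper
    Invoke-VMScript -VM 'HelperVM' -ScriptType Powershell -GuestUser 'Administrator' -GuestPassword 'P@ss' -ScriptText 'Remove-Item "E:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys" -Force'
    # Detach (do not delete) the disk from the helper, then power the original VM back on
    Get-HardDisk -VM (Get-VM 'HelperVM') | Where-Object Filename -eq $osDisk.Filename | Remove-HardDisk -Confirm:$false
    Start-VM -VM 'BrokenVM'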
BattleEfficient2471@reddit
Only many days too late.
I still want to hear how this was possible; since it clearly was, why didn't any QA process catch it?
Far_Cash_2861@reddit
CS...QA Process.....
LOL
BattleEfficient2471@reddit
This is what happens when execs are allowed to pick software.
Mind you most of the security software is a total joke. Looking at all you companies wanting me to turn off selinux to run your rootkit.
bebearaware@reddit
So just to go over this process
Voila.
And Crowdstrike aren't explicitly talking about disabling fast boot?
RoadRunner_1024@reddit
wrong, this new "fix" uses crowdstrike to quarantine the affected sys file. so if the pc can boot far enough to get the list of IOC's before the system crashes, then crowdstrike quarantines the file and the issue is fixed. no scripts involved. no connections to the internet (other than connecting back to crowdstrike which has always happened.)
sounds great in theory... in practice I'm not seeing much success
bebearaware@reddit
Yes, connecting to the crowdstrike servers counts as connecting to the internet.
Michagogo@reddit
My understanding is that it’s not a separate service, it’s the regular agent going through its startup sequence. Part of that is establishing the connection with the backend, and going through the various communications/checkins that entails. One of those is checking for new content updates, which is why even before this new development it was possible that it would win the race and fix itself before the crash. This new remediation method uses a different type of command that gets pushed down at an earlier phase of establishing communications, so it has a higher chance of winning the race.
bebearaware@reddit
I wonder what those dependencies look like.
KaitRaven@reddit
The network interface would be protected as soon as it comes up. Otherwise Microsoft has a gaping security hole on their hands.
rastascott@reddit
Someone should tell Delta Airlines about this option.
frankztn@reddit
Honestly, I think it was created for them (among other big infras). My wife works at Delta and there are literally not enough IT guys to even service her laptop at a moment's notice, let alone manually remediate all of their workstations. Delta has probably lost billions because of this.
kungfu1@reddit
Man, no kidding.
BigToeGhost@reddit
Did they communicate this method? My company had 1,117 servers not pinging and every one of them had to be touched.
LucyEmerald@reddit
Yep, they are letting csagent eat itself and then auto-repair; just raise a support ticket. Although how it works has nothing to do with CrowdStrike's position in the early startup process, you're fighting a race condition so that the TCP/IP stack launches first.
Cauli_Power@reddit
Still can't see this working with anything other than a clear tcpip connection. Network auth isn't loaded that early in the process.
LucyEmerald@reddit
That's why it's a race condition, the whole world has been rebooting their devices crazily and people are beating it
Cauli_Power@reddit
I've done lots of work in Windows PE, Linux, and PXE, and there's a LOT of stuff that has to happen before Windows can communicate over the TCP/IP stack. Regular Windows 11 with all the options has a bunch of kernel drivers, a bunch of non-kernel drivers, and then a firewall. Anything using wifi or enterprise auth has to load (frequently from the current user space), log in, get DHCP, and then apply any tertiary traffic rules like proxy, etc. All that loads AFTER the kernel drivers.
I have a hard time believing that Clownstrike is somehow able to bypass all that via some kernel shim that they cooked up. Even if they did, it would hose the OS later on when it looks at the adapter and finds the CS process using it.
Maybe I'm a little behind the times, but a magical driver that recognizes every NIC or wifi adapter in existence, loads before the network stack, and then creates a socket to their update servers seems a little unlikely, unless they created a mini preboot environment compiled at install time.
That IS possible but wouldn't have gone unnoticed....
LucyEmerald@reddit
There's nothing to talk about; go read the work done by Microsoft and CrowdStrike. What I explained is what's presently happening.
Dracozirion@reddit
This
photinus@reddit
Based on the feedback I've seen it's about as hit or miss as the reboot repeatedly and hope for the best route.
medievalprogrammer@reddit
Ya, we enabled it yesterday as we have like maybe 40 systems left to fix and I don't think it worked for any of them.
sabstandard@reddit
It has been hit or miss for us as well; we've had better luck with PCs that don't have to VPN in. We have always-on VPN.
pr0t1um@reddit
Bwahahahahahhahahahahahahab...etc.
Jtrickz@reddit
Legal teams are gonna be all over crowdstrike if they lost this much time and they had a cloud fix they could have deployed… this seems a bit backwards…
Secret_Account07@reddit
We are mostly fixed now, but this is incredibly helpful info. Sharing internally.
Good on you for posting this. Also, fuck Crowdstrike.
Far_Cash_2861@reddit
upvote just for the "fuck crowdstrike."
more specifically, fuck george. He was CTO at McAfee and did the same thing 10 years or so ago.
Secret_Account07@reddit
Yep. So this issue has been described at length, so I won't go into that. But today I realized the following:
Prior to Friday, CrowdStrike had no process to remediate bad content files that crashed the OS (kernel) at boot. If they had thought of this prior to Friday, it wouldn't have taken them 3 days to stand up a cloud-based solution. So even looking past the lack of QA/testing... what was the plan BEFORE Friday if you released a botched file that crashed the kernel? They know full well their driver is loaded by the kernel and references all updates/content files. You DON'T give customers the choice to stage content updates (test env, prod, dev, etc.), so wouldn't anyone with half a brain figure out you need a process in case you brick the kernel/boot process?
So I will give them credit that they are owning this specific fuck-up (kinda), but what was your game plan prior to Friday for this kind of issue?
Alternative-Wafer123@reddit
Has their solution been tested as well? :)
Far_Cash_2861@reddit
You are asking if CS did QA testing.....
LOL
BitOfDifference@reddit
So why wasn't this posted on the day of the outage? It could have saved people a ton of time and weekend work. Posting a fix 4 days later is only going to help those still down; everyone else has already spent their resources and had downtime.
thepottsy@reddit
We were advised of this earlier this afternoon, but by that time, it was kind of a moot point as we had already remediated well over 90% of systems.
They SHOULD have simply just implemented this during the day on Friday, without the silly opt in bullshit.
kuahara@reddit (OP)
While I completely agree, I'm guessing after a screw up this big, they were real nervous about mass releasing anything else to the world.
thepottsy@reddit
Fair, but they already did when they replaced the sys file shortly after the fuckery.
what-the-puck@reddit
Sort of - that was documented functionality. Effectively a definition update.
This, not so much. This is a new process, intervening early in the CrowdStrike startup process and deleting files.
thepottsy@reddit
I truly do understand that. I’m simply saying that they apparently have this capability. Why are we only hearing about it today? Over 72 hours after the shit storm.
codewario@reddit
Apparently this is something they cooked up in response to the outage, and is new functionality. That's what I'm being told about why it wasn't made available sooner; it took a few days to get the remediation written and tested.
crankyinfosec@reddit
Careful, asking questions like this will get you downvoted by CrowdStrike employees. My CISO made the call after this news that we're not renewing and will be transitioning. This will get you downvoted also. I used to work for 2 AV vendors; I have friends across this space and several at CrowdStrike. Apparently people have been linking to 'problematic' comments on Reddit so people can 'manage' comments.
Cmonlightmyire@reddit
I mean crowdstrike is literally bundling "URLs that magnify negative sentiment" with actual malicious URLs so... yeah its been frustrating to deal with them
thepottsy@reddit
Nice lol
SimonGn@reddit
To me this is the worst part. Not even a note "We have a potential method of fixing through a cloud update which runs before the crash, if you can wait a few days or weeks for us to develop and test this method, you might want to hold off on fixing those hosts manually if you can wait for the automatic fix"
Unable-Entrance3110@reddit
Except that the point of a definition update is to attempt to identify malware with the point of obliterating it. If that malware was in the kernel and obliterating it would have caused a BSOD, that would be considered CS working as intended. Why does the source of the malware make any difference?
KaitRaven@reddit
Crowdstrike is designed to track unusual activity during startup and interrupt it if needed, so that aspect is not too surprising. It is interesting to know just how early the agent is communicating with the servers though.
drnycallstar19@reddit
Correct, exactly my point. This could have saved us a shitload more manual work.
Especially how simple their fix is. It’s not complex at all. Simply doing automatically what we’ve had to do manually over the weekend.
sm00thArsenal@reddit
Yup, the fact that this is possible but it took them nearly 4 days to release and even then as an opt-in is almost worse than it not being possible.
BalmyGarlic@reddit
Or if you're going to require an opt-in, then blast it out to every client in your system via email and robocall to get those opt-ins, or direct people to where to do it. Also instruct your call center to do the same thing, to get the clients without working phones and email back up. Also post the instructions on your website and blast it out via social media.
There are much more efficient communication methods than scheduled meetings...
drnycallstar19@reddit
Yeah I was thinking the same thing. Not sure why It took them 3 days to release this “fix”. Doesn’t seem like such a big thing to implement.
codewario@reddit
I'm confused after reading this: https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
I'm not understanding this. Rebooting 15+ times was already said to help by MS. I guess this "cloud remediation" opt-in thing makes it more likely that a reboot gives enough time for the fixed definition to be applied, according to this thread, but I don't see anything about "cloud remediation" except for how to recover nodes on AWS, Azure, and GCP. I don't see anything about what is stated in this thread on the remediation page published by CrowdStrike.
Fallingdamage@reddit
Is this a new feature, or did CS just wait 5 days before making this more apparent to its customers while they ran around losing their minds trying to do this manually?
When the remediation instructions were released, why wasn't this mentioned?
Far_Cash_2861@reddit
crowdstrike is in for a ton of lawsuits. Contractual double speak will not protect them from negligence.
Desperate-Tip6702@reddit
We've been doing this manually on all of our machines, smh, but have been running into issues because some machines are secured with BitLocker, so you need to run the key unlock command before you run the del command.
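For reference, that sequence from the recovery prompt looks roughly like this (hedged sketch: the drive letter and the 48-digit recovery password are placeholders, and in WinRE the OS volume may mount under a different letter than C:):

    manage-bde -unlock C: -RecoveryPassword 111111-222222-333333-444444-555555-666666-777777-888888
    del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys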
Ok-Garden1663@reddit
Reboot my cruise I couldn't get to.
PweatySenis@reddit
Please tell me you got reimbursed or at least a free reschedule
slyboon@reddit
Management here decided not to opt in. We supposedly had another 750 machines or so left to remediate at EOB yesterday and should finish up today. Guess they decided not to trust Crowdstrike, but it would have been nice if some of these had been done.
Oh well
IamPun@reddit
By the time you've rebooted it 3 times, I'd already be done fixing it with the Microsoft CrowdStrike remediation bootable utility.
kuahara@reddit (OP)
2000 users all rebooting their own computers is going to happen a hell of a lot faster than you running around with a bootable remediation tool
joshbudde@reddit
We were already having success with just having users connect their computer to a wired connection and reboot continuously until it straightened itself out. It started working on Friday after they pulled the update, sometime in the afternoon (Eastern time).
zmaile@reddit
Am I the only one who sees this as an intentionally hidden backdoor that is being used 'for good' in this particular instance? I'd be curious to know if this feature is heavily obfuscated in the binaries.
Aeeaan@reddit
Crowdstrike doesn't need a back door as they've already been handed the keys to the front door.
Obvious_Mode_5382@reddit
Exactly why supply chain hacks are a thing, eh?
cvsysadmin@reddit
Yes, you are the only one. All it does is quarantine one of their own files. They (as illustrated over the past few days) already have system/driver level access. There is no secret there.
DenverITGuy@reddit
How is this different than the original "reboot up to 15x" fix provided on day 1?
What about the opt-in program makes that more reliable?
KaitRaven@reddit
I think that depended on the normal Crowdstrike Update process replacing the file, whereas this is an explicit command to remove it. Probably works a little faster as a result.
watchthebison@reddit
We got offered this earlier today and you’re right. Was told by an engineer the quarantine of the file has a higher priority than fetching new channel-files, resulting in a higher success rate.
Decided to sit on it because we are nearly fully operational again through the manual fixes and the # of clients they quoted having remediated automatically was much lower (at the time they offered it). Felt a bit risky.
knifebork@reddit
THIS might be a good reason for requiring an opt-in. If someone has already been adequately repaired, it could be dangerous to make other intrusive changes.
throwaway9gk0k4k569@reddit
It's not. OP is just dumb.
Avas_Accumulator@reddit
Thanks. Seems 1% of our machines show up as down under Portal -> Next-Gen SIEM -> 'Hosts possibly affected...'
I'll ask CS to add our CID as described. The hard part in 2024 is finding computers that can even take a cabled network connection.
Alternative-Wafer123@reddit
Have they tested this solution too?
solracarevir@reddit
This was their fuck up; why do users need to submit a ticket and ask for a fix? Apply it to every customer and you would have fixed this fuck up in a few hours.
Incredible!
Haru24@reddit
Microsoft recommended, on the day, doing up to 15 reboot cycles. It works on the same premise as this, but relies on a fast update of the CrowdStrike process. I am happy to hear that CrowdStrike is expediting that process, because it could take 4 reboots or 15 reboots, and with blue screens in the middle it was a slow process. Manual remediation was faster. We only used it for off-site computers where we knew having the user do it would restore service faster than we could get onsite.
Wendals87@reddit
Why the heck is this opt in? Just blacklist it for all and push a new update
poorleno111@reddit
Probably because legal / risk is involved in a lot of their decisions at this point.
anna_lynn_fection@reddit
Nothing like worrying about the consequences of playing with fire when you're already fully engulfed.
Unable-Entrance3110@reddit
Yeah, "we better not light a match, we could start a fire" says the company whose house is engulfed in flames.
Turtledonuts@reddit
Because some guys at a government agency are currently panicking about the idea of a random company being allowed to remotely edit critical directories in all their endpoints during startup.
randomdude45678@reddit
If all you have to do is opt in, they’ve had this ability all along and will moving forward
kirashi3@reddit
Ah, yes, the same people who deliberately installed the same software that already has this functionality. The logic behind some institutions will never cease to amaze me.
hotfistdotcom@reddit
why are people JUST hearing about this late on monday? Like that's good, but also, that absolutely fucking sucks for the millions in labor pissed away over the weekend.
barkingcat@reddit
cause you needed the support contract to hear it from your rep.
This is the last chance Crowdstrike has to make money from a big portion of their clientbase, might as well charge people for the fix if they're likely to not renew.
sol217@reddit
...can you even use crowdstrike without a support contract?
hotfistdotcom@reddit
My god, seriously?
I am so happy we turned down CrowdStrike.
hebuddy69@reddit
it's literally a dogshit AV; the only things our security team likes about it are Fusion workflows and the ability to hone in on specific assets with advanced search event queries.
we're moving to Elastic thankfully, won't be until end of year however
Unable-Entrance3110@reddit
If they had this capability the whole time why would they wait 3+ days to offer it? Something is fishy here.
Defeateninc@reddit
Thank GOD!
I am going to call my rep right now. After doing the 2000th machine manually I am DONE!
Dull-Sugar8579@reddit
I hope you're not paid hourly.
ReanimationXP@reddit
Nice of you to post this in sysadmin instead of /r/CrowdStrike..
BattleEfficient2471@reddit
You mean a place where it may well be removed by CrowdStrike employees?
phillymjs@reddit
I don't know about the rest of you, but most of my users act like I'm asking for one of their kidneys when I ask them to connect to a wired network. And the home-based ones seem to bury their routers in the most inaccessible areas of their homes.
hsoj700@reddit
This + company laptops that only have USB-C🤬
AromaOfCoffee@reddit
let's be real, you have to try to find a laptop with ethernet these days.
tisti@reddit
Is it really an issue? Usb-c docks usually have all the required bits.
BalmyGarlic@reddit
I hear you. At a previous job, we refused to assist with numerous issues if the user wasn't wired in once we identified WiFi as a culprit. It was in the remote work agreement so users had to agree to be wired in to work remotely (hello COVID). Had a lot of push back, especially from management, until management did it and realized that this solved so many of their issues. Took months of us holding our ground during the pandemic but we reduced our Incidents by probably 80%. Turns out when you get rid of the numerous RDP issues with working out of an AVD farm over WiFi, things are pretty damn stable.
wrootlt@reddit
In our office, some PCs would reach some sort of working condition after the initial BSOD. On one of them, our security team actually successfully wiped the bad sys file using CrowdStrike EDR. And maybe this cloud solution is mostly for physical machines? Our VM servers would crash so quickly I cannot imagine this solution would have enough time to work.
kuahara@reddit (OP)
You can already load the iso onto your pxe server and net boot all your virtual servers to run that. Server remediation can happen en masse. I actually wrote a tool to do it Saturday, tested and confirmed that it works. Our guys at the agency also confirmed it was working. Microsoft released almost the exact same thing Sunday morning.
Microsoft publication: https://techcommunity.microsoft.com/t5/intune-customer-success/new-recovery-tool-to-help-with-crowdstrike-issue-impacting/ba-p/4196959
Direct download link: https://go.microsoft.com/fwlink/?linkid=2280386
I have not updated mine for bitlocker, but Microsoft's already includes that. If you don't use bitlocker and want to use mine, I can PM a google drive link.
I let this go since MS has the trust and bandwidth to distribute this far more efficiently than I can. My tool is 377MB.
JustInflation1@reddit
Why are you working on a Saturday? I hope you’re in for a big raise.
kuahara@reddit (OP)
I was planning on putting out a blog post on the tool and monetizing it with an ad or two, maybe a donation link. Plus just seeing if I can be the first one to do something like that is enough Saturday motivation for me. I was worried that a gazillion people would download it at 377MB a pop and either take the site down or I'd get billed for bandwidth. No need to bother with it now.
Plus trust is a much smaller issue with it coming from MS.
tisti@reddit
Host it as a torrent/magnet link if bandwidth is a concern.
TaiGlobal@reddit
You should still publish it. I haven't read your solution but it will be useful again in the future. Almost every year for the past few years I've run into issues with updates preventing boot or even login. In my last environment we disabled automatic repair for some reason and that caused a boot loop on like 60 desktops; I've seen a bad Citrix scrub tool prevent login, and a bad update that caused AnyConnect to fubar the NIC and prevent network connectivity. We had no local admin, and in all these scenarios we basically had to reimage 20-40 machines in each incident. If I could just PXE boot and uninstall only the offending app, that would have saved a lot of time compared to a complete reimage. Also, people had data saved locally that I'd have to manually recover before I could reimage, so that took up time too.
JustInflation1@reddit
Good man, you’ve got to be thinking about you and yours because you know damn well the boss isn’t going to be
Nuggetdicks@reddit
The fuck?
JustInflation1@reddit
Yeah, you’re right. Let’s all work for free. Surely the boss will notice. Lol how long you been in this field bud?
Nuggetdicks@reddit
How long you been online? Heard the news? You think the company is gonna wait until Monday and “sit it out”?
JustInflation1@reddit
Alrighty bud, go ahead and be the hero. Don't expect anything from the company. You gotta start playing capitalism better or you're gonna end up losing. You never played Monopoly as a kid?
Ok_Fortune6415@reddit
Talk about yourself, but my “being the hero” has gotten me 50-70% bonuses on top of my already very good 6 figure pay.
JustInflation1@reddit
Well, if you're telling the truth, you really have to understand that that's not most people and that's not the American way. At least not for the past 20 years, so statistically you made a bad move. Actually, many bad moves, from your description. And honestly, going beyond statistics, you've probably worked yourself down to a low amount of money per hour. Not to mention all the stress and late nights that you've put into your body. I really hope you don't have any problems later down the line, but this is no way to treat yourself.
Ok_Fortune6415@reddit
I don’t live in America. For where I live, I’m in the top 5% of earners. I don’t think I made many bad moves at all.
JustSomeBadAdvice@reddit
So, just to confirm based upon 6 responses you've given to other people, you have, in fact, been living under a rock since at least Friday.
Kritchsgau@reddit
If not, you're in for a fun Monday; I call that a resume-generating event.
coreycubed@reddit
Have you been living under a rock for the last week?
fivelargespaces@reddit
I used MSFT's tool. It asks if you have bitlocker or not.
wrootlt@reddit
I know. But we have only around 20 Windows servers under my team and i had to fix 15 or so via Safe Mode with Networking. In total probably took me an hour or so.
Soundcloudlover@reddit
That's a huge improvement.
agentfaux@reddit
Such an amateurish company. No idea why anyone uses them.
KikiITgirl@reddit
This would have been nice to know several days ago, after the on-site remediation nightmare, and then remote users who know crap about tech being talked through safe boot and given command line commands for the first time ever… and then learning some of the machines had domain issues as a result and wouldn't take the local admin creds, bricking the remote ones until figuring out a pin-hole reset would let us proceed. Ahh, ugh, meh, but thanks for the experience on how to hustle, and the overtime…
AnomalyNexus@reddit
Breaking it was opt-out, Fixing it is opt-in
NetworkITBro@reddit
China is really getting annoyed with all the data they’ve been missing the last couple of days.. this CrowdStrike rootkit that pumps it all right to them needs to do better!
Slight-Brain6096@reddit
I'd be really interested to see how this works.
Nnyan@reddit
This has been reported since yesterday. It was effective in almost all the remaining endpoints (less effective on WiFi connections). But there were a small number that had to be re-imaged.
Goetia-@reddit
This should've been published within 24 hours. Great news, but just further demonstrates how hard Crowdstrike dropped the ball here.
anna_lynn_fection@reddit
Exactly my thought. Why are we hearing about this option 4 days later?
Doso777@reddit
The one dude that actually knows his stuff was on holiday?
iamamystery20@reddit
Yeah, we got this too, but couldn't understand how this is different from CS updating the file fast enough during the boot loop, so we skipped this option.
Doso777@reddit
This might be slightly quicker since it doesn't need to get a file from the internet. So in theory it could have a better chance to do its thing than the other method.
LForbesIam@reddit
We have a lot of people where you need a float plane to get to them so this will be good if we can get them engaged.
However my question is why did it take 4 days to come up with this?
I am still thinking it isn’t going to work because it bluescreens the second the network stack kicks in but it is worth a try.
JaMMi01202@reddit
Is it just me that's thinking - this is another potential fuckup waiting to happen.
Ok - so now your machines are checking with this proven-low-quality vendor's cloud service for updates on every boot up, so if they ship something broken, AGAIN, everything dies again - unless they ship a fix AND your users (or you) conduct (potentially multiple if on wifi) reboot(s)?
The long-term fix here is to remove this solution from your machines, surely...
8XtmTP3e@reddit
But to remove it, you need booting systems. If this is a quick way to at least get functional, then why not. Doesn’t mean you can’t immediately have Intune, SCCM, GPO or whatever uninstall Crowdstrike
zxyabcuuu@reddit
Which URL and port are needed for the firewall?
martrinex@reddit
This is good news but their botched update file is effectively a virus, you don't have to opt-in to remove other viruses.
GoodCannoli@reddit
They’re just telling people about this now? Not on Friday?
TransporterError@reddit
10,000 man hours later…
musicman76831@reddit
And trillions of dollars in lost revenue. What a fucking joke.
Cupspac@reddit
Doesn't work with TPM 2.0 and UEFI FSs :)
FourEyesAndThighs@reddit
We are seeing very little success with home users that have crappy WiFi. Apparently their internet connections are not getting established quick enough to be effective. Multiple reboots don’t necessarily help.
BOBCADE@reddit
Wow Crowdstrike to the rescue /s
ecar13@reddit
Conspiracy theorists be like:
- Dr. Norton invented the computer virus so he could sell antivirus.
- The same lab that made Covid made the vaccine.
- The same company that brought down 8.5 million computers made the fix.
KnoBreaks@reddit
If you've ever heard the story of John McAfee, it's not that far off 😂
RobertBiddle@reddit
🤨🧐 What I suspect is coming: "Subscribe to our new 'Cloud Remediation™' , for a small price increase we will handle the hard work and save you time the next time we brick yo shit."
PlsChgMe@reddit
But only if you opt-in and agree not to recoup your losses by suing us. . .
ThatThingAtThePlace@reddit
I bet legal departments would be very interested in what the opt-in agreement states. I wouldn't be shocked to see a clause that states you release crowdstrike from any past or future liability they may have for damages caused by the initial outage or their remote remediation.
tom-slacker@reddit
Huge if true.
Gargantuan if factual.
Titanic if non-fiction
eNomineZerum@reddit
I'd like to opt in, but me and three other folks are locked out of the CrowdStrike support so emailing our TAM is all we can do. Very frustrating as we are a large environment with multiple CIDs and parent-view reporting is broken as well.
I'd be understanding if the 291* file was the extent of the issues, but all the subsequent burden we are dealing with is ridiculous.
It's funny because all I keep getting back is "known issue" when I raise these issues.
alphex@reddit
I’m not on sys admin side of things. But a client of mine today said their IT group was telling everyone to reboot 15 times. That explains it.
Banoo13@reddit
So much power to a Ukrainian
iknowyerbad@reddit
So this whole thing was a ploy to get people to use cloud remediation? 🤣🤣
MadDawgThaKing@reddit
Our systems stopped boot looping about an hour and a half after this incident. Does that mean we had the auto update enabled and it self remediated?
hankhillnsfw@reddit
No reason why this took 3 days to figure out.
jedipiper@reddit
Why they wouldn't just remediate everyone immediately is ridiculous. There's no reason for them to not pull it back immediately and then once the dust has settled, look into sending it back out, fixed this time.
barkingcat@reddit
cause they want to charge money for it, and also for you to notice how nice they're being (provided you sign up for the renewal)
amcannally@reddit
Somebody please tell this man about iDRAC…
littlejob@reddit
Can confirm... had over 50k endpoints start to phone back home within a few hours.
CS also updated a few dashboards in the SIEM component of the tool. You can now easily identify the assets that received the flawed channel file and have not phoned back home since. Given there was a smaller subset of users traveling... it was rather accurate.
StaticR0ute@reddit
This is like 3 days late and shouldn't have been opt-in only.
Even if you turn this on though, what if you have NAC/ISE on your switches at the access layer? Wouldn't that likely prevent CrowdStrike from communicating before the blue screen anyway?
Cmonlightmyire@reddit
What the fuck race condition is going on in their product.
whasf@reddit
We turned that on for our tenant and weren't having much luck with it. It's better to just use the Microsoft tool
lzwzli@reddit
So make a virus to fix an antivirus... Full circle
HJForsythe@reddit
Yeah, this won't work in a lot of cases as the system will crash before the update happens, but at least they are trying.
SavagePeaches@reddit
Sucks there's nothing public stating this as of right now. I'm frontline at my workplace (so as low on the totem pole as can be) and I'd love to tell them about this but I know I'd be asked for a source.
Precision20@reddit
That would be great if I hadn't already gone through all our endpoints. If it happens in the future, great, or for BitLockered computers (if this works on them). But I feel like they took too long to get to this point.
CharcoalGreyWolf@reddit
Something something horse has escaped lock the barn door something something
boftr@reddit
Not knowing anything about CS. The work the driver is doing to load the bad sys file must be quite late in the driver’s startup to allow a user mode process time to reach out and download an ‘update’, be it a file or a cloud lookup to cache some data about the bad update sys file. I can imagine that once the ‘update data’ is fetched, it can configure the same driver to block the data file in question most likely for the next boot.
coolvibes-007@reddit
Does not work in a virtual environment such as VMware.
NotAFakeName59@reddit
Yeah I've read about that "reboot it 15 times" workaround. Problem is it still requires someone to visit each machine, and that's not even factoring in wifi.
PetieG26@reddit
What? This whole thing could've been avoided, or at least not been so full blown? This is crazy talk - why wasn't this made public Friday?
cvsysadmin@reddit
Note that when you're looking at your dashboard you'll probably find computers that were already updated and are working. The bad sys file was left on the computer and replaced with a good one. The quarantine will snag the old one even if a new one is present and the computer has been working fine.
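If you want to spot those leftovers yourself, here's a hedged PowerShell sketch that lists whatever 291 channel files a host still has, with timestamps; the specific cutoff times below are the ones from CrowdStrike's public advisory at the time, so treat them as an assumption and check the current guidance:

    # Per the advisory at the time, the bad copy carried a 2024-07-19 04:09 UTC timestamp,
    # and the fixed one 05:27 UTC or later.
    Get-ChildItem 'C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys' |
        Select-Object Name, @{n='UtcWritten'; e={ $_.LastWriteTimeUtc }}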
Bajiri@reddit
It should be noted that this is incredibly unreliable on wifi. Pretty good chances of it working if wired, but very spotty the more latency you introduce. We also met with crwd, and they basically said it wasn't really possible for most wireless connections.
ecar13@reddit
Microsoft just released a bootable usb that will boot your computer into preboot environment and automatically delete the offending sys file. Simple but I like that they did this. Probably tired or getting blamed for this nightmare. But for a large number of workstations / servers where even this process is cumbersome, the automated solution from CrowdStrike seems promising.
CuriouslyContrasted@reddit
The issue now is the number of servers and workstations that are trashed due to constant BSOD's. Those that just require the file removed are long remediated.
djsyndr0me@reddit
If you're almost 96 hours in and still have broken systems they weren't that critical to begin with.
The_Gadgeteer@reddit
We had remote users that went through 20+ reboots to restore their laptops. It was just a few that were impacted during the bad signature window, fortunately.
purefire@reddit
Did this earlier today, had a positive impact but not a silver bullet. Wired network is much more likely than wireless
flatvaaskaas@reddit
Anyone got more information about how this works? Last weekend it was known that an ethernet connection and rebooting multiple times could help,
but what exactly does this cloud remediation do?
cvsysadmin@reddit
It's leveraging CrowdStrike's quarantine system to quarantine its own bad sys file and can do so just a hair faster than it can push the update to fix it permanently. Just fast enough to beat the bsod/reboot which is all that's needed. Once the file is quarantined and the computer is running properly it can download the actual update.
kuahara@reddit (OP)
Did you read beyond the title?
Wuss912@reddit
so they won't push the update globally without making you jump through hoops?
Michagogo@reddit
The update was pushed within a couple hours, which is why the race was already a thing. This isn’t an update, IIUC this is using a different mechanism entirely (which acts with higher priority than the updates, so a better chance to win the race) for a purpose other than what it’s originally intended for, which might be why they don’t feel comfortable pushing it out unilaterally. Not that I think that makes sense, but that’s the most likely reasoning I can think of.
donkeydickerson@reddit
Anyone else get goosebumps 🤗
Science_Fair@reddit
Worked for about 15 percent of our environment today. We had tried something similar with machine startup scripts and that also worked about 15 percent of the time.
Crowdstrike would do it for you, but you could also do it to your own tenant. Just tell CS to quarantine the offending .SYS file. Might need to turn off tamper protection temporarily.
PessimisticProphet@reddit
WTF? They should have pushed that out immediately the second they thought of it.
SimonGn@reddit
Well, announced it. They should test it so it's not made even worse.
JayFromIT@reddit
Joke (not sure/making this up): they need you to "opt in" because you waive your rights to sue them.
TransporterError@reddit
Prolly, no joke…
Catball-Fun@reddit
How do they make sure they are one of the first kernel extensions to be loaded?
Dracozirion@reddit
https://learn.microsoft.com/en-us/windows-hardware/drivers/install/early-launch-antimalware
Outrageous_Device557@reddit
So this error hit after Windows loaded the network stack and grabbed an IP?
l0st1nP4r4d1ce@reddit
I'd be curious about the language included in the opt in. Does it limit the liability for CS?
JetreL@reddit
https://petri.com/microsoft-crowdstrike-recovery-tool-windows/
PlannedObsolescence_@reddit
Microsoft's tool still uses the approach that requires manual intervention (USB or PXE booting a device), which is relatively complex. Sure, easier than walking someone through deleting a file in system32, absolutely - but all those approaches get more awkward when the endpoint uses Bitlocker, so now 48-digit recovery codes need to be retrieved and shared, etc.
There is a clear win here if it's possible to take the 'reboot many times, maybe even 15' approach (which is not guaranteed to work, of course) and turn it into 'reboot a few times with ethernet and there's a good chance you'll be sorted'.
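On the recovery-code retrieval mentioned above: if keys are escrowed to AD, something like this can pre-stage them before touching machines (hedged sketch; requires the RSAT ActiveDirectory module, read rights on the msFVE-RecoveryInformation objects, and the computer name is a placeholder):

    # Pull the escrowed BitLocker recovery password(s) stored under a computer object in AD
    Import-Module ActiveDirectory
    $computer = Get-ADComputer -Identity 'PC-0123'
    Get-ADObject -SearchBase $computer.DistinguishedName -Filter 'objectClass -eq "msFVE-RecoveryInformation"' -Properties 'msFVE-RecoveryPassword' |
        Select-Object Name, 'msFVE-RecoveryPassword'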
JetreL@reddit
Yeah, luckily I don't have this issue; just trying to share some other ideas.
BoBTheCornCob_@reddit
We implemented it, slow but supposedly gaining numbers.
itwaht@reddit
This is the most promising resolution I've seen so far... Time to buy Crowdstrike stock?
CFOMaterial@reddit
I have a question. I am not a system admin, but somehow my personal computer, an HP all-in-one, got this bug. It's not even connected to a work network, just a regular personal computer. I am guessing HP has some spying app stuff built into their computers "for the customer" and that is why this computer was impacted by the CrowdStrike issue. I cannot get into safe mode even after entering the BitLocker code, and I cannot seem to modify the BIOS hard drive type, which seemed to be the last solution I saw on this subreddit if you can't get into safe mode. Any clue what else I can do?
mgdmw@reddit
Does your personal computer, that is not connected to a work network, actually have CrowdStrike installed? It is an enterprise anti-virus (etc). Unless you actually have CrowdStrike on your computer, your problem is unrelated and is something else and the fact it occurred at the same time as CrowdStrike’s issue is entirely coincidental.
CFOMaterial@reddit
It is not connected to a work network, and we never installed CrowdStrike, but it literally happened Thursday night around 9 or 10 PM EST: downloaded an update, installed, and instant BSOD on reboot. I feel like the coincidence is too strong. What are the odds of that happening to a 2-year-old, barely used computer during an update, at the same time?
mgdmw@reddit
How do you propose a problem affecting CrowdStrike affected your computer without it actually having CrowdStrike installed?
CFOMaterial@reddit
That is the question. My only guess is HP uses crowdstrike somehow to help with their remote connect features to troubleshoot customers PCs. Otherwise, its the biggest coincidence in the world.
mgdmw@reddit
Biggest coincidence in the world it is. HP is not paying CrowdStrike to be on customer computers. It’s a monthly fee per PC. It’s not a free product.
CFOMaterial@reddit
Ok, crazy situation then. Thanks
nullbyte420@reddit
Love your username lmao
CFOMaterial@reddit
Thanks
kuahara@reddit (OP)
If you have access to another computer, you can download the tool from Microsoft to automate the removal. You'll need a thumb drive and the ability to enter the boot menu on your affected computer.
CFOMaterial@reddit
I am able to get to the cmd screen, and couldn't find the file. Which makes sense in that it is a personal computer that never had CrowdStrike installed on it, but then why would it happen at the exact same time this happened?
GMginger@reddit
Simply a coincidence, no more than that. You may get more help if you post to /r/techsupport and mention the issue pointing out that you don't have CS.
CFOMaterial@reddit
Thanks for the advice.
VegaNovus@reddit
Coincidence. Completely unrelated.
stkyrice@reddit
Wasn't this on their blog a couple days ago?
StPaddy81@reddit
We were emailed as a customer mid-day Sunday that this was an available option
reegz@reddit
Yeah every researcher has been looking at it since Friday.
StPaddy81@reddit
We opted in and it seems to be doing its thing. I did notice that some hosts that were not blue screening are showing up as having that particular file quarantined, I’m assuming they do it by sha256 hash and not file name, so I’m wondering why some of these machines were not blue screening if they had the affected channel update file on them.
I reached back out to support for more info.
KaitRaven@reddit
The old version of the 291 channel file is not automatically removed when devices get the update via the normal process; it's just superseded and remains in the folder.
VegaNovus@reddit
It's not done by hash; the hash varies customer-by-customer.
StPaddy81@reddit
Every single file being quarantined so far (75+ instances) has the same sha256 hash. Not sure how else they would do it...
And this is opt-in, so when they created the rules for my CID(s), I'd assume they created the quarantine rules based on the hash for me as a customer.
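If you want to sanity-check that yourself, a one-liner to compute the hashes of whatever 291 channel files a host still has on disk (standard install path assumed; whether CS actually keys the quarantine off this hash is their internal detail, not something visible from outside):

    # Compute SHA256 for any 291 channel files still present, to compare against the quarantine entries
    Get-FileHash -Algorithm SHA256 -Path 'C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys' |
        Select-Object Hash, Path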
Doublestack00@reddit
What if bit locker is enabled?
RideZeLitenin@reddit
Would be nice if it bypassed BitLocker, but I have a feeling the recovery code may be needed to boot into C: for a bit, and then hopefully it fixes itself, removing the need for CMD removal in Windows Recovery... hopefully.
KaitRaven@reddit
This remediation happens during the normal boot process. The drive is already unlocked at this point.
peoplepersonmanguy@reddit
If windows is loading bitlocker is already passed.
Dracozirion@reddit
By the time the boot-start driver loads, the disk is already unlocked. Should make no difference in this case.
TrueStoriesIpromise@reddit
You'd have to type in the password, per normal, but then hopefully the update hits before the BSOD hits.
hiroshima_fish@reddit
Commenting for a response to this as well.
VegaNovus@reddit
You'd just need to deal with this the same way you would if a remote user locked their laptop and it got stuck at the bitlocker screen.
All this method needs is a normal boot (not recovery, not safe) and then to win a race condition.
e0m1@reddit
I personally tried this and like 10 or so boot attempts, too many variables. I can't just keep rebooting and hoping. I hate you crowdstrike, you literally ruined my weekend. I was a huge advocate.
cowprince@reddit
The opt-in method with the reboot is different though.
Before it was 100% luck. Now it's just 50% luck.
crankyinfosec@reddit
The fact this wasn't automatically opted in on Friday for all impacted customers is insane. I appreciate the solution 3 days late, but this is the final nail in the coffin that is making us move away from CrowdStrike. A chunk of our laptop fleet doesn't have an onboard NIC; guess it doesn't work nearly as well in that situation. A ton of them are still fucked even after several reboots.
node808@reddit
Took us about 4 hours last Friday morning to get everything running again, but we're small with only 750 workstations and about 65 vm's.
Top_Outlandishness54@reddit
I want them to release how this happened. I would bet money it was an outsourced contractor that caused it all.
Cley_Faye@reddit
Well, that's nice. That also cements them even more as a prime target for an actual cyberattack if they're able to do that so early.
Asymmetric_Warfare@reddit
Just did this in our tenant to remediate several hundred devices both physical and VM’s with success.
bjc1960@reddit
Thank you to the OP. Odd this has not been communicated more widely by all the experts.
UncleGrimm@reddit
They’re keeping quiet in the press for the most part but they’ve been reaching out directly pretty often. Hearing whispers that CS may foot the bill to send out some additional help for the remaining machines that this doesn’t remediate
illicITparameters@reddit
Good looks. Passed this on to some colleagues at other orgs.
Bro-Science@reddit
Microsoft also released their own automated tool to fix it: https://techcommunity.microsoft.com/t5/intune-customer-success/new-recovery-tool-to-help-with-crowdstrike-issue-impacting/ba-p/4196959
kuahara@reddit (OP)
I released a similar tool a day ahead of MS. This is still good for when you need to remediate manually, but the cloud solution is going to be far more efficient.
Burnerd2023@reddit
Nice