We Forgot 500 Servers Were Still Using the Old Proxy
Posted by Leather_Meat939@reddit | talesfromtechsupport
A while ago, back when I worked in healthcare IT, we had just finished moving all of our workstations to a new internet proxy solution, ZScaler.
Everything was working perfectly, but then a DC alert for a hardware failure on our old proxy appliance came through, and it caused a bit of a panic.
Somebody asked the obvious question:
What’s still using this old appliance?
Our head of IT answered: all 500 of our servers.
How I got pulled into this.
The Ops manager came up to me and asked for a “quick chat”.
What’s your workload like at the moment?
Uhh, I’ve got a few things going on but I can make room if you need me.
I’ve got a bit of a project you might be interested in, and I think you’re a good fit for it.
Oh, what’s it about?
Basically, we need to migrate all our servers from our on-prem proxy appliance over to ZScaler.
We’ll set up some meetings with the Web Security team, projects and IT leadership to help you get started.
I reckon this will be a good challenge for you.
Sounds good! I’ll start looking into this and get an idea of what’s involved.
The Timeline
I knew this was not going to be a simple task.
We had about 500 different servers in scope for this migration, mostly Server 2008 and 2012 VMs with a ton of proprietary medical applications running on them.
There was a lot of pressure to get this right, to minimize clinical impact and do things properly.
I’d been in this role/team for a year, and while it was technically a Level 2 team, it was still my first IT role, so I hadn’t expected to be doing project work like this, especially so soon.
Still, I was feeling confident, and figured if nothing else, it would be a good learning experience and a chance to get recognised.
Where do I even start?
There was no existing plan. No documentation. Nobody in the company had done this before; I was on my own.
I built an Asana page to track my work and set some tasks to get me started.
- How is the current proxy configured?
- Where are those settings coming from?
- What actually needs internet access?
- What’s been allowed historically? (And does it still make sense?)
Introductions
We had an intro session with all the key teams: infrastructure management, web security, and other leadership.
These are the people that will work with you to get this over the line. Get to know them and reach out if you need anything.
I asked for some access to Zscaler and the existing proxy appliance so I could actually see what I was working with.
We manage those systems, so we can’t give you access.
Fair enough.
But we can give you proxy logs via Splunk.
Not ideal, but better than nothing.
At that point, it became clear my role wasn’t just technical.
I was here to:
- Coordinate multiple teams
- Audit the environment and analyse logs/data
- Design the approach and implementation
- Obtain approvals through Cyber/GRC
- Report progress weekly to leadership
Somewhere along the way, I’d effectively become a project manager: I was running everything, assigning work to other resources and planning overtime work for the whole team.
How were things set up at the time?
Well, it wasn’t great.
We had an old Bluecoat proxy appliance which was:
- Long out of support
- Running on failing hardware
- Somehow still critical to everything
Every server pointed to this appliance but used:
- Various different DNS aliases
- Sometimes direct IPs
- Hardcoded settings within all of our medical apps
The main source of these settings was GPOs, which would set the user proxy globally and override any changes.
That wasn’t all, however: some apps pulled down their own central configs from network shares or control servers, and those needed to be updated too.
Company policy also required that admin accounts never have internet access. On Bluecoat this was enforced through SSO authentication, so an equivalent solution would have to be found for ZScaler too.
Server types
Most clinics had three main server types, and we had ~150 of each, all onsite at the clinics:
- VMREMOTE – jump box for remote users, plus file/print.
- VMDB – database server, highly restricted.
- VMAPP – application server running a mix of medical software.
We also had plenty of legacy VMs still used for accessing old databases, and all our hypervisors to look into.
Testing Begins
The first thing I needed to do was get control of the GPO that was setting the user proxy, and talk to the app support team about all the medical apps and their settings.
I spoke to our server guys and got an exclusion group set up, which allowed me to start testing and documenting what needed fixing.
I spun up a test clinic in the IT office, but quickly realised that the ZScaler “whitelist” needed a lot of work.
VMREMOTE servers couldn’t even browse the internet, which meant doctors working from home would be completely lost.
Our users use VMREMOTE the same way they use their workstations, so the internet access on them needed to align with desktops. I knew this would be a battle for approval.
For the other servers, I pulled a month of proxy logs from Splunk and started filtering traffic by server group.
The goal was simple:
- Identify destinations.
- Map destinations to applications/services.
- Justify why they should be allowed.
I’d end up with a huge report, and from there I would comb through each entry, figure out what app/service it linked to, and then find a way to justify it being allowed.
There were a lot of funny ones. I remember we had requests going to icanhazip, which didn’t exactly look great when presenting it to Cyber, even if it was really just a medical app doing its thing.
I couldn’t prove blocking it would break anything, but if/when it did I’d have to be there to fix it.
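For anyone attempting a similar review, here is a minimal Python sketch of the grouping step described above. It is not what I actually ran: it assumes the Splunk results were exported to CSV with hypothetical column names `src_host` and `dest_host`, and that the server group can be read from the hostname prefix. The sample hostnames and destinations are made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Hostname prefixes assumed to identify each server group (hypothetical naming scheme).
SERVER_GROUPS = ("VMREMOTE", "VMDB", "VMAPP")


def group_for(hostname):
    """Map a source hostname to a server group by prefix; anything else is OTHER."""
    upper = hostname.upper()
    for prefix in SERVER_GROUPS:
        if upper.startswith(prefix):
            return prefix
    return "OTHER"


def destinations_by_group(csv_text):
    """Count requests per destination for each server group.

    Expects CSV text with the assumed columns `src_host` and `dest_host`.
    """
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[group_for(row["src_host"])][row["dest_host"].lower()] += 1
    return dict(counts)


# Made-up sample rows standing in for a real log export.
SAMPLE_EXPORT = """src_host,dest_host
VMAPP-CLINIC01,updates.vendor.example
VMAPP-CLINIC02,updates.vendor.example
VMAPP-CLINIC01,icanhazip.com
VMREMOTE-CLINIC01,www.example.org
"""

if __name__ == "__main__":
    for group, counter in sorted(destinations_by_group(SAMPLE_EXPORT).items()):
        print(group)
        for dest, hits in counter.most_common():
            print(f"  {hits:>5}  {dest}")
```

The output is one destination list per server group, sorted by hit count, which is roughly the shape an approval request per server type needs.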
I drafted three long approval requests, one for each of our server types, submitted them, and let the chaos begin.
The approval meeting
After I submitted my approval request, I got sent a meeting invite to discuss the approval, but this meeting included:
- Cyber
- GRC
- IT leadership
- Executives from the parent company
There were multiple CTOs and IT executives in here. This was a big meeting.
I presented my case, describing what I was trying to achieve and my justification for the request. Then one of the managers turned to the head of IT for my company (let’s call him Bob) and started going wild.
Bob, why do so many of these servers have multiple roles?!?
Is there a genuine need for workstation-level Internet access on VMREMOTEs???
When are these boxes being upgraded to Server 2019/2022?
At some point I stopped presenting and just watched our head of IT get absolutely grilled.
Eventually, we got what we needed: conditional approval (with a lot of follow-up work attached).
New Problems
Seeing how crazy and difficult this was to get approved, I started to think about it more from an attacker’s perspective, and a few things came to mind.
- What web browsers are we using on these servers?
- How are we going to block admin users from Internet Access?
- What if someone downloaded an executable?
I started with the browser question. There was a mix of different browsers installed, so I ended up chatting with IT management, and the decision was “Chrome”, even though it was technically unsupported on Server 2008 at this point.
Another big problem stood out immediately when I was testing the access that was implemented.
Bluecoat blocked executable downloads, Zscaler didn’t – not without SSL inspection, which “wasn’t on our roadmap right now”.
I ended up building an AppLocker policy to block execution of anything in downloads and user folders, and a Software Restriction Policy for the old machines that didn’t support AppLocker.
Pilot time.
With approvals in place, we moved into pilot.
I picked a smaller site, cut them over after-hours, then tested everything.
There was a lot of trial and error:
- GPO changes took forever (klist purge helped here)
- Apps broke in annoying ways (throwing random errors)
- Vendors didn’t like our ZScaler IPs (they had to whitelist us)
- Some apps were still using the old proxy (and I didn’t know how)
It turned out that some apps had proxy configs hidden in strange, undocumented places, like random XML files.
All in all though, the key components worked. I was able to cut over the server, identify things to work on the next day, then cut it back.
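Hunting for those hidden configs is the kind of thing a small script can help with. Below is a rough Python sketch, not the method I used at the time: it walks a directory tree and reports every line in a config-like file that mentions the old proxy. The proxy aliases, the IP address and the list of file extensions are all placeholder assumptions.

```python
import tempfile
from pathlib import Path

# Placeholder identifiers for the old proxy (DNS aliases and a direct IP).
# Keep these lowercase, because lines are lowercased before matching.
OLD_PROXY_MARKERS = ("oldproxy.corp.example", "proxy-legacy", "10.0.0.50")
# File types worth checking; an assumption, extend as needed.
CONFIG_SUFFIXES = {".xml", ".config", ".ini", ".json"}


def find_proxy_references(root):
    """Return (path, line number, line) for each config line mentioning an old proxy marker."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in CONFIG_SUFFIXES:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file (locked, permissions); skip it
        for number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            if any(marker in lowered for marker in OLD_PROXY_MARKERS):
                hits.append((path, number, line.strip()))
    return hits


def demo():
    """Build a throwaway directory containing one stale config, then scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        app_dir = Path(tmp) / "app"
        app_dir.mkdir()
        (app_dir / "settings.xml").write_text(
            '<settings>\n  <proxy host="OldProxy.corp.example" port="8080"/>\n</settings>\n',
            encoding="utf-8",
        )
        # Not a config suffix, so this file is ignored by the scan.
        (app_dir / "readme.txt").write_text("proxy-legacy\n", encoding="utf-8")
        return [(p.name, n, text) for p, n, text in find_proxy_references(tmp)]


if __name__ == "__main__":
    for name, number, text in demo():
        print(f"{name}:{number}: {text}")
```

A plain substring match like this will miss settings stored in the registry or in binary formats, so it only narrows the search; it does not replace testing each app.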
Production Rollout
I was well aware that a lot of what I was doing could have been heavily automated, but at least to start with, I wanted to be very involved and make sure that things worked the way that a user would actually do them.
For a lot of the medical apps we needed to log in to the machine interactively, open the app from the tray, enter the app username and password, dismiss an update prompt, navigate to the proxy settings, clear them, ensure the app worked, and monitor the logs.
There were a lot of edge cases that came up:
- We had some servers that refused to save the new proxy settings, and ended up having corrupted registries (fixed by importing from another server).
- There were some sneaky hosts file entries causing hits to the old proxy that were not actually real traffic.
- We had two clinics on non-SOE setups which had different software stacks or attempts to further lock down the proxy settings, which were fun to figure out.
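The hosts file case in particular is easy to check for in bulk. Here is a hedged Python sketch, assuming you already know the old proxy's names and IP (the values below are placeholders): it parses hosts file text and flags any entry that pins a name to the old proxy's IP or redefines one of its aliases.

```python
# Placeholder values for the old proxy; substitute the real aliases and IP.
OLD_PROXY_NAMES = {"oldproxy.corp.example", "proxy-legacy"}
OLD_PROXY_IPS = {"10.0.0.50"}


def stale_hosts_entries(hosts_text):
    """Return (ip, [names]) for hosts entries that reference the old proxy.

    An entry is flagged if its IP is the old proxy's IP, or if any of its
    hostnames is one of the old proxy's aliases. Comments are ignored.
    """
    stale = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        ip, *names = line.split()
        lowered = [name.lower() for name in names]
        if ip in OLD_PROXY_IPS or OLD_PROXY_NAMES.intersection(lowered):
            stale.append((ip, lowered))
    return stale


# Made-up hosts file content for illustration.
SAMPLE_HOSTS = """\
# sample hosts file
127.0.0.1       localhost
10.0.0.50       vendor-update.example   # pinned to the old proxy IP
192.168.1.20    Proxy-Legacy
"""

if __name__ == "__main__":
    for ip, names in stale_hosts_entries(SAMPLE_HOSTS):
        print(ip, " ".join(names))
```

On Windows the file lives at `C:\Windows\System32\drivers\etc\hosts`, so its contents can be read per server and passed to `stale_hosts_entries`.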
The Prod rollout gave me an opportunity to log on to every server in the company, and I started noticing small things, like broken Windows Activation.
At one point I made a shocking discovery: an entire server fleet (50+ VMs, part of one of our business units) did not have Carbon Black (our EDR software) installed at all. Let me know in the comments if you want a post on the aftermath of this one!
All in all though, I worked through all the ZScaler cutovers after-hours, and a few months in I was done. Everything was working.
Confirming my success.
Not long after finalising my rollout, I requested all our server IPs be blacklisted from Bluecoat.
This was a simple, reliable, easy-to-revert way to confirm that we had completely cut over our fleet and that there was nothing we had missed outside of the logs.
That got implemented without any issues. We tested things and made sure all our apps functioned after the block.
Awesome, we’ve now finished our transition and nothing went wrong, right?
The “Cleanup” Incident
Remember that exclusion group for the GPO that I had made so that we could manage the rollout smoothly?
Well, it was time for that to be retired, so I put in a ticket to the server guys asking them to make the behaviour of this group the default for our OU and remove the group afterwards.
Except… they didn’t do that.
They deleted the group, didn’t implement anything to make the desired setting default, and then deleted all the policies setting the old proxy settings (or so they thought).
Turns out there was an oversight and the old policy was still applying because of loopback processing.
So our servers all started reverting back to the old proxy server, which was now blocking requests, and we had an influx of tickets come in for all sorts of issues.
Our ops manager was fuming, and I was not happy either.
Undoing this mistake took another 2 weeks, and we had to actually reboot a fair few of our servers to get them back on the desired settings. Not cool.
Lessons Learned
In the end, the project went pretty well. It was honestly a good opportunity to get hands-on with our environment and more involved with the business.
If I were doing this again, I would spend a lot longer working with Splunk data and traffic logs, creating better reports for each of my weekly update meetings (there was a lot of good client data).
In the end though, I learned tons about Windows Networking, project management, medical applications (and how frustrating they can be to configure), and how to use monitoring tools like Splunk to get the data I want out of log dumps.
I hope you enjoyed the read!
Cheers,
Ha-Funny-Boy@reddit
This one made me think of a medical company I worked at, which was about as screwed up as this one.
By any chance was it KPIT?
Jedasis@reddit
Oh my god. The company I work for just moved away from ZScaler recently, and it's a pain in the ass to remove. We had to manually uninstall it from like 200+ machines. It's been about a year but every now and then some rogue machine we missed has an issue because of it.
Tight-Gain-3308@reddit
yeah, we did the same thing with an old Squid box back in 2016. took us a week to realize no one had documented half the static routes pointing to it. just a mess of DNS failures.
ramdomvariableX@reddit
Nice work OP, how soon after this did you leave the company? :)
justReading0f@reddit
Well I actually did enjoy reading this, thanks!
Though, I’m not in tech so lots of the terminology is unknown to me, but I know just enough to be able to walk alongside and get all the people-not-having-done-what-you-expect parts, and just enough about projects from the underside to be able to enjoy the untangling and final ironing out.
👍🏼
DiodeInc@reddit
Which terms? I can help define them if you want :F
justReading0f@reddit
Lol I’m glad you put a smile at the end! A LOT of them 😅
DiodeInc@reddit
Give as many or as few as you want, I've got time 😁 or don't, no problem!
justReading0f@reddit
Aw thanks, but I think I’ll just stay with my level of general understanding; it’s like my comprehension of physics; I enjoy more of the concepts than the actual math work. :)
DiodeInc@reddit
No problem :)
cbelt3@reddit
Yikes! Somewhere I expected power to be a bunch of daisy chained extension cords…
meitemark@reddit
I see what you did wrong. Deletion/removal of group/app should not be included in a change request. The correct one there would be first change, test for working, then removal. Remember, these are server guys and that is only a few steps above drooling endusers.
Mad-Maxwell@reddit
>What’s your workload like at the moment?
Is something you never want to hear in your workplace.
dreniarb@reddit
So many individuals and/or departments relegated to specific tasks - it's a large org and this makes sense. Also makes me appreciate being a small org and not having so much red tape to go through to get things done.
Then again - maybe in some ways it would be nice to have the red tape.
johnlooksscared@reddit
Is this a job application ....???
GovernmentOpening254@reddit
Definitely a LinkedIn post.
Leather_Meat939@reddit (OP)
You reckon it's LinkedIn safe?
I'm hesitant to post stuff like this when the company I previously worked at is so easy to see on that platform.
GovernmentOpening254@reddit
Maybe get permission (written) to share it, but it was obviously a LOT of work and coordination and problem solving. You’re apt to get job offers.
Geminii27@reddit
"...sucker!"
vaildin@reddit
I read that line as "no one else wants this job, and you're too low on seniority to say no."
skooterz@reddit
Voluntold.
4rd_Prefect@reddit
Picking up a difficult job and kicking ass is a great way of showing your worth & gaining experience for future roles (internal or external)! Well done 👍👍
Pickup_Man77@reddit
Oof
Leather_Meat939@reddit (OP)
Hope you all enjoy!
If you want to read this with images the original post I wrote is here.
I'm not allowed to link in the original post sadly.
GSpider78@reddit
Great read. Take my updoot for plugging Splunk!!
FoursGirl@reddit
....and when you plug Splunk, remember that Cribl will make your Splunk data more manageable! Yes, Splunk & Cribl, for all your data needs!
(Imagining an old-timey radio announcer)
(I think I need more sleep......)
GSpider78@reddit
Ahh you mean the Cribl that literally stole Splunk code, lost the court battle, and got away with a $1 penalty? That Cribl?
RexCanisFL@reddit
Is the Carbon Black story on there too?
Leather_Meat939@reddit (OP)
Will be once I write it lol
ccarlen1@reddit
The images make it even more fun. And the article on your website about the suckiness of ServiceNow is a great read too.
Final-Excuse-7236@reddit
I am noob level and I love reading projects and fixes 4 levels above my brain. Esp when it's well written! Well done wise master.
PC-NerdxD@reddit
Odd question maybe, but why do you have what I assume to be just a Windows server on VMREMOTE for the users to use as workstations instead of a proper VDI infrastructure?
Leather_Meat939@reddit (OP)
Good question!
One of our other business units was actually fully VDI, so we did have some infrastructure for it.
The part of the business I talked about in this post though did actually have VDI trialled with them and I'm told it "went horribly".
It also likely came down to cost: we already had these servers on-prem, running other roles, so it didn't hurt to add RDS on top.
rolltied@reddit
The fact that it's healthcare makes this 1000 times more difficult than normal too. Well done.
Throwaway_Old_Guy@reddit
I am not IT
I do get the feeling that much of what they grilled Bob on was a direct result of Executive decisions which often tend to be myopic and me-centric, focusing on whatever makes themselves appear "effective" for bonus purposes.
I hope OP received financial recognition and some well deserved down-time.
RedPhalcon@reddit
100%. When I came on board we had significant tech debt because the IT manager, who worked his way up from the mailroom with a business degree, just couldn't explain what he needed to the C-suite. So instead he kept all purchases under $5000, as those didn't need approval.
The C-levels finally figured out what was going on after the manager saw the writing on the wall and jumped ship. We spent the next approx 5 years shoring up every aspect of the business and are now best in class in most of our technologies.
To let you know how bad it was when we started, our backups were on the same storage array as prod, and our Domain Superuser's password was known by everyone, including non-IT, and hadn't been changed in 16 years.
momofeveryone5@reddit
Good Lord this raised my blood pressure! In the post you know a shit show is coming when a higher up says "you'd be perfect for this role!"
SnavlerAce@reddit
Great read.
tepancalli@reddit
Dear God... I got stressed out of reading this. It was a good challenge, congrats on getting such a project done
GovernmentOpening254@reddit
Was gonna say, s/he got a massive amount of “experience…” with stress.
prolapsedbrain@reddit
AI slop
ForcedChangeling@reddit
How long after the meeting was the ‘Head of IT’ replaced?
harrywwc@reddit
I'm not sure "enjoyed" is quite right - blood pressure was certainly elevated for a bit there ;/
also…
re: "Carbon Black" - hell yes! (in a new topic though ;')
dheffe01@reddit
Oh man reading that made me anxious
generic-David@reddit
Holy shit!