3 DCs, everything is going to shit. DNS failing, authentication is effed. Please help!
Posted by Whyd0Iboth3r@reddit | sysadmin | 206 comments
I'm not a "System Admin", but a PACS Admin. Our system admin is really a junior. He is doing his best, but not making much progress. We have 3 DCs: 6 (main DNS server), 7 (DNS), and 8 (DHCP server, DNS). 8 was/is our PDC. It all started with 8 acting up. It didn't seem to be syncing with the other DCs. Admin tried everything he could find related to our problems, but nothing resolved them. After a few hours, we decided our best bet was to restore 8 from a backup from about a month ago, when we know it was behaving. Well, it all went to shit. Users are getting login errors, LDAP related, and DNS is failing all over the place. We are at a loss. We don't know where to go, where to look, what commands to run, or which Event Viewer logs to look through. Please, any help would be greatly appreciated! I'll post more logs, events, etc. as we find them and think they are related.
One Warning event in Event Viewer is the following.
The Security System has detected a downgrade attempt when contacting the 3-part SPN
ldap/DC7.domain.com/domain.com@DOMAIN.COM
with error code " (0xc000005e)". Authentication was denied.
TheDawiWhisperer@reddit
unrelated to your problem but what is a PACS Admin?
Whyd0Iboth3r@reddit (OP)
PACS Admin = Picture Archiving and Communication System.
A system that stores, retrieves, and distributes medical images and patient information. PACS acts as a digital library for medical professionals to access and review images, and it's often integrated with RISs and EMRs.
Primary_Program_7325@reddit
A PACS (Picture Archiving and Communication System) Admin is a person who manages hospital imaging systems (think X-rays, CT scans, ultrasound). These can be very simple or vastly complex depending on the size of the org.
Mindless-Rub-4953@reddit
What is DC abbreviation?
Whyd0Iboth3r@reddit (OP)
Domain Controller.
DowntownOil6232@reddit
Domain Controller
TackleSpirited1418@reddit
I am guessing the OP has 127.0.0.1 as primary DNS server on their DCs … I see this often, but it is completely wrong. Always use another DC as primary DNS …
Whyd0Iboth3r@reddit (OP)
We don't actually, but good call.
jeffwadsworth@reddit
For future reference, set up a test environment of at least 2 DC and practice restoring them after deleting objects, etc. Use MS backup GUI and the command prompt methods to get familiar with the process. Essential to know this procedure. https://youtu.be/UMUGd23hvbk?si=sl7cs4w-TM3bZyfx
Whyd0Iboth3r@reddit (OP)
Excellent Idea. We have some test servers already installed. Just need to add roles and such. Thank you for the link.
p3aker@reddit
Hey bro, shitty sysadmin here. You guys did well to get back on your feet.
One question I have is why are the DCs called 6, 7 and 8. Shouldn’t they be 1, 2 and 3 lol
Whyd0Iboth3r@reddit (OP)
Because they were OS refreshes, and instead of in-place, they incremented. So they spun up 6 7 and 8, then they decommed the old ones. There were 5 from previous IT Team, and they were all 2008 R2.
Due-Mountain5536@reddit
omg i felt sick reading this, so sorry for you guys, must have been one hell of a nightmare
Petrodono@reddit
As a vet sysadmin, the best course of action these days, if moving off to Azure is not an option, is to run DCs as VMs and do snapshot backups to your backup type of choice. Never restore using Microsoft’s methods, they don’t work. Also if a DC is killing auth, shut it down, and build a new one. Better to limp along with one less authenticator than to screw up the domain. Also, in AD there is no such thing as “main” DNS. They are all DNS. DNS replicates so they are all equal.
xxdcmast@reddit
So don’t take this the wrong way because I know you aren’t an ad guy. But you guys fucked up pretty bad.
You basically never restore a domain controller. Especially one from a snapshot a month ago. You likely put the DC into USN rollback, along with a lot of other really bad things.
At this point your best course of action may be to write off the DC you restored as dead, seize its roles, and do a metadata cleanup.
But I don’t expect you or the junior admin to be able to tackle this with little/no experience. My recommendation would be to call MS and pay the 500 bucks for a case and hope for the best. Or call in a local MSP and see if they can assist for a cost.
Sorry to be the bearer of bad news.
Whyd0Iboth3r@reddit (OP)
I understand. I know we are in a bad spot. So should we never backup a DC? I could save 3 Veeam licenses!
jeffwadsworth@reddit
System State backup. Full bare-metal isn’t needed. Do one every day on every DC.
gargravarr2112@reddit
AD is constantly cycling Kerberos tokens for every machine on the domain. So if you restore from backup, then all the machines on the domain will have invalid tokens and be unable to auth. You do want to be backing up your DCs but you really, really only want to restore it if the entire domain has gone up in flames and the only other option is rebuilding the entire thing from scratch. That's why you have to know what you're doing when restoring.
Sorry, but you're really out of your depth here. I recommend enlisting an MSP or Microsoft themselves for help.
Synstitute@reddit
Where can I learn more about this?
ScreamingVoid14@reddit
Which part?
The gist is that there are a lot of moving pieces in AD and a lot of them are synchronizing to each other and also keeping track of the version number of each item on each other DC for better synchronizing. So restoring one DC will immediately throw the entire thing off, especially since that one DC was the PDC, the one that resolves conflicts and is the priority for sync.
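To make the version-tracking point concrete, here is a toy Python model of why a naively restored DC stops being heard by its partners. It is only a sketch: `ToyDC`, `restore`, and `demo` are made-up names, real AD replication metadata is far richer than one counter per DC, and modern DCs will try to detect and quarantine this situation rather than silently carry on.

```python
import copy
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyDC:
    """Hypothetical, heavily simplified domain controller."""
    name: str
    invocation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    usn: int = 0                                     # local update sequence number
    changes: dict = field(default_factory=dict)      # usn -> description of a write
    watermarks: dict = field(default_factory=dict)   # partner invocation_id -> highest USN seen

    def write(self, what):
        self.usn += 1
        self.changes[self.usn] = what

    def pull_from(self, source):
        """Ask a partner only for changes above the USN we already remember."""
        mark = self.watermarks.get(source.invocation_id, 0)
        new = [what for usn, what in sorted(source.changes.items()) if usn > mark]
        self.watermarks[source.invocation_id] = max(mark, source.usn)
        return new


def restore(image, reset_invocation_id):
    """Bring back an old image, optionally as a 'new' replication source."""
    dc = copy.deepcopy(image)
    if reset_invocation_id:
        dc.invocation_id = uuid.uuid4().hex
    return dc


def demo(reset_invocation_id):
    dc8, dc6 = ToyDC("DC8"), ToyDC("DC6")
    for i in range(3):
        dc8.write(f"old change {i}")
    image = copy.deepcopy(dc8)            # backup taken at USN 3
    dc8.write("change 3")
    dc8.write("change 4")
    dc6.pull_from(dc8)                    # DC6 now remembers USN 5 for DC8
    dc8 = restore(image, reset_invocation_id)
    dc8.write("new user created after restore")   # reuses USN 4
    return dc6.pull_from(dc8)             # what DC6 actually receives
```

With `demo(reset_invocation_id=False)` DC6 receives nothing, because it only asks for USNs above 5 and the restored DC8 is handing out 4 again. With `True`, DC6 treats the restored box as a fresh source and pulls everything, including the new write. That second path is roughly what a supported restore is meant to arrange.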
DowntownOil6232@reddit
Will there be the same issues if you only run one DC?
mish_mash_mosh_@reddit
When I worked for the local authority, they supported hundreds of different schools and colleges, and all of them had only one DC. It actually worked very well. We obviously had to do a good number of DC restores from backups, but we never had any DC issues after the restore.
If the worst case ever did happen and the DC restore from backup were to fail (I was there for 6 years and it never happened), they had a base DC image with most of the DC preconfigured, so it would only take a few hours to get the replacement domain up and running and a few days to sort the clients. It was agreed by the local authority that having multiple domain controllers wasn't worth the time or money.
It's been a few years since I worked there, but I bet it's still the same setup.
bobsixtyfour@reddit
running one dc is not a best practice because if it dies, everything is gone if you have an issue with your backups.
DowntownOil6232@reddit
Yes I understand that. I was just wondering if the issue would still happen if there was only one. My guess is no.
ScreamingVoid14@reddit
Correct, there would not be the desync issues if there is only one. Although that has its own concerns and issues.
DowntownOil6232@reddit
Thanks for answering 👍
fireandbass@reddit
https://learn.microsoft.com/en-us/troubleshoot/windows-server/active-directory/detect-and-recover-from-usn-rollback
bcredeur97@reddit
You don’t simply restore one DC. You restore all of them at the same time lol
-_G__-@reddit
You have no idea what you're talking about.
bcredeur97@reddit
I mean if you have image backups of everything at a point in time 3 years ago, you can conceivably roll back the environment 3 years.
As long as you do EVERYTHING
-_G__-@reddit
You're doubling down on your level of incompetence with regards AD recovery, I see.
bcredeur97@reddit
And how can I use this negative comment to improve my life?
-_G__-@reddit
By taking it as proof that you need to study AD recovery processes.
jrichey98@reddit
Computer account passwords will be out of sync; the more time that has passed since the backup, the more computers are affected.
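A rough way to picture the "more time, more computers" point, as a Python sketch. Everything here is an assumption for illustration: the 30-day rotation period is the usual default but can be changed by policy, the machine names are invented, and the idea that a machine one rotation behind may still get by on its previous password is my understanding rather than a guarantee.

```python
import math

ROTATION_DAYS = 30  # assumed default machine account password age


def rotations_since_backup(days_since_last_rotation, backup_age_days, period=ROTATION_DAYS):
    """Count password rotations a machine has made after the backup was taken."""
    if backup_age_days <= days_since_last_rotation:
        return 0  # last rotation predates the backup, so the backup still matches
    return math.ceil((backup_age_days - days_since_last_rotation) / period)


def triage(fleet, backup_age_days):
    """Bucket machines by how far the restored directory lags behind them.

    fleet maps a machine name to the number of days since it last rotated.
    """
    report = {"in sync": [], "one rotation behind": [], "two or more behind": []}
    for name, days in fleet.items():
        n = rotations_since_backup(days, backup_age_days)
        if n == 0:
            report["in sync"].append(name)
        elif n == 1:
            report["one rotation behind"].append(name)
        else:
            report["two or more behind"].append(name)
    return report


# Hypothetical fleet: days since each machine last changed its password.
fleet = {"PACS-WS01": 3, "RAD-WS02": 29, "LAB-PC03": 12}
print(triage(fleet, backup_age_days=2))
print(triage(fleet, backup_age_days=45))
```

With a two-day-old backup nothing in this sample fleet has rotated yet. At 45 days every machine has rotated at least once, and the ones in the last bucket are the likeliest to need their trust relationship reset by hand.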
myrianthi@reddit
No you don't. You turn all of them off and restore the primary. Then you build new DCs in place of the others.
tomaspland@reddit
This guy fucks ^^^^
Again ADRES workshop from Microsoft will walk you through and explain everything, and they help you build a customised nuclear recovery plan.
Just make sure to follow all the advice.
Even if you have AD recovery tools, I implore you all to learn how to backup/restore/redeploy manually, as you then have the knowledge to check the tools are doing things correctly and have a contingency plan if it doesn't go the way you hope.
Whyd0Iboth3r@reddit (OP)
Is it too late to do that? We could do that now.
ScreamingVoid14@reddit
They were not speaking wise words. Unless your backups were all taken within a second of each other, it isn't an option.
Whyd0Iboth3r@reddit (OP)
FYI... It worked. We actually did backup all 3 at the same time... Literally. We are now in a state where we were before he did the restore of the PDC. Stuff is still broken, but DNS works, people can log in, LDAP is functional. We have to fix DC8, but everything else is back to normal. Crisis averted. We literally Ctrl + Z 'd that shit. LOL I should buy a MF Lottery ticket.
ScreamingVoid14@reddit
0.0
I'm pleasantly surprised. You'll still have some stuff to work through, but it should be doable now.
ScreamingVoid14@reddit
You'd have to have very carefully configured the backup to snapshot all the DCs at the same instant. While theoretically possible, it isn't really practical.
-_G__-@reddit
Backing up and restoring DCs is fine as long as you do it appropriately via the MS supported and documented methods.
Dracozirion@reddit
I see way too many replies calling blasphemy on restoring a DC. They probably don't know how to do it.
TotallyNotIT@reddit
I think it started long ago as advice that, if you still have DCs that work properly, it doesn't make a lot of sense to bother to restore most of the time. Even with a non-authoritative restore, it's less complicated to replace the DC than to deal with the restore and fuck around with BurFlags.
Over time, people took that reasonable advice and it filtered through people who don't really know what they're doing in a stupid game of Telephone spread over decades until it became nEvEr ReStOrE a DC EvEr!
DistinctMedicine4798@reddit
I agree, but often times in SMB you will find some application critical to the business on a DC and yes it’s not best practice but they would have to restore. Should just pay the licenses for server standard and split into different VMs
TotallyNotIT@reddit
This is a different stupid situation. I'm glad I don't have to deal with this fuckery anymore but yes, you're correct in outside cases.
-_G__-@reddit
I couldn't agree more.
JaspahX@reddit
Why even do it though? DCs are very easy to just replace. The only legitimate use case I can see would be a disaster where every DC was hosed.
thortgot@reddit
You absolutely want to back up AD but you need to know what you are doing on restore.
DarkAlman@reddit
^ this
Ban_Master@reddit
Happy_Secret_1299@reddit
Reverse_Quikeh@reddit
Ros_Hambo@reddit
^ this
alpha417@reddit
+1
mrbiggbrain@reddit
In a perfect world you have an issue and so you bring up a new domain controller, add it to the domain, seize any required roles, and properly demote the old one.
It's all about knowing what to do when you can't do part of that. In general restore from backup is a last resort because there are lots of gotchas when you do. The backups should exist because they can be used to bring up a single healthy node in really big failure scenarios.
Let's say something happens and you don't have any healthy DCs. You could restore a non-rid (RID is a role) domain controller, usually the PDCE. Then use the perfect world solution to add new domain controllers to get back to the correct number.
Even then there is lots of cleanup, and it increases the longer the backup sits. One from a month ago is going to save you some time, but you're going to basically be manually fixing every computer's trust.
pssssn@reddit
You can restore a domain controller with Veeam but it has to be done correctly.
https://www.veeam.com/blog/how-to-recover-a-domain-controller-best-practices-for-ad-protection.html
BornAgainSysadmin@reddit
Irrelevant to OP's issue, but I just wanna say Veeam app backups for AD have been super helpful over the years for me. Latest issue was a GPO that was acting up. I forget why, I think it was something dumb I did. Restored the object from Veeam, and all was well.
DarkAlman@reddit
Seconded: The ability to restore individual users and GPO objects from Veeam is a F***ing lifesaver!
SnaxRacing@reddit
My manager is hellbent against using Veeam and we are now only doing full image backups from our RMM. Pray for me boys
ResponsibleBus4@reddit
Then turn on the recycle bin at the least if you can.
SnaxRacing@reddit
All customers have it enabled… I’ve tried my best to mitigate anything I can. But with most customers being very small orgs, we’re looking at single DC Active Directories so… YOLO?
HJForsythe@reddit
Why not just use Azure AD and do DHCP in their firewall, etc?
hxpttrn@reddit
This!
HJForsythe@reddit
To be fair it shouldnt be nearly this complicated if only they werent carrying over code from NT 4 in 2024
tomaspland@reddit
Using an AD or backup tool is fine, but you should still understand how the actual mechanics of AD work, so you are informed in case the tool doesn't work as intended.
Candle-Different@reddit
Even veeam tells you there is inherent risk in doing so though.
Jumpstart_55@reddit
Does this apply to veeamzip as well? My home lab has 2 2019 DC just cuz hyperv. Didn’t want to waste 2 licenses for them so every month I veeamzip them to my NAS.
THE_Ryan@reddit
Definitely backup your DCs, but you have to do it correctly or else additional intervention is needed after the restore.
Also, restoring from a month ago isn't usually going to go well for your users. Most of the auth won't work right away and the trust relationships for the machines will probably be broken.
tomaspland@reddit
Ask Microsoft to quote you for a ADRES (Active Directory Recovery Execution Service) workshop
https://download.microsoft.com/download/A/C/5/AC5D21A6-E04B-4DC4-B1F2-AE060319A4D7/Premier_Support_for_Security/Popis/Active-Directory-Recovery-Execution-Service-[EN].pdf
It wont be cheap, but will enlighten the poor sod of a junior sysadmin, give them a much deeper understanding of how AD works and how to monitor and thus prevent replication issues etc from snowballing. Prevention is better than the cure!
ehode@reddit
You want to be backing up, but the restore requires you to pick one of the paths outlined for restore. It partly comes down to not letting a lot of the AD data get out of whack/mistimed.
budlight2k@reddit
Yes back it up but there is a process to restore it. You can't just restore the whole VM.
b4k4ni@reddit
Backing up a DC is important too. But restoring it the right way is a different matter. That's why you have more than one. Basically the only reason to restore is when all DCs are gone. Then you restore all of them. And hope your DSRM password is saved for all DCs.
ScreamingVoid14@reddit
Always have backups, but unless everything died, you are generally better off writing off a dead server and doing a fresh install and promotion. There is very little/nothing that a DC keeps locally that isn't also on the other DCs.
The backups will be used in case of a full loss of all DCs. You will restore that latest backup and then do fresh installs for the others.
myrianthi@reddit
You SHOULD back up the primary DC in the event of some catastrophic loss where all of your DCs shit the bed. Restoring it requires turning off all of the others though, so that it can't communicate with the busted DCs. Then once it's up, you work on standing up new DCs in place of the others which were turned off.
InevitableOk5017@reddit
Jezus my friend, have you done any back studying of a mcse cert?
802-420@reddit
Since you're using Veeam, you may be able to engage their support to assist with the restore. I'm not a Veeam client, but I get that level of support from my backup vendor. They will be far more responsive than MS and you're probably already paying for support.
ihaxr@reddit
You don't need to use veeam to backup the DCs and you only need 1 backed up.
Use the Windows built-in backup for the AD stuff, kept off site, for a complete disaster recovery restore. If a DC blows up, just build a new one with the same IP and let it replicate from the working servers.
ephemeraltrident@reddit
Others here are right, you are in a pickle - but find some specialized help and you’ll be fine. From what you’re describing, your systems should be returning to functional with a few hours of work, and you’ll likely put out little fires over the next week or two. You’re not hopeless, you’re just in a bad spot right now.
triktrik1@reddit
Quick question, I’m just trying to understand the consequences. But why would you not want to restore a DC from a snapshot
Dracozirion@reddit
USN rollback issues only occur prior to Server 2012 combined with Hyper-V 3 or vSphere 5.0. Anything higher will not have this issue if you restore from a snapshot.
theotherThanatos@reddit
This is false, I just had a dc go into usn rollback on a 2019 server after pulling from a snapshot. Had to force demote and clean up metadata
Dracozirion@reddit
It is not false. Maybe you were using a hypervisor without Gen-ID support or had other issues. This should not happen.
https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/introduction-to-active-directory-domain-services-ad-ds-virtualization-level-100#virtualization-based-safeguards
Madd_M0@reddit
We just ran into this issue with a few of our DCs that were Server 2019. Had to seize roles and decommission the DC.
fireandbass@reddit
Same, we had a USN rollback on Server 2019 when a DC was moved from one host to another while powered on. Thankfully, we were able to restore it with Veeam, which is AD aware.
bartoque@reddit
Which still doesn't seem to be that smart a thing to do when doing an authoritative restore with a backup from a month ago, as reported by OP? The backup from last night, possibly yeah, and only then when the whole setup would be pretty much completely screwed?
Most AD admins, when asked, have never even performed a non-authoritative restore, let alone an authoritative one. Pretty much always they add a replacement system and promote it.
Only now do we see, being the backup admin myself, that by giving admins the option to perform restores in a completely shielded-off network environment, they can even test a complete DC DR by doing an authoritative restore, actually rebuilding things from scratch without affecting production...
Dracozirion@reddit
That's right, not ideal with a backup of one month old. I was mainly replying to xxdcmast and not to OP.
xxdcmast@reddit
USN rollback is still a thing, but yes, Generation ID on virtualized systems was designed to help.
I still wouldn’t ever restore a DC, authoritative or non-authoritative, if I had others. It’s trivial to do a metadata cleanup and build a new DC, which won’t carry the risk of all the problems here.
If you like doing non-authoritative restores then have at it.
No_Nobody_7230@reddit
I don't think the $500/case is a thing any more.
crypticsage@reddit
Would restoring it to the previous day before they did the restore help?
I’m thinking at least this way it goes to a recent configuration. Then move the roles to another dc and demote the primary.
xxdcmast@reddit
No it will still be in usn rollback and likely still be a host of other issues.
The only time you really restore a dc is complete domain compromise. Then you restore one and only one dc and rebuild from there.
If you have more than one DC, and you should, the correct way to handle a failing/failed DC is to demote it, or dirty delete it and do a metadata cleanup.
kozak_@reddit
Agreed, fix is to get to one DC and rebuild. Per Microsoft, USN rollback recovery is removal of problematic DC.
https://learn.microsoft.com/en-us/troubleshoot/windows-server/active-directory/detect-and-recover-from-usn-rollback
VNJCinPA@reddit
Demote and decommission DC8. Do metadata cleanup in AD. VALIDATE.
Install new DC.
That's how you should wrap this up
Ok_Presentation_2671@reddit
Dcdiag would be a start
mooboyj@reddit
Engage Microsoft, they'll fix it. It'll be a few hundred $$$ but well worthwhile.
I had this done at an old MSP as a tech had failed a forest upgrade and not told anyone... He left and I inherited it and we engaged Microsoft and they resolved it with maybe 12 hours of work.
Let_There_Be_Pizza@reddit
How do you actually contact Microsoft here? I have absolutely zero clue in a case of emergency.
Phate1989@reddit
There is a support portal, enough googling and you can find it.
I think you need to login with a non-business account or you just get redirected to 365 support.
WesternNarwhal6229@reddit
To avoid this in the future look at Cayosoft. They have the only standby forest recovery solution on the market that has this capability. You will never have to worry about recovering AD again.
jkeegan123@reddit
Call Microsoft, pay the 500$. Or call an msp partner and make a lasting relationship that you can lean on in times like these.
mcshanksshanks@reddit
Pours one out for a homie
1RedOne@reddit
We had the great Stephen outage of 2011 when I ran a PowerShell script to make new users.
It was supposed to copy all group memberships from user A and add user B to all of them.
Instead, I misunderstood the function of the PowerShell command, and it deleted all users from all groups that user A was a member of, and made user B the only member of all of those groups.
Wouldn’t be a big deal but for the fact that we used these memberships for parking deck or building access and for phones and for everything
The phone immediately started ringing after I ran my script
The best part is that it would have saved me about five minutes of work once a month. Instead we had an all hands on deck 48 authoritative domain restore scenario
Thank god for our remote backup domain controller which was in a slow sync schedule about 100 miles from home office
It was recent enough to become our new PDC and we just resynced from it back to home office
I was definitely showing up early and bringing donuts and buying the Friday beer lunch for my coworkers for a few weeks after that
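For anyone wondering how a "copy memberships" script turns into an outage, the difference between the intended behavior and what happened can be sketched in a few lines of Python. This is not the original PowerShell, which isn't shown here; the group and user names are made up and plain dicts and sets stand in for AD.

```python
def copy_memberships(groups, source_user, target_user):
    """Intended behavior: add target_user to every group source_user is in."""
    for members in groups.values():
        if source_user in members:
            members.add(target_user)


def copy_memberships_replacing(groups, source_user, target_user):
    """Failure mode from the story: the member list is replaced, not extended."""
    for name, members in groups.items():
        if source_user in members:
            groups[name] = {target_user}


# Hypothetical groups, purely for illustration.
groups = {
    "parking-deck": {"user_a", "carol"},
    "phones": {"user_a", "dave"},
    "lab": {"erin"},
}
copy_memberships(groups, "user_a", "user_b")
print(groups)
```

The safe version only ever adds to a group. The replacing version throws away everyone else in every group user A belongs to, which is why trying a script like this against a throwaway group first is cheap insurance.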
SpiceIslander2001@reddit
LOL! I think I'm going to rename The Great Password Reset of 2018 to the Kevin Event.... :-)
manvscar@reddit
And I'm going to name the great SCCM wipe-and-reinstall-100-staff-pcs "The Great ManVsCar".
mcshanksshanks@reddit
pours two out for this homie
Rowendk@reddit
Is the time set correctly on them all?
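Kerberos by default tolerates about five minutes of clock difference, so this is worth ruling out early. A small Python sketch of the comparison, assuming you have already collected each DC's clock reading somehow (the collection step is not shown, and `skewed_hosts` is just an illustrative name):

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # Kerberos default tolerance; policy may differ


def skewed_hosts(clock_readings, reference, max_skew=MAX_SKEW):
    """Return {host: offset} for hosts whose clock is off by more than max_skew."""
    return {
        host: reading - reference
        for host, reading in clock_readings.items()
        if abs(reading - reference) > max_skew
    }


# Hypothetical readings taken at the same moment.
reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
readings = {
    "DC6": reference + timedelta(seconds=20),
    "DC7": reference - timedelta(seconds=45),
    "DC8": reference - timedelta(minutes=9),
}
print(skewed_hosts(readings, reference))
```

In this made-up sample only DC8 is flagged. A restored or snapshotted VM coming up with a stale clock would look exactly like that.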
DarkAlman@reddit
Well that was your first mistake...
Restoring a Domain Controller requires a bunch of extra steps and should only be done in a DR scenario.
Either pay MS for support or call in a consultant to help you.
Your PDC is probably in tombstone mode now and will require manual intervention to fix.
Fallingdamage@reddit
OP didnt go into deeper detail, but aside from probably taking a high risk with restoring from the old backup, I didnt get the feeling that they had a backup plan for that. "What do we do if the restore makes things worse?" should be asked before taking that step.
I have had to troubleshoot a lot of odd Domain issues and have cleared many of them up over time. Every environment is different but odds are with careful examination, each problem can be isolated and worked on. Even the gremlin-like nuances that dont have solutions but only workarounds. It sounds like Jr was just playing whack a mole with google as their guide without (possibly) understanding what each thing was going to do.
DarkAlman@reddit
To be fair to them that would have been a perfectly reasonable course of action for any other server other than a DC.
That's exactly what happened.
I consult for a living, and I tell my customer all the time
"Just pick up a phone and call me, 5 minutes of advice from me can save you hours of downtime"
My hourly rate is nothing compared to the downtime these guys are facing now :(
manvscar@reddit
I'd go a bit farther and say to demote and seize the FSMO roles of the PDC and then just completely wipe and rebuild it. You never know what registry settings or other strangeness might persist even after demoting.
Legionof1@reddit
I agree, never demote promote.
architectofinsanity@reddit
Demote destroy
TheBeckFromHeck@reddit
And build from scratch. Don’t use a template for the new VM.
manvscar@reddit
Absolutely
about90frogs@reddit
Thanks for the explanation, that was a good write up and it taught me something.
MethanyJones@reddit
I would open an incident at this point. The cost of the downtime is likely huge compared to the incident fee
evantom34@reddit
Thanks for the rundown. I haven't had this happen, so it's helpful!
JustInflation1@reddit
Sounds like technical debt from no IT. Tell your company it is time to hire IT.
JJHunter88@reddit
I've rarely seen a backup of a DC work correctly after being restored.
Are any of the DCs working correctly? Usually you stand up a new server, install the DC roles and promote it, then demote and remove the bad server.
Fallingdamage@reddit
I have learned to keep a PDC and SDC running, AND keep a third DC replicating quietly with no other roles in its wheelhouse to use as a hail-mary promotion if the domain goes south.
Tombstone the old original two DCs, kill off all the DNS servers and DHCP servers, use the third DC for a hostile takeover, and build out the whole cluster of servers and their roles from the newly promoted PDC. Top-down. Don't try to bandaid things laterally if it's become a spaghettified mess.
I have even introduced a third DC for the sole purpose of taking over as the head while cutting off the rest of the body.
Whyd0Iboth3r@reddit (OP)
It's hard to say. 6 is the main DNS server and it is hit and miss. We can try to stand up a new one, and attempt those steps. But before he restored 8, he did try to change the PDC to 6, but it gave him an error about not being able to contact the DC6. So it wouldn't take.
DarkAlman@reddit
For those reading this later:
You needed to use the -Force parameter on the FSMO transfer PowerShell cmdlet (Move-ADDirectoryServerOperationMasterRole) to move the roles when the PDC is damaged or offline.
tankerkiller125real@reddit
You can restore a backup DC, however, the first step to that is killing all other DCs you have. Then forcing the removal of the old DCs on the restored DC, and rebuilding all other DCs from the bottom up using the restored DC as the source of truth.
Basically the only time you should ever do it is if AD is already super ultra fucked from something like ransomware.
jsedgar@reddit
Bite the bullet and contact Microsoft. Or a company that has Microsoft support.
Fallingdamage@reddit
And make sure to tell them you already reverted all the commits to a month ago.
SleestakWalkAmongUs@reddit
Now is when you bring in an MSP. As others have stated, you're a bit in over your head with this situation. Nothing about it involves any sort of fun either. Call in the pros.
Fallingdamage@reddit
Ah yes. Domain is down, shit has hit the fan. Nothing like engaging an MSP so you can sit in teams meetings for 3 months talking about the problem. Planning, Discovery, Proposals, Remediation, Plugging-for-sales-department, 'Project Coordinators', Jerry the Rockstar who comes out and runs utilities on USB and grumbles about your environment, etc.
In the meantime, domain is still down and costs are racking up.
sambodia85@reddit
Yep, they don’t have enough understanding to correctly plot the right course out of this. 8/10 chance they will make it worse, even with great advice on here, there’s just too many moving pieces.
Proof-Variation7005@reddit
and while there's been good advice dished out here, i don't think it's unfair to say OP and the other admin could very easily take a wrong turn in the recovery.
i'd say most of the advice is just a "here's what the company you're gonna bring in will/should say" rather than "go do this"
mrtuna@reddit
They already did when they restored a month old DC
OCTS-Toronto@reddit
100% This! You goofed with the restore. It could be saved, but it needs experts in AD and you aren't going to get a quick fix on Reddit. Call the professionals and get them to fix it. They can then set you up to maintain it long term.
matman1217@reddit
Can you replicate all of the working domains to a brand new build of a DC and then setup and sync all of it into azure? Curious why you are running such an old setup anyways.
Whyd0Iboth3r@reddit (OP)
the costs of licenses is astronomical to us.
matman1217@reddit
How many users?
Ezzmon@reddit
TLDR; Never restore AD if there's any possible way around it. 'Restore from backup' is the nuclear option.
It's very common to omit DCs from full backups. SYSVOL perhaps, but not the application. Rule of thumb is; problem with a specific controller?--> transfer FSMO Roles to another and shut it down, build a new one (after some troubleshooting, of course).
Another rule of thumb; DO NOT run any other Roles on a DC besides AD and DNS Global Catalog. If you need DHCP services running alongside, build another single purpose server.
VirtualDenzel@reddit
Rule of thumb: don't believe anything you read on here. Build a server purely for DHCP? Nah. Just diagnose the problem properly. You can run ALL roles on a single DC if you wanted.
Just set up proper failover. And no full backup for DCs? That is retarded.
Ezzmon@reddit
Been a long time since I was called retarded.
Maybe crack open an MS recommendation every so often. If you have critical services all running on the same machine, guess what goes down during every reboot. Everything. I agree with you about failovers, except with DHCP. Anytime someone does manual change in DHCP, the scope then has to be manually replicated.
Most companies run VMs, and any company running VMs with stacked Role servers is asking for heartache.
VirtualDenzel@reddit
Well someone had to tell you. Stacked role servers are no problem. You just have to know how to do your job. Our environment is one of the most complex ones in the Netherlands due to all the connections between military, government and the banking sector. Full backups. Full restores. No problem. DHCP replication is fully automated. Also with manual changes. It is not 1995 anymore.
Nothing you said made sense in 2024. Maybe in the 2000s. And crack open an MS recommendation? Lol, have you ever seen one MS best practice document that works one-to-one in production? They do not exist.
Enough time wasted. Thank god you do not work for my team.
Ezzmon@reddit
Indeed thank god I dont. If I had close proximity to someone with your attitude I’d lose my hair. But, your type isn’t uncommon. I hope your snottiness serves you well, somehow.
VirtualDenzel@reddit
Nah you would be sucking up to me , since id be your boss. There is a reason dutch citizens go through our different systems at least 19 times a day.
They would not even let you handle diaper replacement software in kindergarten.
Come back when you actually know what you are talking about 🤣🤣🤣. Aka see you never again. Bye kid.
Ezzmon@reddit
OK, Mr. 'works at important military\banking sector vendor but thinks running multi-role domain services is 'fine''. Are you only given 1 server? JFC. That statement contains all anyone needs to know about taking advice from you. Have fun with that.
VirtualDenzel@reddit
Lol son, stop talking. You are just showing exactly what a fool you are. 🤣🤣. Multi domain services can be fine , is it depending on size and setup. Sure. You really have no clue kid. Im sorry for you. Maybe one day once you finish high school i will give you an internship. But i dont think that with that attitude you will get past helpdesk. Good luck in life. It must be hard being you.
P.s this is my last message. I know you will respond to it since thats exactly the kind of kid you are. But i wont bother reading it. You are not smart enough to understand when you are talking to someone way out of your league.
😘
Hsensei@reddit
Sounds like sync issues. Demote one of the secondary DCs and then promote it again. I bet that fixes it.
manvscar@reddit
Something that a lot of younger sysadmins don't realize is that domain controllers really are meant to be "disposable". This is why if possible you should never install other roles or services on a DC - if it starts acting up it's usually easiest just to demote it, delete, and fire up a new box/VM to promote.
In my younger, more inexperienced days I had a physical PDC which was also running a DHCP server. The RAM went bad in the box and it started having serious issues to the point that I couldn't even log in.
In hindsight, the best procedure to fix this would have been:
1) Shut down the failed box
2) Restore from a backup to a VM without network to retrieve the DHCP scope without introducing old replication data
3) Import DHCP scope into a new DHCP server
4) Turn off and remove the restored DC VM
5) Seize FSMO roles to a functioning DC
6) Rebuild new DC, and optionally transfer the FSMO roles.
But instead, I did the unwise and restored from a backup (using our backup tool) that was a couple days old. Luckily, this was on the weekend and not much had happened in AD, and the restored DC did actually resume replication. I ran into a few GPO issues, but overall I was lucky and was able to get everything functioning again. But, again I was lucky, and it wasn't until I found some of these minor GPO issues that I learned that simply restoring a DC from a regular backup will almost always break things, and if the backup is especially old, it could completely fubar your AD.
The only proper way to restore a domain controller is using Directory Services Restore Mode. You boot to this mode and recover AD in one of two ways: 1) Authoritative and 2) Non-authoritative
Authoritative tells all other DC's that this restored backup is the "source of truth" and it will replace all other data.
Non-authoritative tells the newly restored DC that it is to only "pull" replication data from the other DC's.
https://4sysops.com/archives/recover-active-directory-domain-controllers-with-nonauthoritative-restore/
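To make the difference between the two modes concrete, here is a toy sketch in Python. It is not how AD is implemented (real replication tracks USNs and per-attribute metadata); it only assumes that each object carries a version number and that the higher version wins. The names `replicate`, `VERSION_BUMP` and the sample objects are made up for illustration.

```python
# Toy model only: real AD replication uses USNs and per-attribute metadata.
# Here each DC holds {object_name: (version, value)} and replication keeps
# whichever copy of an object has the higher version.

VERSION_BUMP = 100_000  # assumption for this sketch: a bump big enough to win


def replicate(local, partner):
    """Merge partner data into local; the higher version of each object wins."""
    merged = dict(local)
    for name, (version, value) in partner.items():
        if name not in merged or version > merged[name][0]:
            merged[name] = (version, value)
    return merged


def non_authoritative_restore(backup, partner):
    """Restored DC keeps its old versions, so newer partner data overwrites it."""
    return replicate(backup, partner)


def authoritative_restore(backup, partner):
    """Restored objects get their versions bumped so they overwrite the partner."""
    bumped = {name: (version + VERSION_BUMP, value)
              for name, (version, value) in backup.items()}
    return replicate(partner, bumped)


if __name__ == "__main__":
    backup = {"alice": (3, "old password")}
    partner = {"alice": (7, "new password"), "bob": (1, "created after backup")}
    print(non_authoritative_restore(backup, partner))
    print(authoritative_restore(backup, partner))
```

In this toy, the non-authoritative restore ends up with the partner's newer copy of `alice`, while the authoritative restore pushes the backup's copy back out. Objects that only exist on the partner (`bob`) survive in both cases here, which is a simplification on my part.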
So you can restore a DC in these ways, but the truth is neither of these ways is ideal. They are honestly more difficult than just forcefully demoting the bad DC and building a new one.
If you're in an "only" sysadmin role, this is a situation that you absolutely have to be prepared for. DCs die, and when they do, leave them dead and build new.
GreenHairyMartian@reddit
The phrase I like is to treat your servers like cattle, not pets. Cattle get processed and only last a few years; they aren't pets that you take care of for as long as possible.
Proof-Variation7005@reddit
Given the level of staffing you're running with, this server setup seems unnecessarily complicated. How big a network are we talking?
You could easily just have PDC / DNS on 1 server and the other as backup DC / DHCP / primary DNS. You might be small enough to justify having DHCP/DNS/AD running on 1 server with a backup DC/DNS.
I'd also agree with people who've suggested calling in an outsourcing person.
My gut feeling is save a copy of the DHCP database, turn 6 and 7 off completely, restore 8 to something as recent as possible, then test to see if machines work, you can change a password, etc. Then you'd delete all references to 6 and 7 in Active Directory like they got Thanos-snapped out of existence.
Then you format/reinstall ONE of them and make it your backup DC/DNS. DHCP can go on a domain controller for a smaller network without an issue. You could have a dedicated DHCP server that isn't a DC too. Hard to really say. Hell, you could recreate the same setup you had and just have someone sanity check you along the way so the DNS problem that caused this is caught.
flexcabana21@reddit
Was the old admin just building stuff for fun, or was it incompetence?
jrichey98@reddit
Trust me, you always want more than 1 DC. We have 2 per site, but it's not a bad idea to have a third DC (call it your management DC) at your primary site. Ideally you want them on different hardware.
Multiple DC's are needed for HA as well as fault tolerance in case of an issue with one. You don't want to take down services because of a windows update. Well the DC is updating and now sharepoint and exchange have crashed, and people can only log in on cached credentials and will be off their domain account until next reboot/login.
flexcabana21@reddit
No one is saying no to reducing, but why does a place that currently has no sysadmin have 8 DNS servers? For anything more than 3 of each I’d expect at least a team of 2 to 3 people who can manage this infrastructure. Not someone running to Reddit for a quick fix.
jrichey98@reddit
They stated they had 3 DC's, which is a reasonable number for a domain/site. Since their admin is Jr, I didn't want them getting the wrong idea about multiple DC's being overcomplicated.
I think the confusion comes from them talking about 6 7 & 8. My assumption is that they are referring to them by IP: x.x.x.6, x.x.x.7, & x.x.x.8. The x.x.x.8 DC was the one acting up and was the PDC. My interpretation of course.
Proof-Variation7005@reddit
It kinda reeks of “I read best practices are all this shit gets its own server” with no regard for scale lol.
Nexus1111@reddit
😭
bitanalyst@reddit
Are you by chance using CrowdStrike Identity? If so try turning off LDAP/LDAPS inspection.
Whyd0Iboth3r@reddit (OP)
Nope, we are not.
godzilla619@reddit
I want to know who talked the sys admin into restoring the whole VM from a month ago?
McClouds@reddit
OP is a PACS Admin, so they work at a hospital or some type of imaging facility. Quite possible the server/domain sys admin is just a junior admin, and the IT manager is a nurse who once made a really good excel document.
Honestly sounds like something my hospital would do. Luckily there's enough seniority that stuck around after multiple restructures to tell people a bad idea is a bad idea, but we're leaving slowly.
We just had a downtime for our PACS that lasted half the day uploading security certs because CAB wanted to minimize downtime and apply patches during the reboots required to apply certs. Broke LDAP, no one could log in until all certs were applied across 20 servers, and each server required the previous month's windows updates to install on reboot.
Wasn't very smart, and it was signed off by everyone who can approve changes. No one asked questions because they don't know what questions to ask. It's the death of expertise.
ConfectionCommon3518@reddit
If people are panicking and hoping for a quick solution just take a mandatory cig break even if you don't smoke as there's lots of sh!t flying everywhere and you need some time to think.
eoinedanto@reddit
Call in Third Tier as IT paratroopers who can tell you what can be saved here
datec@reddit
Truly r/shittysysadmin content...
newton302@reddit
TOO SOON
LuffyReborn@reddit
OK, so first, for whenever a domain controller goes to shit and the usual methods to make it replicate fail:
IMPORTANT: NEVER RESTORE FROM A BACKUP AT VM LEVEL!!!
There are tools from MS and other vendors that work with that type of situation. And most important: if it's only one, there is always the option to demote it, do a metadata cleanup, and recreate the box with the same name and IP; it will replicate and things will go back to normal.
I saw some responses in this topic saying that you should power down the other DCs that are not FSMO holders (the reply only mentions PDC) and restore it. None of the orgs I have been with that had massive prod infrastructure could afford this approach.
Glad the OP was able to fix it, but he made things much harder due to lack of experience. It's not bad, shit happens; making this comment for future folks that might find this thread.
kozak_@reddit
Yeah.... Never good to restore a member DC. Always add an additional DC and then rename / re-ip.
If this was my environment I'd pull a couple of hours and overnighters to do the following:
But.... You might want to get Microsoft support involved. Would probably be cheaper and faster.
shagad3lic@reddit
I skimmed through reading so this may be redundant. You did screw up by restoring the domain controller from backup because you had 2 others there. That's the whole point of having multiple DCs. That's OK, shit happens, now you know.
If it were me, I would shut down the DC you restored; it's as good as dead right now. The hope here is that the restore probably has an old AD schema/database revision which is lower than the other 2 DCs', therefore they would try to update the one you restored, but most likely failed to do so because the one you restored may have held all the FSMO roles.
So you shut it down, reboot the other 2.
Seize the roles using ntdsutil (plenty of step-by-step articles), pretty straightforward, to whichever DC you choose. If one is 2016 and the other 2019, then obviously choose the 2019, but there are other factors that weigh in.
Then update the DNS settings on the network cards (or network team) of each of those servers. If they are VMs, you don't have to worry about NIC teaming. You update the DNS on each server NIC. Primary DNS on each local DC points to the other server, secondary DNS = 127.0.0.1 (itself).
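As a quick sanity check of that DNS layout, a minimal Python sketch that just writes down what each surviving DC's NIC should list. The IPs are hypothetical placeholders, and `dns_client_order` is a name made up for this example; it only covers the case of exactly two remaining DCs.

```python
LOOPBACK = "127.0.0.1"


def dns_client_order(dc_ips):
    """Return {dc_ip: [primary_dns, secondary_dns]} for two surviving DCs.

    Primary DNS on each DC is the *other* DC; secondary is loopback (itself).
    """
    if len(dc_ips) != 2:
        raise ValueError("this sketch only covers the two surviving DCs")
    first, second = dc_ips
    return {
        first: [second, LOOPBACK],
        second: [first, LOOPBACK],
    }


if __name__ == "__main__":
    # Hypothetical addresses; substitute the real IPs of the two remaining DCs.
    for dc, order in dns_client_order(["10.0.0.6", "10.0.0.7"]).items():
        print(dc, "->", order)
```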
Now reboot again. Hopefully, if you are lucky, login ability is restored. If so, awesome.
Now you have some cleanup to do. Go to dsa.msc, go to the Domain Controllers OU, r-click, delete the server that you shut down.
Go to Sites and Services, delete the server you shut down in there.
Open DNS mgmt (dnsmgmt.msc) and clean up the DNS entries for the old server in there, starting with the name servers. Go to forward lookup zones, right click on each zone and choose properties, click the name servers tab, delete the old DC/DNS server from there. If you have a reverse lookup zone configured, you want to go in there and do the same thing.
That should get you back up and running if you are lucky. There is more you can do, but it's Friday night, I'm half drunk but was motivated enough to help a fellow IT guy out, but I'm going back to football and drinking :)
manvscar@reddit
Excellent and thorough advice here.
dedjedi@reddit
DM me and we can discuss rates
Whyd0Iboth3r@reddit (OP)
Thanks for the offer, but we aren't going to hire random guy from reddit. LOL I would consider it for personal stuff, but the company wouldn't.
michaelpaoli@reddit
But you'll take your sysadmin advice/instructions from social media (e.g. Reddit)?
Whyd0Iboth3r@reddit (OP)
At least I can take the advice from here and verify it elsewhere. Having some dude log into our site to do repairs is a whole different story. An MSP has insurance, and we'd have a contract.
michaelpaoli@reddit
Well ... you can pay some random dude for advice and verify it elsewhere.
;-)
judgethisyounutball@reddit
So it seems like you would be OK with rolling back to your AD environment from a month ago. If that's the case then, as mentioned earlier, the other two DCs need to go offline. Restore 8, punt 6 and 7, do metadata cleanup, and for the cleanest path forward format and reinstall 6 and 7, give them new names, promote them, set up roles, and address any issues you see moving forward with the old DCs in the forest. The new names will make identification of entries from the old DCs that much easier (like any NTDS Settings that may have been missed during cleanup). Depending on the speed of the machines, the restore processes and Windows f*cking updates, you could be back up and running inside of 6 hours. Quicker if you can reimage 6 and 7 and run updates while restoring 8.
jooooooohn@reddit
This is likely what I would do outside of paying Microsoft to fix it.
dedjedi@reddit
you're taking advice from random guy on reddit, btw
BornAgainSysadmin@reddit
What u/judgethisyounutball posted could likely be your simplest path forward and might be what I'd try at this point. There may be some residual issues with client servers and machines with outdated machine keys, and other issues that will have to be handled after getting AD going.
As for paying someone for help, seriously consider opening a case with MS for this. I forget what the cost is these days. It might be $500 per incident.
Canecraze@reddit
Call Microsoft and pay for help. Years ago, this cost $500. IDK what it costs today. They will help you, if your situation is salvageable. Open a P1 ticket but be prepared to work on the issue non-stop until it's resolved.
anonpf@reddit
First steps to troubleshooting a domain controller are
Repadmin and Dcdiag
Checking the health of the domain controller and replication status helps a ton.
As far as recovery goes, take the DC you restored from backup offline, force the FSMO roles onto another DC, and verify logins are restored. Any systems that are pointing to the bad DC for authentication will probably need to be rebooted. Rebuild DC8 from bare metal, configure per your documentation, go through the dcpromo process and allow the DC to replicate from its partner DC. No need to change FSMO roles back unless you need them to be on DC8 for some reason.
For future reference, I ran repadmin and dcdiag on a daily basis just to ensure I knew how my DCs were running. I never liked the scream test for these systems seeing as they were too critical for that.
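A rough idea of what such a daily check could look like, sketched in Python rather than my actual script. It assumes it runs on a DC (or a box with the AD tools installed) where `repadmin` and `dcdiag` are on the PATH; `run_checks` and the output file naming are invented for this example. The exit code is only a weak signal, so the saved output still needs a human to read it.

```python
import subprocess
from datetime import date
from pathlib import Path

# Read-only health checks using standard repadmin/dcdiag switches.
CHECKS = {
    "replsummary": ["repadmin", "/replsummary"],
    "showrepl": ["repadmin", "/showrepl"],
    "dcdiag": ["dcdiag", "/q"],  # /q prints only errors
}


def run_checks(out_dir, run=subprocess.run, today=None):
    """Run each check and save its output to <out_dir>/<date>-<name>.txt.

    Returns the names of the checks that exited with a non-zero code.
    `run` and `today` are injectable so the function can be exercised
    without the real tools.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()
    problems = []
    for name, command in CHECKS.items():
        result = run(command, capture_output=True, text=True)
        report = out_dir / f"{stamp}-{name}.txt"
        report.write_text(result.stdout + result.stderr)
        if result.returncode != 0:
            problems.append(name)
    return problems


if __name__ == "__main__":
    # Hypothetical output folder; schedule this with Task Scheduler or similar.
    failed = run_checks("dc-health-reports")
    print("checks with non-zero exit:", failed or "none")
```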
manvscar@reddit
There's a handy DC Check report script floating around that runs both tools and then emails a nicely formatted report. I make a habit of running it daily.
anonpf@reddit
Yeah, I created a PowerShell script to run daily checks as well. I just sent it to text though.
Rarely did we ever come across issues with our DCs, but the ones we did come across were major enough that we needed to rebuild and replicate.
manvscar@reddit
It's honestly a really good peace-of-mind tool as well. Running it daily means you always stay on top of any issues.
It might be different for other sysadmins, but the thought of losing AD is the most stressful for me.
anonpf@reddit
Oh for sure. We were on top of our AD infrastructure.
I agree with you completely. Losing AD is like losing your keys to the house. You ain’t gettin’ in.
MDKagent007@reddit
oh man you never, ever restore a dc; you might as well start building the network from scratch...
ifixedacomputer@reddit
Pick 1 domain controller, to the best of your ability that is the most current and NUKE all the other DCs. Make sure it has all FSMO roles, you can use powershell to set these.
Google how to demote a DC that you cannot demote through role removal and clean up all meta data to the rest of your DCs that you will be nuking.
Once this is done start cleaning up AD objects like users and get passwords reset and your core users back onto work.
Folder redirection may have issues but it's not a big deal, as users login they may get new redirected folders just move their data to the new folder.
Share drive/ security group membership will probably be fucked, just focus on getting users that generate cash flow for the business back online.
Workstations are probably fucked too in this scenario, so just rejoin them to the domain. If you have a subdomain like sub.domain.tld you can skip taking the machine to a workgroup and just type in the "sub" part of your domain if DNS isn't totally fubar.
Speaking of DNS, make sure you update the DHCP servers on all your routers' LAN interfaces to only point to the singular DC that you won't be nuking.
Also make sure every site/router can reach your singular DC's subnet. You may need to set up ipsec/wireguard/openvpn tunnels, or if there's a VPN/RAD server on the subnet or routable to it, configure VPNs on each client that is mission critical and makes the business money.
I'm probably missing stuff, but the general idea of this comment is that you rebuild your environment off of the DC in the best shape to get your core people going, and once that is done you start building new DCs off the one you decide to roll with.
I recommend this if you can't get anyone with experience that knows how to fix an environment when AD shits the bed.
Good luck, keep a peaceful mind to the best of your ability, you will make it through this and be better off because of this experience.
budlight2k@reddit
Wow, for the love of God stop doing stuff, you're on the brink. Everything you described, starting with the restore of a PDC, is making it much worse.
At this point an AD professional needs to look at the status of your domain and all credible options.
Get services from Microsoft or a reputable MSP.
fireandbass@reddit
Domain controllers do not require the DHCP Server service to operate and for higher security and server hardening it is recommended not to install the DHCP Server role on domain controllers.
dunnage1@reddit
If I remember correctly, that error code is happening because you’re trying to sync with the pdc that you wiped.
Like everyone said. Backups need to be done meticulously and correctly.
I’d go with opening a ticket.
You can try repadmin /syncall /AdeP on the PDC to force replication, but I think it's a moot point at this time.
tch2349987@reddit
You can create another DC and see if you can promote it, shut down the other ones and see if the new one works correctly, then you can start planning the next step. The last thing you can do is rebuild them.
thortgot@reddit
If replication is having issues, it's unlikely you can promote anything.
In a scenario like this, take all 3 existing DCs offline, restore one (PDC or not), resolve the rep issues, then rebuild the remaining 2.
manvscar@reddit
Yes, I would focus solely on getting just one DC functioning and users authenticating. Once you have one working then forcefully demote all others and then build new to replace them.
They may have one DC that is still functional.
jrichey98@reddit
If they're out of sync demote won't work. You have to clean out DNS, then you just promote a new VM. Honestly I'm not even sure if the DNS clean is required if you rebuild to same name & ip (which we always do). I've been there and done that, but it's been a while.
robotbeatrally@reddit
I'm not very experienced in this, but that was my first thought too.
jrichey98@reddit
You could also try to force replication from your best one:
Alternately there is a way forward:
Test-ComputerSecureChannel -Repair $(Get-Credential)
Hopefully this is recent enough that not too many systems have updated their computer accounts with a bad DC.
It's completely recoverable. It's just a question of how much of a pain it's going to be. In the future, if you have an issue with a DC, just offline it and rebuild it, which is no big deal.
Useful commands for checking replication:
Replication is something to keep on top of. You don't notice it immediately when it breaks because things work for a while until computer accounts start being updated. I've personally been trying to figure out what's wrong with Exchange, then started having issues with other services/users, only to realize a bit later that one of our DCs was a week out of sync.
naus65@reddit
Call MS support.. it's $500 bucks.
FenixSoars@reddit
Oh boy, which health system?
Whyd0Iboth3r@reddit (OP)
It's not a health system.
FenixSoars@reddit
You mentioned PACS admin, I just assumed lol
Whyd0Iboth3r@reddit (OP)
It is an imaging company, but not a major health system.
myrianthi@reddit
When you restored a DC from backup you took all the other DCs offline, right? ...right?
Whyd0Iboth3r@reddit (OP)
nope. Probably what got us into this pickle.
myrianthi@reddit
Yeah. Well, you could do what I suggested in my other comment. Take all of the DCs offline and then restore again from backup. That's what I would try next, but it might be best to contact Microsoft and have one of their specialists work on this. It probably won't be cheap but I'm sure it will be worth it.
SCUBAGrendel@reddit
I just worked this exact error with Horizon VDI. Check GPO settings to make sure that RPC is not locked down too tight.
SpiceIslander2001@reddit
The recovery process I might try, seeing that you have only three DCs:
Check the event logs on the DCs to see which one is successfully authenticating most of the time.
On the DC that's confirmed to be working, seize all FSMO roles.
Shut down the other DCs, i.e. power them down. Take this opportunity to move DHCP to another server that's not a DC. Check and confirm that authentication is working for mostly everyone. A few passwords may have to be reset, and a few computers may need to be rejoined to the domain because, well, the AD was borked. Check the security event log again to quickly determine where authentication is failing.
Once all authentication is working as expected, delete the other DCs from the domain.
Build new server OS installs, configure them with the IP addresses of the old DCs if necessary, promote them to DCs.
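For the first step, picking the DC that is authenticating most reliably, a small Python sketch of the comparison. It assumes you have already tallied successes and failures per DC from each Security log yourself (for example logon events 4624 and 4625); the counts and the function name `rank_by_auth_success` are made up for illustration.

```python
def rank_by_auth_success(counts):
    """Rank DCs by authentication success ratio, best first.

    counts maps DC name -> (successes, failures), tallied by hand or by
    whatever log export you trust. A DC with no events ranks as 0.0.
    """
    def ratio(item):
        successes, failures = item[1]
        total = successes + failures
        return successes / total if total else 0.0

    ranked = sorted(counts.items(), key=ratio, reverse=True)
    return [name for name, _ in ranked]


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the OP's environment.
    sample = {"DC6": (900, 100), "DC7": (400, 600), "DC8": (10, 990)}
    print(rank_by_auth_success(sample))
```

The top of that list is only a candidate for the FSMO seizure; it is still worth confirming by hand that logins against it actually work.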
I agree with the others though - if you're not familiar with this, make the call to MS for support.
Kahless_2K@reddit
It's probably too late for this to help you now, but the first thing I would have checked is the time on all DCs.
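Kerberos only tolerates 5 minutes of clock difference by default, so a check like the Python sketch below shows what "check the time" boils down to. How you collect each DC's clock is up to you (`w32tm /monitor` is one way); the DC names, timestamps and the function name `clocks_out_of_tolerance` are placeholders for this example.

```python
from datetime import datetime, timedelta

# Kerberos default for "maximum tolerance for computer clock synchronization".
MAX_SKEW = timedelta(minutes=5)


def clocks_out_of_tolerance(reference, dc_clocks, max_skew=MAX_SKEW):
    """Return {dc_name: skew} for every DC whose clock differs from the
    reference time by more than max_skew."""
    return {
        name: abs(clock - reference)
        for name, clock in dc_clocks.items()
        if abs(clock - reference) > max_skew
    }


if __name__ == "__main__":
    # Hypothetical readings taken at the same moment from three DCs.
    reference = datetime(2024, 11, 1, 12, 0, 0)
    readings = {
        "DC6": reference + timedelta(seconds=20),
        "DC7": reference - timedelta(minutes=9),
        "DC8": reference + timedelta(minutes=2),
    }
    for name, skew in clocks_out_of_tolerance(reference, readings).items():
        print(f"{name} is off by {skew}")
```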
rose_gold_glitter@reddit
Seize the FSMO roles from another DC. Check you don't have the current pdc hard coded in any policies or scripts. Basically prepare to demote it.
Legionof1@reddit
Hire a real admin, we don’t work for free.
Ragepower529@reddit
You have 48 hours gl
Cormacolinde@reddit
Call a local IT consultant. You will not fix this by yourself.
mrfoxman@reddit
See if you can pull an IFM, stand up a new machine and promote it, seize fsmo roles, and then start rebuilding the 3 off the new one.
muzzlok@reddit
This makes me laugh. Please continue with this fiction.