Determining root cause of workstations losing trust relationship

Posted by Florida_Wrangler@reddit | sysadmin | 73 comments

Hey everyone, I'm a jr sysadmin looking for some advice on this issue.

I work in an office for a company that has a hybrid AD environment. In the several months I've been here, we've had 12 laptops lose their trust relationship with the domain. I'm not sure if this is typical, but at my last job I worked remote help desk, and this issue rarely happened. When it did, it usually meant the person had been out for an extended period and hadn't logged on. Which is not the case here, all of these instances have happened in the middle of the day.

I can resolve the issue fairly quickly with a PowerShell command or by just plugging the laptop directly into the network. My boss, on the other hand, prefers to rejoin the computers to the domain and rename them when this happens.

I'm concerned there may be a larger underlying problem. I'm not sure if it has something to do with the fact we reserve IPs for all workstations on both the wired and wireless network.

I'm looking for some advice because the historical solution has been to rename the device, rejoin it to the domain, and move on. The problem is that this can cause significant downtime for the affected user, especially if they can't get ahold of us right away.

[-]

zrad603@reddit

Check the time. This usually happens whenever the time servers on the DCs get all goofed up.

[-]

SteveSyfuhs@reddit

Time does not cause trust relationships to go boink. Deleting secrets causes trust relationships to go boink. Every single thing falls back to the secret problem.

[-]

zrad603@reddit

and as a matter of policy, I try to make sure that any DC or hypervisor or anything important gets their time directly from time.nist.gov because I've seen some weird things where the clocks start to stray and all the servers start referencing the time on other internal systems and it starts to snowball until one day you come in, and every computer has the wrong time, and the trust relationship on everything is broken.

[-]

xendr0me@reddit

Only the DC holding FSMO roles should get external time sync. Every other DC should sync to that FSMO DC

[-]

zrad603@reddit

That is true... however, here's what happened at a company I had just started at a few years ago:
DC1 = physical 1U server
VM1 = vmware server
DC2 = virtual machine running on VM1

VM1 had its time source pointing to DC1
Somehow DC2 became the FSMO role holder (that was not intentional)
So VM1 started getting its time from DC2, but DC2 was getting its time from VM1, and VM1 was getting its time from DC1.

So basically you had a circle of servers getting the time slightly wrong, and everything spiraled out of control.

So from that point on, everything important just pointed at time.nist.gov
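
For anyone wanting to catch this kind of loop before it snowballs, here is a minimal Python sketch. It assumes you have already collected each host's configured time source into a dictionary; the host names mirror the story above, and the `looped` and `fixed` layouts are illustrative guesses, not the actual configuration.

```python
def find_time_loop(sources, start):
    """Follow start's chain of time sources; return the loop if there is one."""
    seen = []
    host = start
    # Hosts that are not keys in the map are treated as external sources.
    while host in sources:
        if host in seen:
            return seen[seen.index(host):] + [host]
        seen.append(host)
        host = sources[host]
    return None


# Hypothetical layouts: host -> where it gets its time from.
looped = {"VM1": "DC2", "DC2": "VM1", "DC1": "time.nist.gov"}
fixed = {"VM1": "DC1", "DC2": "DC1", "DC1": "time.nist.gov"}

print(find_time_loop(looped, "VM1"))  # ['VM1', 'DC2', 'VM1']
print(find_time_loop(fixed, "VM1"))   # None
```

A `None` result only says the chain from that host ends at something outside the map; it says nothing about whether that external source is accurate.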

[-]

Frothyleet@reddit

I get why you'd do it, especially if you are in a position where you're firefighting and have to triage your time, but I'm never satisfied with a poor solution to bad configuration.

[-]

zrad603@reddit

and what is your suggestion?
What's wrong with pointing important servers at time.nist.gov?

[-]

Frothyleet@reddit

You want to only have one local authoritative time source that is doing its syncing from an external source, and have the rest of your infrastructure pointing at that server. This ensures intranetwork time consistency.

[-]

Mountain-eagle-xray@reddit

We have hardware clocks. Do people still do that?

[-]

Frothyleet@reddit

Not host their own NTP source? Yes? Probably the vast majority of the infrastructure out there relies on external NTP sources. There are relatively few applications where a GPS NTP source would provide measurable benefit, let alone be any kind of necessity. Most eyeball networks just need to be within 5 minutes of the local infrastructure, if that.
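
The "within 5 minutes" figure matches the default maximum clock skew that Kerberos tolerates. A small Python sketch of that comparison, assuming you already have both clock readings in hand (the timestamps below are made up, and the tolerance is configurable in real environments):

```python
from datetime import datetime, timedelta

# Default Kerberos tolerance; treat it as a setting, not a constant of nature.
MAX_SKEW = timedelta(minutes=5)


def within_skew(client_time, dc_time, max_skew=MAX_SKEW):
    """Return True if the two clocks are close enough for authentication."""
    return abs(client_time - dc_time) <= max_skew


dc_now = datetime(2024, 1, 1, 12, 0, 0)
print(within_skew(datetime(2024, 1, 1, 12, 3, 0), dc_now))   # True
print(within_skew(datetime(2024, 1, 1, 11, 52, 0), dc_now))  # False
```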

[-]

Lower_Fan@reddit

Do all sync, especially for a hybrid environment. When in the network it syncs with the DC, and when out of the network it syncs with the NTP pool.

[-]

mnvoronin@reddit

To be more specific, the role in question is PDC Emulator, because timekeeping is one of its tasks.

[-]

bbbbbthatsfivebees@reddit

We use pool.ntp.org because time.nist.gov, while extremely reliable, is also a single point of failure if it goes down. The ntp.org sources are a bunch of pooled NTP servers with load balancing, so if one of their thousands of servers drops out of the pool, you're still getting reliable time info.

[-]

Smith6612@reddit

Only problem with that is if you have strict firewall controls for NTP, or if you have SNTP requirements. Servers leave and join the pool all the time. The NIST time servers are distributed across a few locations in the US, and are relatively static. They have an SNTP service as well if you reach out to them.

[-]

Fantastic-Shirt6037@reddit

Would rec the NTP pool tbh

[-]

Florida_Wrangler@reddit (OP)

Thank you for this. That's exactly the scenario I'm trying to avoid.

[-]

Vektor0@reddit

This happens when the laptops can't connect to the domain controller for long periods of time.

[-]

Denver80211@reddit

The 'long period of time' is linked to your password expirations, by the way. If your passwords are set to expire every 90 days, machines that are off the network for 91+ will drop off.

[-]

mnvoronin@reddit

OP literally said it's happening in the middle of the day, so that's not the case here.

Time skew or bad AD replication health, on the other hand...

[-]

Vektor0@reddit

Time of day doesn't matter.

[-]

JerikkaDawn@reddit

If it's happening all of a sudden in the middle of the day to a previously working machine, then it's not because the machine hasn't contacted a DC. That was the point of the statement.

[-]

Vektor0@reddit

That's not true. It can and does happen during the day.

[-]

mnvoronin@reddit

Bro, stop doubling and tripling down and inventing more and more unlikely scenarios, just admit you are wrong.

In order for the behaviour observed by OP to be caused by computer password expiry, PC should not be able to contact a DC for at least 30 days in a row. However, in order to receive the "trust relationship" message, PC must be able to contact a DC but not have a valid password to authenticate.

Now tell me, what are the chances for a laptop that has not been talking to any DC (but apparently able to access other local resources) for over a month and suddenly regained LoS in the middle of the working day? And what are the chances that it happened TWELVE times over the course of several months?

Now let's imagine that OP's network has a desynced DC that is listed as a secondary DNS. It's not contacted frequently, only when the primary doesn't respond to a DNS query for one reason or another. The laptop reaches the secondary, tries to authenticate, but is turned away because its password is no longer valid due to the desync. Now THAT can easily happen randomly and multiple times over the course of a few months.

[-]

Vektor0@reddit

> Now tell me, what are the chances for a laptop that has not been talking to any DC (but apparently able to access other local resources) for over a month and suddenly regained LoS in the middle of the working day?

This often happens when users work from home or on the road. They will be disconnected from the domain controller for long periods of time, and then when they connect to the main network, the domain controller has expired the machine password, because it hasn't checked in in a long time. This can happen even if the user is working daily.

Also, OP didn't mention any "local resources," and there is no such thing as "LoS" in IP networking. Subnetting/NAT, firewalls, DNS, etc. all render the term "LoS" meaningless.

> Now let's imagine that OP's network has a desynced DC that is listed as a secondary DNS.

This is also possible. If one of the domain controllers hasn't seen the machine in a while, it will expire the machine's password.

[-]

mnvoronin@reddit

> When it did, it usually meant the person had been out for an extended period and hadn't logged on. Which is not the case here, all of these instances have happened in the middle of the day.

This from OP's post, right there.

[-]

Vektor0@reddit

Yes, and?

[-]

mnvoronin@reddit

My point is that you keep inventing more and more unlikely scenarios.

That snippet from OP's post completely invalidates your theory of "users working from home or on the road for long periods of time". He quite literally offered this scenario and said "it's not the case here".

[-]

Vektor0@reddit

You are presuming that "out" means "working outside of the office" when he could just as easily have meant "out on vacation/leave/etc." and therefore not working. Just like you presumed that local resources were available, when OP did not specify that. You presume too much.

Regardless, whether a user is out on vacation, out working remotely, or in the office, the time of day it happens does not point to a specific cause. That piece of info is irrelevant.

If a domain controller does not see contact with a machine recently, it will not have a valid trust relationship with it. That is the only point I've made. There isn't enough info in the OP to diagnose the issue further with any amount of confidence.

[-]

mnvoronin@reddit

Why do you keep insisting on the "working outside of the office" scenario? OP specifically called it out and said it's not the case here, FFS!

And yes, if the user is working from the office, the time of day (or, rather, the fact that it's not happening the moment user logs in to their computer) does make a lot of difference.

> Just like you presumed that local resources were available, when OP did not specify that.

Is assuming that there are some local resources in use, when the network has on-prem AD, too big a stretch? Less likely than OP somehow missing the fact that the TWELVE users who lost their trust relationship were just back from vacation or had been working from home for a long time? Missing it, despite calling it out in the post?

> If a domain controller does not see contact with a machine recently, it may not have a valid trust relationship with it.

That actually doesn't happen. You can be away from the office for a year only using email and web apps, come back to the office and still have a healthy trust with a DC. A domain controller does not expire a computer account's password of its own accord.

[-]

Vektor0@reddit

> Why do you keep insisting on the "working outside of the office" scenario? OP specifically called it out and said it's not the case here, FFS!

False. As stated in my previous comment, what OP meant by "out" is ambiguous.

But it doesn't matter anyway; I don't know why you're hung up on this. Maybe the users are physically in the office, or maybe they're not. Maybe the users are working, or maybe they're on vacation. All are possibilities. It is useless to make assumptions. There are issues that could cause this in any scenario.

[-]

bfodder@reddit

You're a very exhausting person.

[-]

Vektor0@reddit

You are presuming that "out" means "working out of the office" when it could just as easily mean "out on vacation/leave/etc." and therefore not working. Just like you presumed that local resources were available, when OP did not specify that. You presume too much.

It is pretty inarguable that if a trust relationship failed with a domain controller, that means the machine wasn't able to renew its relationship with that domain controller. There are a billion reasons why that might happen, among which are time sync, replication, and network connectivity. I haven't made any points other than that, so I don't understand what you're arguing against.

[-]

AmiDeplorabilis@reddit

This. Not that the other responses are wrong--they are definitely right--but this is an overlooked problem.

Just because a device was joined to the domain once doesn’t mean it's permanently joined. The device name will have to be deleted from the domain and the device itself removed before it can be rejoined.

I actually had to do this about 3mo ago.

[-]

Steerable-Octopus@reddit

I'm a bit skeptical. From my experience of GSS, the keytabs which define the principal seem to be permanent. Does modern Windows now require them to be updated like an X.509 certificate? The way I understand it, they use them to request tickets, which are the authenticated communications identity of the principal, and these certainly have a limited (but long) duration.

[-]

AmiDeplorabilis@reddit

Define "long".

In my case, it was several months of being powered down. Once it was powered back on and the time was correctly synchronized, it still wouldn't (couldn't?) rejoin the domain. I thought, once the time synced, the rejoin would happen, but it didn’t.

But that's an excellent question, and I don't know the answer.

[-]

Steerable-Octopus@reddit

Ah, I think you're right after all. First, the lifetime of the tickets that the principal requests can vary (I've set them to just a few hours myself), but I think the default is like 48 hours. However, apparently Windows has a "netlogon feature" that rotates the password of the principal every 30 days. This is likely what you saw in your troubleshooting!

[-]

AmiDeplorabilis@reddit

I didn’t do a lot of troubleshooting, just made some conclusions after dotting a few i's, unjoined and deleted, then rejoined.

But I promise, I won't let this go to my head!

[-]

Denver80211@reddit

If the laptops are off the network for an extended period of time, they will drop off. I believe that time is actually linked to password expiration in days. So if your passwords are set to expire every 30 days and a laptop is off for 31 days, it will drop off the network.

[-]

alphaxion@reddit

Is there anything cropping up in the security logs on your DCs about changes to AD objects? What do the event logs on the client say?

If you're not already, export your domain controller application, system, and security logs to a SIEM (such as an Elastic stack) and build a dashboard that shows certain event IDs relating to Active Directory. It might help you track down some patterns or uncover some processes that are happening that are transparent at the MMC level.
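
As a rough idea of what such a dashboard would aggregate, here is a small Python sketch that tallies computer-account events per machine. Event IDs 4741, 4742 and 4743 are the Security log IDs for a computer account being created, changed and deleted; the dictionary layout of each event (`event_id`, `target`) and the sample records are assumptions for illustration, not the schema of any particular SIEM export.

```python
from collections import Counter

# Security log event IDs for computer account lifecycle changes.
COMPUTER_ACCOUNT_EVENTS = {4741: "created", 4742: "changed", 4743: "deleted"}


def summarize(events):
    """Count computer-account events per (target account, action)."""
    counts = Counter()
    for event in events:
        action = COMPUTER_ACCOUNT_EVENTS.get(event["event_id"])
        if action:
            counts[(event["target"], action)] += 1
    return counts


# Hypothetical exported records.
events = [
    {"event_id": 4742, "target": "LAPTOP-07$"},
    {"event_id": 4743, "target": "LAPTOP-07$"},
    {"event_id": 4624, "target": "jdoe"},  # logon event, ignored here
    {"event_id": 4742, "target": "LAPTOP-07$"},
]

for (target, action), n in sorted(summarize(events).items()):
    print(target, action, n)
```

A machine whose account shows up as changed or deleted shortly before it lost trust would be a good place to start looking.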

[-]

Subject-Jellyfish165@reddit

Check if your environment has a mix of DC servers. If you have Server 2025 with any other version, it will cause this behavior as workstations would not be able to reset their own machine passwords.

[-]

BrechtMo@reddit

Interesting. Do you have more info on this?

[-]

iamLisppy@reddit

May be related to what they're saying, but this is one of the reasons we aren't going to WS2025 as of yet: "Completely lost on a domain logon issue" (r/sysadmin) and "RC4 issues" (r/activedirectory).

Posts are old now (nearly a year) so this may be fixed. Haven't looked since.

[-]

MidgardDragon@reddit

Time out of sync, or replication not being completed properly on some DCs. If you have multiple DCs, can you find out if this happens only when they are hitting certain DCs and not others? Sometimes it's even just DNS issues or the PCs not connecting to the network fast enough.

[-]

Ikhaatrauwekaas@reddit

Time on the endpoint and switches etc.
DCOM errors, domain health check: everything needs to be checked.

[-]

E__Rock@reddit

Have any automation to deactivate the device in Active Directory due to lack of use? That's pretty standard in most orgs. Haven't reached out and connected to the domain in a while? Deactivate list.

[-]

HotTakes4HotCakes@reddit

I have been wanting to do this, but there are too many cases where it would become a problem. Now it just triggers an email out to the primary user, if there is one, reminding them to connect in x amount of days or it'll be disabled, cc'd to their manager.
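
A minimal Python sketch of that warn-then-disable logic, assuming you can pull a last-seen date per device from somewhere. The 60 and 90 day thresholds and the device names are placeholders, not anyone's actual policy.

```python
from datetime import date


def classify(last_seen, today, warn_after=60, disable_after=90):
    """Decide what to do with a device based on days since last contact."""
    age = (today - last_seen).days
    if age >= disable_after:
        return "disable"
    if age >= warn_after:
        return "warn"
    return "ok"


today = date(2024, 6, 1)
# Hypothetical inventory: device name -> last time it contacted the domain.
inventory = {
    "LAPTOP-01": date(2024, 5, 20),
    "LAPTOP-02": date(2024, 3, 25),
    "LAPTOP-03": date(2024, 1, 10),
}

for name, last_seen in inventory.items():
    print(name, classify(last_seen, today))
```

If automation like this is already running, a laptop landing in the "disable" bucket by mistake would look a lot like a random trust failure, so it is worth checking what thresholds are in place.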

[-]

HotTakes4HotCakes@reddit

Going to repeat what everyone else said:

What I found was DNS issues and controllers that weren't speaking to each other properly.

[-]

xxbiohazrdxx@reddit

Almost certainly replication. You’ve got a DC somewhere that is out of sync or tombstoned but is still listed as a GC in your DNS.

[-]

iamwayycoolerthanyou@reddit

Yep. Recently ran into this. It requires a rebuild. Six months since the last sync. File servers with different shares on each DC that are not replicated. Terrible architecture implemented by a dumbass. But I lucked out because they won't approve any work on it, so I don't have to deal with it.

It's sketchy because you don't know what changes are on that DC and what the effects of removing it will be until you take it off the network.

[-]

HotTakes4HotCakes@reddit

> Terrible architecture implemented by a dumbass

I have a name, ya know.

[-]

SourceNo2702@reddit

Pretty good chance it’s DNS, these sorts of things tend to happen if you’re running a DNS server on both domains.

Next time it happens, get the IP of the workstation and run nslookup from the domain controller it’s supposed to be connected to ($env:logonserver stores the DC it’s currently on). If the domain controller is showing the wrong IP or no IP, it’s DNS.

At my org this issue turned out to be because we were sharing DNS between AD 1 -> AD 2, but not from AD 2 -> AD 1. Since we configured the workstations to use both as a DNS server it was occasionally grabbing DNS from AD 1 instead of AD 2, which would cause the workstation to immediately lose trust because AD 2 couldn’t read the DNS record on AD 1.
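
The nslookup comparison above can be scripted. This is only a Python sketch: it resolves through whatever DNS the machine running it is configured to use, so it would need to run on the DC in question to mirror the manual check. The host name, IPs and the `fake_resolve` helper are invented so the example runs without a network.

```python
import socket


def dns_matches(hostname, actual_ip, resolve=socket.gethostbyname):
    """Return True if DNS resolves hostname to the IP the workstation really has."""
    try:
        return resolve(hostname) == actual_ip
    except OSError:  # socket.gaierror (no record) is a subclass of OSError
        return False


def fake_resolve(name):
    """Stand-in resolver with one hypothetical record, for demonstration only."""
    records = {"laptop-01.corp.example": "10.0.0.25"}
    if name not in records:
        raise socket.gaierror("no record")
    return records[name]


print(dns_matches("laptop-01.corp.example", "10.0.0.25", fake_resolve))  # True
print(dns_matches("laptop-01.corp.example", "10.0.0.99", fake_resolve))  # False
print(dns_matches("laptop-02.corp.example", "10.0.0.30", fake_resolve))  # False
```

Dropping the third argument makes it use the system resolver.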

[-]

JoDrRe@reddit

Joke answer: you lose their trust because you don’t cherish them.

[-]

_araqiel@reddit

Is one of your domain controllers server 2025?

[-]

whydontyouwork@reddit

Are you reimaging laptops under the same environment? Is there computer name duplication happening? Do you have an OU with names or serials in it that is causing a clash?

[-]

antiduh@reddit

NAC certs expired?

[-]

Commercial_Growth343@reddit

Technically speaking it happens when the password the computer is using no longer matches AD. But how? Usually it is because you work somewhere that checks for usage information and then disables and/or deletes the computer after so much time has passed.

The only other reasons I have seen this happen are when Windows itself does an automatic recovery of some form, and the old password is restored along with it, or when a virtual machine is reverted to an older snapshot taken prior to the last password change, and thus the passwords are out of sync.
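
To make the mismatch concrete, here is a deliberately simplified Python model. It is not how Windows stores or negotiates the secret (real clients and DCs handle more than one password value, for instance); it only shows the shape of the problem: the client initiates the change, both sides hold a copy, and a restore on one side leaves them out of step. The `Directory` and `Machine` classes are invented for this sketch.

```python
import secrets


class Directory:
    """Toy stand-in for AD: computer name -> current machine password."""

    def __init__(self):
        self.passwords = {}


class Machine:
    def __init__(self, name, directory):
        self.name = name
        self.secret = None
        self.rotate(directory)

    def rotate(self, directory):
        # The client generates the new password and pushes it to the directory.
        self.secret = secrets.token_hex(16)
        directory.passwords[self.name] = self.secret

    def trust_ok(self, directory):
        return directory.passwords.get(self.name) == self.secret


ad = Directory()
vm = Machine("VM-01", ad)
snapshot = vm.secret          # snapshot taken before the next rotation

vm.rotate(ad)
print(vm.trust_ok(ad))        # True: both sides hold the new password

vm.secret = snapshot          # revert to the old snapshot
print(vm.trust_ok(ad))        # False: the client is back on the old password

del ad.passwords["VM-01"]     # account deleted by a cleanup job
print(vm.trust_ok(ad))        # False
```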

[-]

missed_sla@reddit

What OS versions are your domain controllers running? Server 2025 doesn't play well with older versions without special setup.

[-]

MashPotatoQuant@reddit

Your environment needs AI agents. With AI agents this problem will go away.

[-]

Vektor0@reddit

You need to put an /s on comments like this so people know you're being sarcastic.

[-]

MashPotatoQuant@reddit

I am not being sarcastic. You don't have to let it go ham with automatic actions or write access but they absolutely are great for log analysis and finding root causes for issues like this.

[-]

Ozmorty@reddit

More AI agents could certainly make this the least of their problems.

[-]

Lukage@reddit

This is not good advice, nor an actual resolution.

[-]

Steerable-Octopus@reddit

Here's my perspective as a Linux admin:

Kerberos, which LDAP uses, relies on DNS and NTP. Windows machines will by default try to synchronize time with the domain controller, and UEFI typically keeps the system clock going even while the machine is inactive, so in my experience this type of failure is quite rare unless you deliberately synchronize with something like a third-party NTP provider whose clock is completely different from that of your domain controllers (DCs).

The DNS for these types of records is internal, or at least should be internal. Likely on the DC itself but it can still cause problems:

For example, the DNS server may be pointing toward an inactive DC that is still listed in the DNS records but has been powered down or decommissioned. Or you may be using another set of DNS servers that forward to the domain controllers.

The DNS server will maintain SRV records that the clients try to contact to find the correct domain controllers. And they use a form of load balancing between them. All DCs which are configured in the DNS to manage the domain must therefore be active and be able to serve the Kerberos tickets for the system to be functional.
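
A short Python sketch of how one might hunt for a stale DC in those SRV records. It only builds the well-known locator record names (to feed into something like `nslookup -type=SRV`) and compares a list of SRV targets against the DCs you know to be alive. The domain and host names are placeholders, and the actual lookup is left out because the standard library has no SRV resolver.

```python
def dc_locator_names(domain):
    """SRV record names that clients query to locate domain controllers."""
    return [
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.{domain}",
        f"_kerberos._tcp.{domain}",
    ]


def stale_targets(srv_targets, live_dcs):
    """Return SRV targets that do not correspond to any known-live DC."""
    live = {h.lower().rstrip(".") for h in live_dcs}
    found = {t.lower().rstrip(".") for t in srv_targets}
    return sorted(found - live)


print(dc_locator_names("corp.example"))

# Hypothetical SRV answers versus the DCs that are actually running.
srv_answers = ["dc1.corp.example.", "dc2.corp.example.", "OLDDC.corp.example."]
running = ["dc1.corp.example", "dc2.corp.example"]
print(stale_targets(srv_answers, running))  # ['olddc.corp.example']
```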

The kerberos tickets themselves typically have quite long durations so they will be valid for quite a long time even when the underlying dependencies are faulty. If you use them regularly, they will simply renew themselves with the DC.

If your network uses dot1x, this is an additional possible failure point in the stack. Dot1x in my brief experience is typically quite fragile. The devices won't even be able to communicate with the network stack at all if they're not able to authenticate their principal for this first.

The keytabs that define the machine principal which your hosts use shouldn't have an expiry by default, so the underlying Kerberos configuration client side shouldn't be the problem.

So yeah, the failure points are essentially clock skew, DNS misbehavior or problems communicating with the DCs.

You could also check your firewalls to make sure that all the ports are reachable. Active Directory uses a myriad of TCP and UDP ports (most notably the RPC listener and RPC range) and it could be that they're misconfigured.
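
For the firewall check, a minimal Python sketch that tests TCP reachability from a client to a DC. The port list covers the commonly cited fixed AD ports only; the dynamic RPC range and the UDP side are not covered, and any host name passed in is a placeholder for one of your own DCs.

```python
import socket

# Well-known TCP ports used by AD services (the dynamic RPC range is not listed).
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    464: "Kerberos password change",
    636: "LDAPS",
    3268: "Global Catalog",
}


def tcp_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dc(host, ports=AD_TCP_PORTS):
    """Map each port to whether it was reachable on the given host."""
    return {port: tcp_open(host, port) for port in ports}
```

Calling `check_dc("dc1.corp.example")` from an affected laptop and from a healthy one, then comparing the two results, would show whether a firewall rule treats them differently.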

[-]

Florida_Wrangler@reddit (OP)

Wow, thanks everyone. I wasn't expecting this many responses so quickly. I'm seeing a lot of recurring themes and potential causes that I plan to investigate. I'll provide an update if I find anything.

As stated, I'm a jr admin, so I appreciate everyone taking the time to share their knowledge and experience.

[-]

rotfl54@reddit

Are you or someone else joining devices with the same computer name?

[-]

Florida_Wrangler@reddit (OP)

Actually, the first time this happened that was the case and I went through our inventory to make sure there weren't others.

Our naming schema was the issue there, which we've since changed, as we have some flexibility with those.
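
For anyone repeating that inventory pass, a tiny Python sketch of the duplicate check. It assumes the inventory can be exported as a plain list of computer names and compares them case-insensitively; the names shown are made up.

```python
from collections import Counter


def duplicate_names(names):
    """Return computer names that appear more than once, ignoring case and padding."""
    counts = Counter(name.strip().upper() for name in names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical inventory export.
inventory = ["LT-SALES-01", "LT-SALES-02", "lt-sales-01", "LT-HR-01 "]
print(duplicate_names(inventory))  # ['LT-SALES-01']
```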

[-]

Brather_Brothersome@reddit

That is directly related to time sync. Make sure you have a trusted time source and assign it so everything is on the same time.

[-]

b4k4ni@reddit

What comes to mind:
* Virtualization is syncing the time with the DCs. That shouldn't happen. The main DC with all roles should be the single time source, and its w32time should target a local NTP server or your local ntp.org pools. Hyper-V still has automatic time sync active with the guest, if I remember right.
* Generally, time issues somewhere. Check it.
* No DC as DNS server. Clients need that to register themselves the right way. Make sure that, if connected over VPN, the client can actually reach the DCs. Same for the local network.
* Some old DC or other DCs having issues and not syncing right. There are a lot of tools to check the sync.

Check the client logs to see whether it can connect to AD or whether any issues are reported. There are a bunch of known issues around that, and Google results on how to fix them. It can be basically anything without looking at the system.

[-]

gzr4dr@reddit

A few good suggestions already. To confirm, are these on-prem or remote users? Assuming on-prem, but if remote they will lose their trust if the computer password can't update/sync over a set duration (you'll need to look up the exact duration). I'd go with a DC replication issue if these are regularly on-prem workstations.

[-]