Puzzling DHCP Issue - Assistance Requested
Posted by Jet_mech91@reddit | sysadmin
I work as a sysadmin for a moderately sized environment (~1000 systems). We have several DHCP scopes in our domain, with one being a build VLAN for imaging new systems and the rest being various user scopes. Our Domain Controllers double as our DHCP and DNS servers for the entire domain.
Normally we image workstations on the build VLAN, from which they join our domain and get drivers/software/updates through the task sequence and MECM, before we move them over to our primary user VLAN (802.1x enabled) to receive a DHCP lease. This has worked fine for years, but as of last week we've suddenly found that newly imaged systems are no longer receiving DHCP leases on the primary user VLAN.
We've confirmed that when connected, we can track the device MAC across the network devices up to the switch bordering our DHCP server, so the requests seem to be getting out there. Our two load balanced DHCP servers are showing hits for the workstation MAC addresses for lease requests on the build VLAN, but zero hits at all for the primary user VLAN after switching.
DHCP for the primary user VLAN works for all existing systems in the environment, even after I released the lease on a test system, ensured it was removed from DHCP and DNS, and left it powered down until it fell off the switch MAC address tables. Expanding on this, newly imaged devices that are given a static IP on the primary user VLAN are subsequently able to pull new DHCP leases when the static IP is deconfigured.
The only error message of note I have found is a DHCP event viewer log that shows error 0x79. However, based on my reading, that suggests either our scopes are full (they're not), there is an IP conflict (not sure how this would be relevant for a new device on DHCP), or our network settings are "misconfigured" (DHCP scope settings look correct and do not appear to have changed since before the issue started). The only recent change to our knowledge is a GPO update that enabled Windows Defender Firewall on our servers with domain profile traffic set to Allow All Inbound/Outbound (Public and Private are set to block inbound by default). All other administrative entities (network, forest level) deny making any changes on their end.
Due to separation of duties and red tape from security policy, I am not currently approved to utilize packet sniffing software to try and trace the DHCP traffic.
Any ideas or thoughts as to why only one out of 5 DHCP scopes has decided to stop leasing to brand new devices are greatly appreciated.
mrhobby@reddit
Can you pcap on both dhcp server and previous hop router/switch? You mentioned that the request never shows up on the server. Find out where exactly it is lost and what is dropping it.
Jet_mech91@reddit (OP)
I can likely swing permission to temporarily run Wireshark or something on the DHCP servers come Monday, but getting the network team to run a pcap on their end is gonna be like pulling teeth.
nlbush20@reddit
Are they getting an APIPA address? If you’re using a failover relationship for the two DHCP servers, make sure the failover relationship is healthy. Ours fall out of sync occasionally and we have to restart the DHCP Server services on both servers. The main symptom for us when this happens is that newly imaged computers can't get an IP address.
Jet_mech91@reddit (OP)
We are indeed getting an APIPA. Failover was healthy when I checked and each server reports roughly 30% of its allotted IPs are available. Newly imaged systems for us grab a DHCP lease on the build network when they PXE boot and hold that lease through the entire task sequence. The same DHCP servers manage the build and user VLANs.
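For anyone scripting a check across a batch of freshly imaged machines, an APIPA address is simply anything in the IPv4 link-local range 169.254.0.0/16. A minimal Python sketch (the helper name `is_apipa` is made up for illustration):

```python
import ipaddress

# IPv4 link-local range that Windows falls back to when no DHCP offer arrives.
APIPA_NET = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(addr: str) -> bool:
    """Return True if addr falls inside the APIPA (IPv4 link-local) range."""
    return ipaddress.ip_address(addr) in APIPA_NET
```

Feeding it the address a client reports gives a quick yes/no on whether the machine self-assigned instead of getting a lease.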
St0nywall@reddit
How are your 2 DHCP servers load balanced: split scopes, or just the same scopes on two different servers?
Jet_mech91@reddit (OP)
Same scopes on two DHCP servers configured for 50-50 active load balancing.
St0nywall@reddit
Do you have ip helper setup on all of your switches to point to both DHCP servers?
Jet_mech91@reddit (OP)
I have been assured by the network team that the IP Helpers are configured properly.
Mac-Gyver-1234@reddit
In computer science, as in other sciences, we only trust what others say if it has been peer reviewed. Peer review is an extensive process that certifies that a claim stands on scientific grounds.
What I want to tell you by that:
Never believe, always know. Check and test yourself. Unless you are presented data that stands on the grounds of science.
The world is not flat, the earth is not at the center of the universe, and your problem will only be solved if you check and test it yourself by the rules of science: objective, reproducible, and valid.
techretort@reddit
Damn, you hit on something that's been banging around in my head for a minute for my work things. I keep asking questions, getting an answer that feels wrong, and have to go with it because I'm relying on a different team.
Peer review might be the way to push back without being TOO much of a dick about it..
Mac-Gyver-1234@reddit
In science it is not one peer reviewing, but at least a dozen. The peers are credible and well known individuals.
Cultural-Horse-762@reddit
I work with so many technicians that would rather think through possible problems before actually looking at the problem.
Cormacolinde@reddit
Do you actually see requests from the same MAC appear on both server logs around the same time?
Jet_mech91@reddit (OP)
We do from the build VLAN: one server fields the lease and the other predictably throws an error indicating the lease is being handled by another server. Once we move to the user VLAN we don't see any hits from that device MAC in the DHCPsrv logs on either DHCP server.
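For reference, the "any hits for this MAC" check can be scripted rather than eyeballed. This is only a rough Python sketch: it assumes the usual comma-separated DHCP audit log layout (ID, Date, Time, Description, IP Address, Host Name, MAC Address, ...) with the MAC in the seventh column, and the function names are invented for illustration.

```python
import csv
import re
from typing import Dict, Iterable, List


def normalize_mac(mac: str) -> str:
    """Strip separators and upper-case so 00-15-5d... matches 00155D..."""
    return re.sub(r"[^0-9A-Fa-f]", "", mac).upper()


def find_mac_events(lines: Iterable[str], mac: str) -> List[Dict[str, object]]:
    """Return audit-log rows whose MAC column matches the given MAC.

    Assumed column order: ID, Date, Time, Description, IP Address,
    Host Name, MAC Address, ...
    """
    wanted = normalize_mac(mac)
    hits = []
    for row in csv.reader(lines):
        # Data rows start with a numeric event ID; header and legend lines do not.
        if len(row) < 7 or not row[0].strip().isdigit():
            continue
        if normalize_mac(row[6]) == wanted:
            hits.append({
                "id": int(row[0]),
                "date": row[1],
                "time": row[2],
                "description": row[3],
                "ip": row[4],
                "host": row[5],
            })
    return hits
```

Opening each server's DhcpSrvLog file and passing the file object as `lines` would list every event for one workstation, which makes the build VLAN versus user VLAN difference easy to show to another team.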
caspianjvc@reddit
Has the network team done anything recently? Get them to send you a copy of the switch config.
St0nywall@reddit
Try verifying that DHCP works by using this tool, DHCPfind:
https://roadkil.net/program.php?ProgramID=10&Action=NewOSID&DownloadVersion=12
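If installing a third-party tool is a problem, the same idea can be roughed out in Python: build a DHCPDISCOVER by hand, broadcast it, and see whether anything answers. This is a sketch under assumptions, not a replacement for the tool. The function names are made up, the packet follows the standard BOOTP/DHCP layout, and `send_discover` needs admin rights to bind UDP port 68 and may clash with the OS's own DHCP client.

```python
import socket
import struct
from typing import Optional

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of DHCP options


def build_discover(mac: bytes, xid: int) -> bytes:
    """Build a minimal DHCPDISCOVER for a 6-byte Ethernet MAC."""
    if len(mac) != 6:
        raise ValueError("mac must be exactly 6 bytes")
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: BOOTREQUEST
        1,            # htype: Ethernet
        6,            # hlen: MAC length
        0,            # hops
        xid,          # transaction ID
        0,            # secs
        0x8000,       # flags: ask the server to broadcast its reply
        b"\0" * 4,    # ciaddr
        b"\0" * 4,    # yiaddr
        b"\0" * 4,    # siaddr
        b"\0" * 4,    # giaddr
        mac,          # chaddr (struct pads it to 16 bytes)
        b"",          # sname
        b"",          # file
    )
    options = bytes([53, 1, 1]) + bytes([255])  # message type = DISCOVER, end
    return header + MAGIC_COOKIE + options


def send_discover(packet: bytes, timeout: float = 5.0) -> Optional[str]:
    """Broadcast the packet; return the source IP of a matching reply, if any."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.settimeout(timeout)
        s.bind(("", 68))
        s.sendto(packet, ("255.255.255.255", 67))
        try:
            data, (sender, _port) = s.recvfrom(4096)
        except socket.timeout:
            return None
        # Only count a reply that echoes our transaction ID.
        return sender if data[4:8] == packet[4:8] else None
```

Running it from a port on the build VLAN and then from one on the user VLAN should show whether an offer ever makes it back on each. Note that through a relay the reply's source may be the relay rather than the DHCP server itself.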
xxbiohazrdxx@reddit
Almost assuredly something with your nac/802.1x.
Check certs and your switch logs to see if ports are being security disabled
Jet_mech91@reddit (OP)
The device certs are in place on newly imaged devices and look good from what I can tell. I'll see if the network team can pull anything from the logs on Monday regarding device authentication.
If 802.1x was the issue, I'd reckon getting assigned a static IP wouldn't fix DHCP afterwards. The imaged devices are domain joined and pulling GPOs successfully. Though to be honest, I only have surface-level knowledge regarding 802.1x implementation.
LaxVolt@reddit
We had an issue with 802.1x not authenticating at the machine level. Our problem was that we still use MS-CHAP. In our case, if a user was logged in when connecting to the network, they would authenticate properly. However, if a user was disconnected or a new computer was placed on the network, they couldn’t log in. I ended up having to modify a policy with GPO to fix it. I can’t remember the exact policy and I’m on vacation at the moment.
Lower-Restaurant-272@reddit
KB5077181 caused us headaches similar to this. It was the February update and this aligns to your timeframe.
gptbuilder_marc@reddit
A DHCP scope that breaks only on new machines moving from the build VLAN to the user VLAN after years of working is a very specific failure mode. The timing of when it broke matters as much as what broke. What changed in the environment last week?
Jet_mech91@reddit (OP)
The only thing we know changed after asking around (none of the other teams follow any sort of official change management process so there's no reliably accessible log for us) is a GPO change at the end of the week prior that enabled Windows firewall on our servers. We've mostly dismissed this in the short term as any changes to that would need to wait for a maintenance window, but I've wondered what context DHCP broadcasts are registered as on the firewall. Everything is allowed inbound for now on the domain profile, but not the private and public profiles.
With any luck, I can get approval for a temporary wireshark installation to get a better look at the traffic. That said, I realize I neglected to check the firewall logfiles on the DHCP servers.
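For the firewall logfile check, a rough Python sketch along these lines should do. It assumes dropped-packet logging is turned on for the relevant profile and that the log uses the space-separated W3C-style layout with a `#Fields:` header (date, time, action, protocol, src-ip, dst-ip, src-port, dst-port, ...). The function name is invented for illustration.

```python
from typing import Dict, Iterable, List

# Fallback column order if the file has no "#Fields:" header line (assumed).
DEFAULT_FIELDS = [
    "date", "time", "action", "protocol",
    "src-ip", "dst-ip", "src-port", "dst-port",
]


def dropped_dhcp(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return log records for dropped UDP packets aimed at port 67 or 68."""
    fields = DEFAULT_FIELDS
    hits = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # Use the file's own column order when it declares one.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue
        rec = dict(zip(fields, line.split()))
        if (
            rec.get("action") == "DROP"
            and rec.get("protocol") == "UDP"
            and rec.get("dst-port") in {"67", "68"}
        ):
            hits.append(rec)
    return hits
```

Any hit with a source address in the user VLAN's relay range would point at the firewall profile rather than the network path.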
geegol@reddit
Potentially a conflict is happening within DHCP. However, without the network documentation there is only so much we can provide.
sembee2@reddit
I had this a couple of weeks ago and it was corruption in the DHCP database. It only showed when I restarted the DHCP service and the scope was full of bogus leases (bad characters). The event log was right that the scope was full.
irsyacton@reddit
Does the dhcp server do conflict detection? Also, what if someone set an IP from the dhcp scope range as a manual IP on their device, but didn’t set up the static lease/reservation?
rivkinnator@reddit
You're going to have to talk with your team because that is the next step so you can see what's actually happening on the network and look at the transactions that are taking place.
Wonder_Weenis@reddit
you should really really be blocking outbound private ranges