Bizarre problem? Resetting Ethernet on one Endpoint fixes every Endpoint.

Posted by hetter12x@reddit | sysadmin | 59 comments

Hello,

I started my work as a sysadmin around 1.5 years ago. To this day, I haven't run into any bigger problems I couldn't fix; however, this one, to me, is not logical in any sense whatsoever. Description:

My company has a network with a Fortigate, endpoints, a VPN set up to connect other departments to our main LAN, and a VPN connecting us to our subcontractor's network (so we can access their apps through the web). Everything was fine, all the policies set, working flawlessly.

One day everyone lost access (ERR_ADDRESS_UNREACHABLE). My first thought was that the subcontractor had some issue, so I called, and everything was fine on their end. Then I went through the Fortigate logs and saw that all the traffic to their network is accepted and passes. However, one thing caught my eye that I hadn't seen before: an attempt to connect to any of their sites sends 100+ MB and receives 4-6 GB. I tried changing policies, resetting the Fortigate, and other fixes that came to mind, and the dumbest idea worked: I turned my Ethernet adapter off and on, and it worked.

I was about to write a script and run it on every PC, but then I got a call that everything works now. So it appears that resetting the Ethernet adapter on one PC fixes the problem on every computer in the network. What's even weirder, the problem appears again after 10-15 minutes.

I suppose something clogs up the connection? But it's weird, because the problem only appears when connecting to said subcontractor's network; every other site (that workers are allowed to visit) works flawlessly, and our internal webserver works without a problem too. And the worst part is that the issue is so specific I have no clue where to look for solutions.

If you know what might be the cause and how to fix it permanently, let me know. Thanks in advance!

[-]

AniBMagal@reddit

This smells like STP.

[-]

catherder9000@reddit

Proxy ARP is on between you and the contractor. Turn it off so they stop flooding your network with ARP requests.

[-]

neploxo@reddit

Like others have said, the burst of packets points to a loop condition without spanning tree running to prevent it. Some other data gathering will be helpful.
1) How many endpoints do you have?
2) What kind/how many switches do you have? Are they connected with multiple links?
3) Are you using VLANs?
4) Are you using spanning tree?

Follow OSI here. First look at all connections. Is the network small enough that you can unplug everything but one system and verify the problem is gone then start plugging in one at a time until it starts again? The problem might not be physical, but this might be the quickest way to isolate where the problem is.

It really sounds like you've got a loop of some kind and spanning tree is causing a blocked port. Every time you unplug something the switch could be triggering a new spanning tree reconfiguration and making it temporarily work until the loop is detected again and the block is reasserted. Normally this would only happen with multiple switches connected together, but it could potentially happen with a faulty PC doing something weird in multiple VLANs.
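
For anyone following along: a "loop" here is just a cycle in the graph of switch-to-switch links. A rough sketch of checking a hand-written link list for one is below. The switch names are made up for illustration, and the function name is hypothetical; it says nothing about what STP is actually doing on the real gear.

```python
def first_redundant_link(links):
    """Return the first link that closes a cycle, or None if the topology is loop-free.

    links: list of (switch_a, switch_b) pairs, one per physical cable.
    Uses a simple union-find: if both ends of a cable are already
    connected through earlier cables, that cable creates a loop.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return (a, b)
        parent[root_a] = root_b
    return None


# Hypothetical cabling: three switches wired in a triangle.
cables = [("fsw1", "fsw2"), ("fsw2", "zyxel1"), ("zyxel1", "fsw1")]
print(first_redundant_link(cables))
```

A cable with both ends in the same switch, or two parallel cables between the same pair of switches (unless they are configured as an aggregate), also shows up as a redundant link here.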

[-]

hetter12x@reddit (OP)

1) 50 endpoints, 2 servers
2) 2 Fortiswitches + 2 Zyxels + a Cisco switch, plus 1 Fortiswitch and 1 Zyxel in each department
3) Yes
4) No idea what spanning tree is, can you elaborate? I'm not the first sysadmin here, so it might already be configured and I just don't know how to check

[-]

neploxo@reddit

Spanning tree is an automated protocol switches use to detect and prevent loops in the network. You might see it abbreviated as STP, or as related terms like PVST (per-VLAN spanning tree). Look at your switch configs for commands like 'spanning-tree portfast'. If you see that on any port connected to another switch, I recommend you disable the command ('no spanning-tree portfast' is the Cisco command).
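
If the config is long, a small script can list the ports that have the command. This is only a sketch, assuming a Cisco IOS-style text export where each stanza starts with "interface ..." and its sub-commands are indented; the sample config and the function name are made up. You still have to check by hand which of the listed ports go to another switch.

```python
def portfast_interfaces(config_text):
    """List interfaces whose stanza contains 'spanning-tree portfast'."""
    found = []
    current = None  # interface stanza we are currently inside, if any
    for raw in config_text.splitlines():
        line = raw.strip()
        if raw.startswith("interface "):
            current = line.split(None, 1)[1]
        elif line == "!" or (raw and not raw[0].isspace()):
            current = None  # stanza ended
        elif current and line.startswith("spanning-tree portfast"):
            # 'no spanning-tree portfast' does not match and is skipped
            if current not in found:
                found.append(current)
    return found


# Made-up sample, not taken from any real device.
SAMPLE_CONFIG = """\
interface GigabitEthernet0/1
 description uplink to other switch
 spanning-tree portfast
!
interface GigabitEthernet0/2
 description PC
 no spanning-tree portfast
!
interface GigabitEthernet0/3
 spanning-tree portfast trunk
"""

print(portfast_interfaces(SAMPLE_CONFIG))
```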

[-]

Sagail@reddit

Another thought: physical switches aren't the only things that do STP. Virtual switches for Docker and VMs can too.

[-]

Sagail@reddit

Yeah, I agree it's STP related. Perhaps they have two root switches.

[-]

mountain_bound@reddit

Check whether loop guard and STP BPDU guard are enabled on the Fortiswitch. Ten minutes is a totally viable setting; 5 is the default.

[-]

luminousfleshgiant@reddit

You need to do a packet capture during the break and during the fix and see what is actually going on on your network. There have already been many plausible causes listed in this thread that you've dismissed without evidence.

[-]

VviFMCgY@reddit

Sounds like spanning tree

[-]

Gloomy_Car_53@reddit

smells like a switch loop

[-]

sitesurfer253@reddit

Does the IP of the adapter you keep flipping also line up with a very important IP, like a gateway or something?

The DHCP scope you are using might overlap with an important service that is static.

[-]

Competitive_Run_3920@reddit

This was my thought as well. Maybe at some point the subnet was expanded and the gateway IP is still available in the DHCP pool.

[-]

hetter12x@reddit (OP)

I forgot to mention: if I restart the adapter on any PC in the network it works, not only mine. The gateway points to the Fortigate, same for DNS (secondary DNS is our Domain Controller). IPs are statically set on every endpoint and server, no DHCP.

[-]

opotamus_zero@reddit

No DHCP at all? With the high utilization you mentioned and this, I wonder if you're inadvertently ARP tunnelling the subcontractor, or getting some other ARP snafu from the VPN client. Sniff ARP packets, run arp -a, etc., and see if you find any strange who-has requests or replies.

[-]

billnmorty@reddit

My thoughts here as well. Check IP scopes, but also, did anyone possibly install a DHCP server on a machine that’s taken one of those important IPs? A VM that got spun up to run local Claude or some other silly AI thing?

[-]

hetter12x@reddit (OP)

No, we strictly prevent any new installs, modifications or even visiting websites that aren't on a whitelist (so only the things needed for work).

[-]

Loveangel1337@reddit

To check if it's that or not, I would try resetting the adapter on another computer next time, to see whether the important bit is the reset itself or that specific adapter.

I was thinking some sort of broadcast storm situation, but I don't see how an adapter reset would end it.

[-]

Mac-Gyver-1234@reddit

Some people have old consumer Ethernet switches with low MAC addresses, and the MAC address is part of the (Rapid) Spanning Tree Protocol bridge ID. The lower the bridge ID, the more likely that switch becomes the root bridge of the whole layer 2 network if security measures are missing.

It happened to me that some random dude plugged in their 5-port 100 Mbit/s home Ethernet switch to allow more people in the room to use the network. It led to a site-wide network outage, as the whole network was rerouting over the tiny switch.
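
To make the election rule concrete, here is a minimal sketch. It assumes the classic bridge ID layout (priority first, then MAC address, lowest value wins, default priority 32768) and ignores per-VLAN details; the switch names and MAC addresses are invented.

```python
def bridge_id(priority, mac):
    """Bridge ID as a comparable tuple: priority first, then MAC as an integer."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the name of the bridge with the lowest bridge ID.

    bridges: dict of name -> (priority, mac)
    """
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Both switches left at the default priority: the lower MAC wins.
bridges = {
    "core-switch": (32768, "a0:36:9f:00:00:01"),
    "old-desk-switch": (32768, "00:0f:66:00:00:01"),
}
print(elect_root(bridges))  # the old desk switch becomes root

# Lowering the priority on the core switch makes it win regardless of MAC.
bridges["core-switch"] = (4096, "a0:36:9f:00:00:01")
print(elect_root(bridges))
```

This is why pinning a low priority on the switch you actually want as root is the usual protection against a stray small switch taking over.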

[-]

blahblahcat7@reddit

I would actually like to see how you are subnetted, or if you're using VLANs. Are all of the departments on the same network? Is this the same network that your LAN is on?

[-]

NsRhea@reddit

Subcontractor network probably has proxy ARP enabled when it shouldn't be.

Endpoints use ARP to find the MAC addresses of other devices on the same network. If an endpoint wants to talk to a device outside its subnet, it doesn't ARP for that device; it just sends the packet to the gateway.

When you have proxy ARP on, it flips this around. A router or firewall with proxy ARP enabled is still listening for local ARP requests. If it sees an endpoint ARPing for an address that isn't local but that the router knows, it essentially lies: it replies to the ARP request with its own MAC address. The endpoint, thinking it's talking to the correct endpoint, is instead sending its traffic directly to the router's MAC.

When someone tries to access the subcontractor network then this loop is formed. An endpoint is connected to the router which lies and all data is being sent to the wrong device.

When you reset the network connection, a GARP (gratuitous ARP) is sent out. It tells the other devices "Hey, my IP is X and my MAC is Y," which fixes your problem, temporarily.

The problem returns because you're still looping and eventually those false ARPs take over.

Most OS and firewall defaults hold ARP entries for 10-20 minutes, and switches default to 15 minutes for MAC tables, which is why the problem returning after 10-15 minutes caught my attention.
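
A toy model of that "fixed for a while, then broken again" pattern, in case it helps. It is a deliberately simplified cache (last answer wins, entries expire after a TTL), not how any particular OS implements ARP. The 900-second TTL is just an assumed value inside the 10-20 minute range mentioned above, and the IP and MAC values are made up.

```python
class ArpCache:
    """Simplified ARP cache: the most recent answer wins, entries expire."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        """Record an ARP reply or gratuitous ARP seen at time `now`."""
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now):
        """Return the cached MAC, or None if unknown or expired."""
        entry = self.entries.get(ip)
        if entry is None or now - entry[1] > self.ttl:
            return None
        return entry[0]


GOOD_MAC = "aa:aa:aa:aa:aa:aa"   # hypothetical correct owner
BOGUS_MAC = "bb:bb:bb:bb:bb:bb"  # hypothetical device answering on its behalf

cache = ArpCache(ttl_seconds=900)
cache.learn("10.0.0.1", GOOD_MAC, now=0)     # gratuitous ARP after an adapter reset
cache.learn("10.0.0.1", BOGUS_MAC, now=600)  # a later bogus reply overwrites it
print(cache.lookup("10.0.0.1", now=610))     # traffic now goes to the wrong MAC
cache.learn("10.0.0.1", GOOD_MAC, now=620)   # another reset "fixes" it again
print(cache.lookup("10.0.0.1", now=630))
```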

[-]

Adam_Kearn@reddit

Sounds to me like DHCP leases expiring, and you have the scope options configured to use the IP of this workstation, etc., which is conflicting.

[-]

Sagail@reddit

Frankly, not enough data points. You flipping your int to up once is not conclusive. Have you been able to repeat this numerous times?

[-]

hetter12x@reddit (OP)

Yes, I mentioned it appears again after 10-15 minutes. All I have to do to fix it is the same thing: reset the Ethernet adapter on any PC in the network.

[-]

Sagail@reddit

What happens if you change the MAC on your interface? MAC collisions do happen. It's rare though.

[-]

Sagail@reddit

Oh you said any interface sorry I missed that.

It def seems layer 2 related, perhaps spanning tree related. Like you have 2 STP roots, maybe.

[-]

Godcry55@reddit

Layer 2 issue. Investigate switch ports thoroughly.

[-]

HappyVlane@reddit

You don't say what doesn't work.

Can you reach your gateway?
Does DNS work?
Does internal traffic work?
Does traffic in the same broadcast domain work?
Does traffic leave the gateway?
Can the FortiGate reach resources? etc.

There is very little actual information here.

[-]

hetter12x@reddit (OP)

I didn't explicitly say that it only affects the subcontractor's network. I thought it was obvious from context, but now when I look back it may not be that clear, my apologies. It only happens when connecting to the subcontractor's network. We don't use DNS for it, just plain IPs, but yes, overall DNS works (the Fortigate handles it).

Internal traffic works. If I understand "same broadcast domain" correctly, then yes, it does. Traffic leaves and comes back (at least that's what I see in the Fortigate logs).

[-]

HappyVlane@reddit

If it only applies to one network, one that isn't under your jurisdiction at that, then you need to communicate with the subcontractor too.

Look at the actual traffic flow both during an OK period and a non-OK period (traffic sniffer on the FortiGate and Wireshark on a client). That should tell you relatively quickly how traffic behaves.

[-]

Anthropic_Principles@reddit

Question #1 What changed?

That's it really. Something must have changed, either at your end or most likely at the subcontractor's end.

Question #2 Does the subcontractor have any other clients with the same issue?

[-]

hetter12x@reddit (OP)

Well, the only thing that comes to mind is Fortigate updates, which we do as soon as they release them (so around once a month). It was more than a week since we did the last one, and the issue popped up only recently. I also started hosting an internal webserver with Docker + nginx, but I don't think it interferes with anything.

[-]

Tripl3Nickel@reddit

Disconnect your web server - does the problem go away / stay away?

[-]

hetter12x@reddit (OP)

It doesn't change anything unfortunately

[-]

ZealousidealTurn2211@reddit

I wouldn't exclude the possibility of a FortiBug, I've seen some really inexplicable behavior from our Fortigate. The specifics escape me (I'm not our network team I just noticed and reported the issue and helped troubleshoot it), but the example that comes to mind was if a specific link was configured with its correct subnet mask the appliance suddenly routed all traffic through that link.

[-]

Pusibule@reddit

That's interesting.

A lot of things could be the culprit, even outside the adapter you reset to fix it.

If it is not an evident IP misconfiguration or conflict, just use Wireshark to see what is really happening in the network to get a better idea.

If your switches are managed, take a look at the MAC address table of everything involved, all the way to the router.

And don't forget to check that there aren't any cabling shenanigans, like forgotten dumb switches and loops.

[-]

hetter12x@reddit (OP)

Cables seem to be all fine and unchanged. I will be running Wireshark as soon as I can, since I've got lots of work to do.

[-]

ITRabbit@reddit

Do your computers have both wireless and Ethernet? Can you disable the wireless on the computer? Sometimes it can bridge 2 networks.

[-]

hetter12x@reddit (OP)

Every wireless adapter is turned off + a GPO to block wireless connection

[-]

Vegetable-Ad-1817@reddit

Spanning tree, but the traffic is odd. It could also be MSS being too small, but it sounds more like spanning tree, a "resets after 300 seconds" kind of thing. Wireshark the LAN and see what comes up.

[-]

Desnowshaite@reddit

Based on your description it is not that but it sounds similar to having a circular connection on the network. Similar happened on our network when a tech found an unconnected network cable while troubleshooting a computer and plugged it in not knowing the other end of the same cable was already plugged in to the same switch. It flooded the network with excessive traffic. In that case disconnecting the cable was not enough, the switch also had to be restarted to return to normal.

[-]

Mindestiny@reddit

My first guess as well. Sounds like something's creating a network loop.

[-]

WrongUn@reddit

This ☝🏻. I've seen something similar to what OP describes a few times in my career, almost always because someone inadvertently created a loopback.

[-]

Prestigious-Board-62@reddit

This smacks of an IP address conflict. Sounds like something is taking the gateway IP address, or maybe some misconfigured device is sending ARP probes that look like an IP conflict to other hosts on the network.

The reason reseating the cable works is that the first thing a host will do when it's plugged in is ARP for its gateway, which will update the ARP cache for everything else on the network. ARP is broadcast, remember.

Next time instead of reseating a cable, clear the ARP cache on a host and try to ping the gateway. If you confirm IP conflict this way, watch for the 10-15 minutes for it to happen again and check the ARP cache for the MAC address causing the conflict.
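
One way to make that comparison less error-prone is to save the `arp -a` output once while things work and once while they are broken, then diff the two. A rough sketch, assuming each useful line contains one IPv4 address and one MAC address (true for the usual Windows and Linux output styles, but check yours); the sample addresses are made up.

```python
import re

IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
MAC_RE = re.compile(r"\b((?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2})\b")


def parse_arp(text):
    """Turn saved `arp -a` output into {ip: mac}, normalising MAC format."""
    table = {}
    for line in text.splitlines():
        ip, mac = IP_RE.search(line), MAC_RE.search(line)
        if ip and mac:  # header lines without a MAC are skipped
            table[ip.group(1)] = mac.group(1).lower().replace("-", ":")
    return table


def changed_entries(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs whose MAC differs between snapshots."""
    old, new = parse_arp(before), parse_arp(after)
    return {ip: (old[ip], new[ip]) for ip in old.keys() & new.keys() if old[ip] != new[ip]}


# Made-up snapshots in the Windows output style.
WORKING = """\
Interface: 192.168.1.10 --- 0xb
  192.168.1.1           00-11-22-33-44-55     dynamic
  192.168.1.20          00-aa-bb-cc-dd-ee     dynamic
"""
BROKEN = """\
Interface: 192.168.1.10 --- 0xb
  192.168.1.1           66-77-88-99-aa-bb     dynamic
  192.168.1.20          00-aa-bb-cc-dd-ee     dynamic
"""

print(changed_entries(WORKING, BROKEN))
```

If the gateway's entry flips to a different MAC during the broken period, that MAC is the one to chase in the switch MAC address tables.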

[-]

tommy-turtle@reddit

It might be a long shot, but I had a similar problem to this, where one PC would kill the network, and it turned out that its network card would randomly lose its MAC address and default to 00:00:00:00:00:00, so traffic from that endpoint was literally being rebroadcast across the whole LAN. Took a bit of work to find it because it was so intermittent, but it was only when I saw the SMB packets of a random Word document getting saved as a broadcast in Wireshark that the penny dropped.

[-]

hetter12x@reddit (OP)

I just checked, and the MAC seems to be fine. I added info above: if I restart the adapter on any PC in the network it works, not only mine. And on most PCs I checked, the MAC stays the same.

[-]

USarpe@reddit

At this point you should start to know and not to guess. Make a trace with Wireshark; rolling the dice to guess what it is wastes time and energy.

[-]

Cyber_Faustao@reddit

I'd investigate IP address conflicts (seriously re-think your no-DHCP policy), IP range conflicts, and Layer 2 loops (including ones involving the VPN itself, if that is also Layer 2 and not Layer 3). Also, is STP properly set up on both networks (in case of a Layer 2 VPN)?

If it is not a configuration error, then I'd probably wager on a faulty switch or faulty switch port somehow causing this.

Does the workaround you found reliably fix the issue? If so, try moving that one laptop to a dedicated VLAN and see what happens to the rest of the network.
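
For the IP range conflict part, a quick sanity check is to write down every subnet involved (local LAN, each department VPN, the subcontractor side) and test them pairwise. A minimal sketch with Python's standard ipaddress module; the names and ranges below are placeholders, not the OP's real addressing.

```python
import ipaddress
from itertools import combinations


def overlapping(networks):
    """Return pairs of network names whose address ranges overlap.

    networks: dict of name -> CIDR string
    """
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# Placeholder ranges for illustration only.
ranges = {
    "main LAN": "192.168.0.0/23",
    "department VPN": "192.168.1.0/24",
    "subcontractor": "10.20.0.0/16",
}
print(overlapping(ranges))
```

Any pair it reports is a place where a host could treat a remote address as local (and ARP for it) instead of sending the traffic to the gateway.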

[-]

dylwig@reddit

How bizarre. You could start a Wireshark capture, let the failure happen, reset the adapter, then let it happen again? That’s a needle in a stack of needles, but I wonder if you’ll start seeing resets, SYNs, etc. As for the vendor’s “everything’s fine on our end”, without an explanation of what “fine” means, that doesn’t give me much confidence either. Sometimes just asking for details of what they checked can get you in front of the person responsible for checking. You could also use diag sys session list (or whatever your FGT firmware’s CLI flavor is) and filter to your vendor’s tunnel, though that’s still an ugly way to diagnose.

I assume that is a site-to-site between you two, and probably your LAN is allowed to a few specific addresses on their side? That 4-6 GB is eye-catching for sure. Resetting a NIC fixing the issue and the recurrence after 10-15 minutes sounds like a phase 2 rekey or renegotiation, or maybe a MAC/ARP table flush. Good luck, I’d be interested in what you find!

[-]

hetter12x@reddit (OP)

Gonna play around with logs today a bit more, will let you know what I find. We ordered a consultation from them, as even if it's a problem on our part, they are definitely much more experienced than our IT department.

[-]

USarpe@reddit

First get the IP of your device and then check if something else uses it. Change your IP. And where does it come from?

[-]

hetter12x@reddit (OP)

IPs are statically set for everything in the network, we don't use DHCP. So there is definitely no conflict in IP addresses.

[-]

sambodia85@reddit

Not using DHCP would make it more likely you have conflicts.

[-]

hetter12x@reddit (OP)

Probably for a bigger company that would be true, but it's a small one and I checked everything: no conflict of IPs anywhere.

[-]

sambodia85@reddit

I disagree, DHCP saves so many headaches in the long term, there is no size too small for DHCP.

But I don’t think it’s your problem.

I’m thinking it’s a layer 2 issue, like a broadcast storm or ARP poisoning, like proxy ARP playing up. I’d be looking at the MAC address tables in the switch to see what it thinks everything is.

Only other thought was NAT port exhaustion, but I would think that would take a little while longer to resolve than resetting a NIC.

[-]

Kal0psia_@reddit

Similar issue but totally different tech setup. At home, I'm running a Dell dock, the same one used in the office. I randomly started having issues with my home network (wired and wireless), and the whole network would go down.

In the end, I unplugged the Ethernet on my Dell dock randomly and instantly the whole network came back online. Plugged back in, network down again. Dodgy Dock/port, replaced the unit and has been fine.

Still not sure of the root cause, no updated drivers or firmware fixed it. Just replaced it and moved on but was also the first time I've experienced that type of issue.

[-]

hetter12x@reddit (OP)

Happens on every PC, so I don't think that's the issue. They are a mix of the same and different AIO models, and every one behaves the same.

[-]

Anxious-Community-65@reddit

Resetting one Ethernet adapter fixes it because it flushes the ARP table and forces a fresh connection state, which temporarily clears whatever is saturated. My suspicion is a routing loop or an asymmetric route between your Fortigate VPN and the subcontractor's network. Traffic goes out clean but the return path sends it back amplified. Check if there are any duplicate or overlapping routes on the Fortigate for that subnet, and ask the subcontractor if they have any multicast- or broadcast-heavy services running.