How long does it take you to diagnose a network issue when your monitoring tool isn’t showing you why?
Posted by MilesAndMaps15@reddit | sysadmin | 32 comments
Curious how others deal with this scenario:
Something is clearly wrong on a client's network: the client is complaining and performance is degraded, but the dashboard shows everything green or gives you no useful diagnostic info.
Specific example I keep hearing about: switch CPU or memory pegging but Meraki not surfacing those metrics, so you're flying blind until TAC escalates it.
How are you currently diagnosing these situations? Are you pulling local syslogs manually, using third-party tools, or just opening TAC cases and waiting?
And how long does it typically take to go from "something is wrong" to "I know exactly what's wrong and how to fix it"?
Not selling anything, trying to understand if this is a common pain point or just a few edge cases. Thank you!
Dylantjes@reddit
That’s where I stop listening 🖐️
I don’t want second-hand conclusions; I want evidence. What are you basing this on? What are you seeing? Where are you seeing it? Can you show me? Too often I’ve realized too late that I was fixing a user or skill issue instead of an actual technical problem.
Okay, show me the metric.
Either we improve monitoring by adding missing metrics, or we avoid wasting time chasing assumptions.
nuditarian@reddit
Can you ping it?
Can you ping 8.8.8.8?
Is it DNS?
Speed test?
Cable check?
term mon . . . wait . . .
Is it spanning tree?
Is it a switch loop and you don't use spanning tree?
Those take about 5 minutes to confirm.
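For what it's worth, the first three of those checks are easy to script. A minimal Python sketch (the gateway address and test hostname are placeholders, and the ping flags assume Linux):

```python
# Quick triage: can we ping the gateway, can we ping out, does DNS resolve?
import socket
import subprocess

def can_ping(host: str, count: int = 3) -> bool:
    # -c/-W are Linux ping flags; adjust for other platforms
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "2", host],
        capture_output=True,
    )
    return result.returncode == 0

def dns_resolves(name: str) -> bool:
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

checks = [
    ("ping default gateway", lambda: can_ping("192.168.1.1")),  # placeholder gateway
    ("ping 8.8.8.8", lambda: can_ping("8.8.8.8")),
    ("is it DNS?", lambda: dns_resolves("example.com")),
]
for label, check in checks:
    print(f"{label}: {'OK' if check() else 'FAIL'}")
```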
Good_Ingenuity_5804@reddit
Great question. A few jobs ago I worked at a large F500 firm. Huge IT team, but most people specialized in something narrow. It took forever to find, document, and fix the issue. The in-house expert had to fight not only the tech but also the bureaucracy, since the corporate network team and the firewall team did not get along. Finally, after getting the teams together and coordinating all the vendors involved, it turned out to be some obscure config error, a true bug, that caused Cisco and Aruba access points to drop packets. I don't recall the details; the guru needed to run Wireshark to find it.
TheBestHawksFan@reddit
Having separate firewall and network teams is so foreign to me, and I worked at an F500 in IT for several years. I know it's not uncommon. It's still crazy.
MilesAndMaps15@reddit (OP)
That story is wild, 6 months for what turned out to be a vendor interop bug.
Quick question: do you think having better diagnostic data earlier would have shortened that timeline, or was the real bottleneck the politics between teams and getting everyone in the same room?
vivkkrishnan2005@reddit
30 minutes. The link went down due to 60 days of non-payment, but traffic was still moving, just very slowly: a failover issue with the wired link. Resolved.
derpaderpy2@reddit
I agree with what someone said: reboot shit. I've had a few network issues resolved recently by rebooting switches or firewalls. After that... pings to everything. I use PingInfoView to ping all switches and firewalls. If you see random packet loss that kinda comes and goes, it's probably a loop, which a lot of monitoring doesn't clock. Then look at LLDP first, then MAC tables, to see if any device is plugged in twice.
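A rough sketch of that ping-sweep approach in Python, for anyone not on Windows for PingInfoView (the device IPs are placeholders; intermittent loss, not 0% or 100%, is the loop tell):

```python
# Ping every switch/firewall repeatedly and flag hosts with loss that
# "comes and goes", which steady-state monitoring often misses.
import subprocess

DEVICES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical switch/firewall IPs
ROUNDS = 20

loss = {ip: 0 for ip in DEVICES}
for _ in range(ROUNDS):
    for ip in DEVICES:
        reply = subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],  # Linux ping flags
            capture_output=True,
        )
        if reply.returncode != 0:
            loss[ip] += 1

for ip, missed in sorted(loss.items(), key=lambda kv: -kv[1]):
    pct = 100 * missed / ROUNDS
    tag = "  <-- intermittent loss, check LLDP/MAC tables for a loop" if 0 < pct < 100 else ""
    print(f"{ip}: {pct:.0f}% loss{tag}")
```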
scobywhru@reddit
ASICs forward traffic; the CPU is used for other tasks.
Port statistics and changes in traffic type or volume would likely provide more information. If you're on current firmware, the MS will show dropped management plane traffic; higher than expected numbers there could point to devices behaving badly, or to loops/floods.
Even if you peg the CPU/RAM, it could just be the volume of traffic and the functions in use; on any platform, getting statistics for that can vary.
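The dashboard aside, Meraki's API does expose per-port status with error/warning strings. A hedged sketch using the official meraki Python library (the API key and serial are placeholders, and the exact response fields can differ by firmware and API version, so verify against your own output):

```python
# Pull live per-port statuses for one MS switch and surface anything flagged.
import meraki  # pip install meraki

API_KEY = "your-api-key"    # placeholder
SERIAL = "Q2XX-XXXX-XXXX"   # placeholder switch serial

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)
statuses = dashboard.switch.getDeviceSwitchPortsStatuses(SERIAL, timespan=3600)

for port in statuses:
    if port.get("errors") or port.get("warnings"):
        print(port.get("portId"), port.get("errors"), port.get("warnings"))
```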
MilesAndMaps15@reddit (OP)
That's helpful context on the ASIC architecture.
So if port statistics and traffic anomalies are the real signal, how are you currently surfacing those in real time when something goes wrong? Is the Meraki dashboard fast enough to catch a loop as it's developing, or are you always chasing it after the fact?
scobywhru@reddit
Depends on what you are trying to do. A lot can be mitigated with design: DHCP guard, STP port settings, BPDU guard, port flood settings. The architecture of the network matters as well: what the core is, L3 vs L2, interface speeds, how much traffic is north-south vs east-west. Most people neglect things like pruning VLANs, especially when they run Meraki, but pruning broadcast domains limits the scope of issues, or avoids them altogether.
Once you run into an issue, the job is identifying what is happening. Port stats high? Check historical usage and which port(s) started sending more traffic. MAC flapping may indicate a loop, or wireless devices moving between APs rapidly. Figure out where those MAC addresses live and the path you expect them to take.
With switches, port discards and MAC table/CAM usage are generally good to have, but they're not available on Meraki.
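On gear that does speak SNMP, those discard counters are one GET away. A sketch using the classic pysnmp synchronous API (host, community, and ifIndex are placeholders; the OID is the standard IF-MIB ifInDiscards):

```python
# Poll ifInDiscards for one interface; rising deltas between polls are the signal.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

HOST, COMMUNITY, IF_INDEX = "10.0.0.10", "public", 1   # placeholders
IF_IN_DISCARDS = f"1.3.6.1.2.1.2.2.1.13.{IF_INDEX}"    # IF-MIB::ifInDiscards

error_indication, error_status, _, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData(COMMUNITY, mpModel=1),        # SNMPv2c
    UdpTransportTarget((HOST, 161), timeout=2),
    ContextData(),
    ObjectType(ObjectIdentity(IF_IN_DISCARDS)),
))

if error_indication or error_status:
    print("SNMP error:", error_indication or error_status.prettyPrint())
else:
    for oid, value in var_binds:
        print(f"{oid} = {value}")  # compare against the previous poll
```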
tankerkiller125real@reddit
We started by throwing the Meraki devices in the trash and getting devices with proper onboard metrics available through SNMP (and a better feature set, without the insane lock-in and subscription prices).
We already use Graylog for all of our servers, network gear, etc. anyway, and all the system monitoring (SNMP, etc.) gets converted to OpenTelemetry (custom tooling) so we can ingest and alert on it just like we do our customer-facing applications.
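The SNMP-to-OpenTelemetry conversion they describe is custom tooling, but the general shape might look like this (the poll function is a stand-in for real SNMP code, and the device and metric names are made up):

```python
# Wrap an SNMP poll in an OpenTelemetry observable gauge so it exports
# alongside application metrics. Console exporter used for demonstration.
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter, PeriodicExportingMetricReader,
)

def poll_switch_cpu() -> float:
    """Placeholder: swap in a real SNMP GET against your switch's CPU OID."""
    return 42.0

def cpu_callback(options: CallbackOptions):
    yield Observation(poll_switch_cpu(), {"device": "core-sw-01"})  # hypothetical name

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("network.snmp")
meter.create_observable_gauge(
    "switch.cpu.utilization",
    callbacks=[cpu_callback],
    unit="%",
    description="Switch CPU utilization polled over SNMP",
)

time.sleep(30)  # let a few export intervals fire; readings print to stdout
```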
Neggly@reddit
Monitoring tool?
MilesAndMaps15@reddit (OP)
What tools are you running for that? Curious what the setup looks like for a team your size and whether it covers Meraki specifically or just the broader network stack.
Neggly@reddit
Joke is that I don't have one. My users are my monitoring tool.
ProfessionalEven296@reddit
If the client is the first notice you get of an error, your logging should be improved. Ideally you'll hear of the error, fix it, and deploy before the clients know there was an issue.
Solkre@reddit
Easy. I just fix DNS.
AdInevitable8483@reddit
Clearly you don't have deep observability tools or log monitoring with automatic alerts.
MilesAndMaps15@reddit (OP)
You're right. What tools are you running for that? Curious what the setup looks like for a team your size and whether it covers Meraki specifically or just the broader network stack.
AdInevitable8483@reddit
Loki + Grafana, ELK, or Graylog for gorgeous logs. Zabbix for SNMP and the deeper network layers. Real-time traffic flow analysis using ntopng. Prometheus with the SNMP exporter for metrics.
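With that stack in place, a "which ports are discarding right now" answer is one call against the Prometheus HTTP API. A sketch assuming a stock snmp_exporter setup and its ifInDiscards metric name:

```python
# Ask Prometheus which interfaces are currently discarding packets.
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"          # placeholder Prometheus address
QUERY = "rate(ifInDiscards[5m]) > 0"    # metric name from snmp_exporter's if_mib module

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]  # instant vector: [timestamp, "value"]
    print(f'{labels.get("instance")} ifIndex={labels.get("ifIndex")}: {value} discards/s')
```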
MyThinkerThoughts@reddit
Using your brain. The OSI model helps too.
sysadminbj@reddit
1. Check for stupid. Is there REALLY any performance degradation?
2. Check the ISP.
3. Reboot shit.
4. Reboot while giving the network gear the middle finger.
5. Check SolarWinds and see if I can find anything that isn't right.
6. Call my network engineers and pray to whatever nonexistent deity that they don't screw anything up worse than it already is.
7. Tell my network guys that I've already rebooted stuff... Twice.
8. Escalate to that one guy in IT leadership who knows literally everything about everything and get him to fix things.
MilesAndMaps15@reddit (OP)
Haha this is painfully accurate, especially step 8. That “one guy” is carrying the whole operation.
Genuine question though: when that person isn’t available and you’re stuck at step 3, how long does it typically take to go from “something is wrong” to actually knowing what’s wrong? Hours? Days?
And when you finally find it, is it usually a weird edge case, or just the same handful of dumb configuration/loop issues over and over?
sysadminbj@reddit
Our issues are fixed by a good reboot most of the time, I’d say 65%. 25% ends up with ISP tickets for whatever reason. 5% config issues, and the rest are the edge cases where we have to open TAC cases.
Time to resolution varies from a few minutes to multiple days depending on the severity of the issue.
MilesAndMaps15@reddit (OP)
That breakdown is really helpful, thanks.
The TAC escalation cases are what I'm most curious about. When those drag on for days, is it usually because Meraki support needs time to get to it, or because even TAC can't easily see what's wrong without physical access to the device?
sysadminbj@reddit
Well, we're not a Meraki shop, so I'm just generalizing, but our TAC escalations have pretty quick SLAs. I'm certainly not going to wait 48 hours for a TAC engineer to get assigned.
Darkhexical@reddit
25% ISP? Is the internet that bad?
sysadminbj@reddit
Eh. Might be less, but not that far off. We have A LOT of routers out there and some have shitty broadband or even shittier cell modems.
Traditional-Hall-591@reddit
I put things into the slop generator and it makes me a fresh batch. Like how a 404 error is always asymmetric routing.
discgman@reddit
Pegging sucks
Muted_Idea@reddit
Not selling anything… yet
lowlybananas@reddit
I start by troubleshooting