How long does it take you to diagnose a network issue when your monitoring tool isn’t showing you why?
Posted by MilesAndMaps15@reddit | sysadmin | 32 comments
Curious how others deal with this scenario:
Something is clearly wrong on a client's network: the client is complaining and performance is degraded, but the dashboard shows everything green or gives you no useful diagnostic info.
Specific example I keep hearing about: switch CPU or memory pegging but Meraki not surfacing those metrics, so you're flying blind until TAC escalates it.
How are you currently diagnosing these situations? Are you pulling local syslogs manually, using third-party tools, or just opening TAC cases and waiting?
And how long does it typically take to go from "something is wrong" to "I know exactly what's wrong and how to fix it"?
Not selling anything, trying to understand if this is a common pain point or just a few edge cases. Thank you!
Dylantjes@reddit
That’s where I stop listening 🖐️
I don’t want second-hand conclusions; I want evidence. What are you basing this on? What are you seeing? Where are you seeing it? Can you show me? Too often I’ve realized too late that I was fixing a user or skill issue instead of an actual technical problem.
Okay, show me the metric.
Either we improve monitoring by adding missing metrics, or we avoid wasting time chasing assumptions.
nuditarian@reddit
Can you ping it?
Can you ping 8.8.8.8?
Is it DNS?
Speed test?
Cable check?
term mon . . . wait . . .
Is it spanning tree?
Is it a switch loop and you don't use spanning tree?
Those take about 5 minutes to confirm.
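For what it's worth, the first three of those checks are easy to script. A minimal Python sketch (the gateway address and test hostname are placeholders, and the ping flags assume Linux):

```python
# Quick triage: can we ping the gateway, can we ping out, does DNS resolve?
import socket
import subprocess

def can_ping(host: str, count: int = 3) -> bool:
    # -c/-W are Linux ping flags; adjust for other platforms
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "2", host],
        capture_output=True,
    )
    return result.returncode == 0

def dns_resolves(name: str) -> bool:
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

checks = [
    ("ping default gateway", lambda: can_ping("192.168.1.1")),  # placeholder gateway
    ("ping 8.8.8.8", lambda: can_ping("8.8.8.8")),
    ("is it DNS?", lambda: dns_resolves("example.com")),
]
for label, check in checks:
    print(f"{label}: {'OK' if check() else 'FAIL'}")
```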
Good_Ingenuity_5804@reddit
Great question. A few jobs ago I worked at a large F500 firm. Huge IT team, but most people specialized in something narrow. It took forever to find, document, and fix the issue. The in-house expert had to fight not only the tech but also the bureaucracy, since the corporate network team and the firewall team did not get along. Finally, after getting the teams together and coordinating all the vendors involved, it turned out to be some obscure config error, a true bug, that caused Cisco and Aruba access points to drop packets. I don't recall the details; the guru needed to run Wireshark to find it.
TheBestHawksFan@reddit
Having separate firewall and network teams is so foreign to me, and I worked at an F500 in IT for several years. I know it's not uncommon. It's still crazy.
MilesAndMaps15@reddit (OP)
That story is wild, 6 months for what turned out to be a vendor interop bug.
Quick question: do you think having better diagnostic data earlier would have shortened that timeline, or was the real bottleneck the politics between teams and getting everyone in the same room?
vivkkrishnan2005@reddit
30 minutes. The link went down due to 60 days of non-payment, but traffic was still moving, just very slowly: a failover issue with the wired link. Resolved.
derpaderpy2@reddit
I agree with what someone said: reboot shit. I've had a few network issues resolved recently by rebooting switches or firewalls. After that... pings to everything. I use PingInfoView to ping all switches and firewalls. If you see random packet loss that kinda comes and goes, it's probably a loop, which a lot of monitoring doesn't clock. Then look at LLDP first, then MAC tables, to see if any device is plugged in twice.
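A rough sketch of that ping-sweep approach in Python, for anyone not on Windows for PingInfoView (the device IPs are placeholders; intermittent loss, not 0% or 100%, is the loop tell):

```python
# Ping every switch/firewall repeatedly and flag hosts with loss that
# "comes and goes", which steady-state monitoring often misses.
import subprocess

DEVICES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical switch/firewall IPs
ROUNDS = 20

loss = {ip: 0 for ip in DEVICES}
for _ in range(ROUNDS):
    for ip in DEVICES:
        reply = subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],  # Linux ping flags
            capture_output=True,
        )
        if reply.returncode != 0:
            loss[ip] += 1

for ip, missed in sorted(loss.items(), key=lambda kv: -kv[1]):
    pct = 100 * missed / ROUNDS
    tag = "  <-- intermittent loss, check LLDP/MAC tables for a loop" if 0 < pct < 100 else ""
    print(f"{ip}: {pct:.0f}% loss{tag}")
```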
scobywhru@reddit
ASICs forward traffic; the CPU is used for other tasks.
Port statistics and changes in traffic type or volume would likely provide more information. If you're on current firmware, the MS will show dropped management plane traffic; higher than expected numbers there could point to devices behaving badly, or to loops/floods.
Even if you peg the CPU/RAM, it could just be the volume of traffic and the functions in use; on any platform, getting statistics for that can vary.
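The dashboard aside, Meraki's API does expose per-port status with error/warning strings. A hedged sketch using the official meraki Python library (the API key and serial are placeholders, and the exact response fields can differ by firmware and API version, so verify against your own output):

```python
# Pull live per-port statuses for one MS switch and surface anything flagged.
import meraki  # pip install meraki

API_KEY = "your-api-key"    # placeholder
SERIAL = "Q2XX-XXXX-XXXX"   # placeholder switch serial

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)
statuses = dashboard.switch.getDeviceSwitchPortsStatuses(SERIAL, timespan=3600)

for port in statuses:
    if port.get("errors") or port.get("warnings"):
        print(port.get("portId"), port.get("errors"), port.get("warnings"))
```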
MilesAndMaps15@reddit (OP)
That's helpful context on the ASIC architecture.
So if port statistics and traffic anomalies are the real signal, how are you currently surfacing those in real time when something goes wrong? Is the Meraki dashboard fast enough to catch a loop as it's developing, or are you always chasing it after the fact?
scobywhru@reddit
Depends on what you are trying to do. A lot can be mitigated with design: DHCP guard, STP port settings, BPDU guard, port flood settings. The architecture of the network matters as well: what the core is, L3 vs L2, interface speeds, how much traffic is north-south vs east-west. Most people neglect things like pruning VLANs, especially when they run Meraki, but pruning broadcast domains limits the scope of issues, or avoids them altogether.
Once you run into an issue, the job is identifying what is happening. Port stats high? Check historical usage and which port(s) started sending more traffic. MAC flapping may indicate a loop, or wireless devices moving between APs rapidly. Figure out where those MAC addresses live and the path you expect them to take.
With switches, port discards and MAC table/CAM usage are generally good to have, but they're not available on Meraki.
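On gear that does speak SNMP, those discard counters are one GET away. A sketch using the classic pysnmp synchronous API (host, community, and ifIndex are placeholders; the OID is the standard IF-MIB ifInDiscards):

```python
# Poll ifInDiscards for one interface; rising deltas between polls are the signal.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

HOST, COMMUNITY, IF_INDEX = "10.0.0.10", "public", 1   # placeholders
IF_IN_DISCARDS = f"1.3.6.1.2.1.2.2.1.13.{IF_INDEX}"    # IF-MIB::ifInDiscards

error_indication, error_status, _, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData(COMMUNITY, mpModel=1),        # SNMPv2c
    UdpTransportTarget((HOST, 161), timeout=2),
    ContextData(),
    ObjectType(ObjectIdentity(IF_IN_DISCARDS)),
))

if error_indication or error_status:
    print("SNMP error:", error_indication or error_status.prettyPrint())
else:
    for oid, value in var_binds:
        print(f"{oid} = {value}")  # compare against the previous poll
```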
tankerkiller125real@reddit
We started by throwing the Meraki devices in the trash and getting devices with proper onboard metrics available through SNMP (and a better feature set, without the insane lock-in and subscription prices).
We already use Graylog for all of our servers, network gear, etc. anyway, and all the system monitoring (SNMP, etc.) gets converted to OpenTelemetry (custom tooling) so we can ingest and alert on it just like we do our customer-facing applications.
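The SNMP-to-OpenTelemetry conversion they describe is custom tooling, but the general shape might look like this (the poll function is a stand-in for real SNMP code, and the device and metric names are made up):

```python
# Wrap an SNMP poll in an OpenTelemetry observable gauge so it exports
# alongside application metrics. Console exporter used for demonstration.
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter, PeriodicExportingMetricReader,
)

def poll_switch_cpu() -> float:
    """Placeholder: swap in a real SNMP GET against your switch's CPU OID."""
    return 42.0

def cpu_callback(options: CallbackOptions):
    yield Observation(poll_switch_cpu(), {"device": "core-sw-01"})  # hypothetical name

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("network.snmp")
meter.create_observable_gauge(
    "switch.cpu.utilization",
    callbacks=[cpu_callback],
    unit="%",
    description="Switch CPU utilization polled over SNMP",
)

time.sleep(30)  # let a few export intervals fire; readings print to stdout
```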
Neggly@reddit
Monitoring tool?
MilesAndMaps15@reddit (OP)
What tools are you running for that? Curious what the setup looks like for a team your size and whether it covers Meraki specifically or just the broader network stack.
Neggly@reddit
Joke is that I don't have one. My users are my monitoring tool.
ProfessionalEven296@reddit
If the client is the first notice you get of an error, your logging should be improved. Ideally you'll hear of the error, fix it, and deploy before the clients know there was an issue.
Solkre@reddit
Easy. I just fix DNS.
AdInevitable8483@reddit
Clearly you don't have deep observability tools or log monitoring with automatic alerts.
MilesAndMaps15@reddit (OP)
You're right. What tools are you running for that? Curious what the setup looks like for a team your size and whether it covers Meraki specifically or just the broader network stack.
AdInevitable8483@reddit
Loki + Grafana, ELK, or Graylog for gorgeous logs. Zabbix for SNMP and the deeper network layers. Real-time traffic flow analysis using ntopng. Prometheus with the SNMP exporter for metrics.
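With that stack in place, a "which ports are discarding right now" answer is one call against the Prometheus HTTP API. A sketch assuming a stock snmp_exporter setup and its ifInDiscards metric name:

```python
# Ask Prometheus which interfaces are currently discarding packets.
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"          # placeholder Prometheus address
QUERY = "rate(ifInDiscards[5m]) > 0"    # metric name from snmp_exporter's if_mib module

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]  # instant vector: [timestamp, "value"]
    print(f'{labels.get("instance")} ifIndex={labels.get("ifIndex")}: {value} discards/s')
```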
MyThinkerThoughts@reddit
Using your brain. The OSI model helps too.
sysadminbj@reddit
1. Check for stupid. Is there REALLY any performance degradation?
2. Check the ISP.
3. Reboot shit.
4. Reboot while giving the network gear the middle finger.
5. Check SolarWinds and see if I can find anything that isn't right.
6. Call my network engineers and pray to whatever nonexistent deity that they don't screw anything up worse than it already is.
7. Tell my network guys that I've already rebooted stuff... Twice.
8. Escalate to that one guy in IT leadership who knows literally everything about everything and get him to fix things.
MilesAndMaps15@reddit (OP)
Haha this is painfully accurate, especially step 8. That “one guy” is carrying the whole operation.
Genuine question though: when that person isn’t available and you’re stuck at step 3, how long does it typically take to go from “something is wrong” to actually knowing what’s wrong? Hours? Days?
And when you finally find it, is it usually a weird edge case, or just the same handful of dumb configuration/loop issues over and over?
sysadminbj@reddit
Our issues are fixed by a good reboot most of the time, I’d say 65%. 25% ends up with ISP tickets for whatever reason. 5% config issues, and the rest are the edge cases where we have to open TAC cases.
Time to resolution varies from a few minutes to multiple days depending on the severity of the issue.
MilesAndMaps15@reddit (OP)
That breakdown is really helpful, thanks.
The TAC escalation cases are what I'm most curious about. When those drag on for days, is it usually because Meraki support needs time to get to it, or because even TAC can't easily see what's wrong without physical access to the device?
sysadminbj@reddit
Well, we're not a Meraki shop, so I'm just generalizing, but our TAC escalations have pretty quick SLAs. I'm certainly not going to wait 48 hours for a TAC engineer to get assigned.
Darkhexical@reddit
25% ISP? Is the internet that bad?
sysadminbj@reddit
Eh. Might be less, but not that far off. We have A LOT of routers out there and some have shitty broadband or even shittier cell modems.
Traditional-Hall-591@reddit
I put things into the slop generator and it makes me a fresh batch. Like how a 404 error is always asymmetric routing.
discgman@reddit
Pegging sucks
Muted_Idea@reddit
Not selling anything… yet
lowlybananas@reddit
I start by troubleshooting