I stopped overengineering monitoring and I’m wondering if anyone else feels the same
Posted by Important-Bug-6709@reddit | sysadmin | 18 comments
I’ve been managing a bunch of small Linux setups over the last few years, and at some point I realized I kept doing the same thing over and over again. I would start with something simple just to know if my servers were fine, and somehow it always ended up turning into a whole ecosystem of tools, dashboards, alerts, and configs I barely touched again after setting them up. The funny part is that when something actually breaks, I don’t even go to the dashboards; I just SSH into the machine and check the logs directly because it’s faster and clearer. So I started wondering if I’ve been overcomplicating this whole thing, or if this is just how everyone ends up doing it when they scale a bit. Does anyone else feel like monitoring tools slowly become something you maintain more than something you actually use?
Ssakaa@reddit
Monitor everything in a system that lets you correlate things in a coherent way, i.e. throwing together graphs of metrics against one another, etc. Alert on NOTHING that you don't have a documented need to address out of hours and a known "what to do" documented too.
I don't want to try to figure out from logs on a box I can't reach, why I can't reach it. I also don't want to try to figure out from logs on a box how an attacker that got elevated rights did so when they've now had the ability to tamper with logs. Aggregate logs somewhere you can correlate them with metrics.
And FFS, logs, systems, etc. should all be in UTC. Nothing worse than variable timezones not getting caught in log parsing.
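To make the UTC point concrete, here is a minimal Python sketch (purely illustrative, not anyone's actual config) that forces log timestamps to UTC so aggregated logs line up across hosts regardless of local timezone:

```python
import logging
import time

# Force every timestamp to UTC so logs from different hosts line up
# when aggregated centrally (illustrative sketch only).
formatter = logging.Formatter(
    fmt="%(asctime)sZ %(levelname)s %(name)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime  # use UTC instead of local time

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("disk check completed")
```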
One_Contribution@reddit
I try avoiding CPU/RAM alerts like the plague, and opt for monitoring that services and whatnots respond as expected.
Ssakaa@reddit
Response latencies are amazing metrics.
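A service-level probe like that can be as small as timing an HTTP request. A rough sketch, where the endpoint and the one-second latency budget are made up for illustration:

```python
import time
import urllib.request

def check_service(url: str, timeout: float = 5.0) -> tuple[bool, float]:
    """Probe a service endpoint and return (healthy, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            healthy = resp.status == 200
    except OSError:  # covers connection errors, timeouts, HTTP errors
        healthy = False
    return healthy, time.monotonic() - start

# Hypothetical endpoint and latency budget, for illustration only.
ok, latency = check_service("http://127.0.0.1:8080/healthz")
if not ok or latency > 1.0:
    print(f"ALERT: health check failed or slow ({latency:.3f}s)")
```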
SudoZenWizz@reddit
Have basic monitoring in place for CPU, RAM, disks, services, processes and hardware. Keep the thresholds above the state where everything works properly and adjust over time as needed. The same solution can also be used for log monitoring and metrics. This will give an overview of both the applications running there and the systems themselves.
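A rough sketch of what those basic checks might look like, assuming the third-party psutil package is available; the thresholds here are placeholders you would set above whatever "working properly" looks like and tune over time:

```python
import psutil  # third-party: pip install psutil

# Placeholder thresholds, set above the level where everything is known to work.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 90.0,
    "disk_percent": 85.0,
}

def collect_metrics() -> dict:
    """Gather a few basic host metrics."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

def check(metrics: dict) -> list[str]:
    """Return a message for every metric that crossed its threshold."""
    return [
        f"{name} at {value:.1f}% (threshold {THRESHOLDS[name]:.1f}%)"
        for name, value in metrics.items()
        if value > THRESHOLDS[name]
    ]

for problem in check(collect_metrics()):
    print("WARN:", problem)
```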
ilbicelli@reddit
I use Zabbix. I've set up templates to monitor what really matters. I check the dashboard maybe once a week. Every P1 and P2 alert flows into our ticket system.
chickibumbum_byomde@reddit
Common end-of-the-road realization: monitoring usually starts simple, then slowly grows into dashboards, exporters, alerts, databases, and tooling that mostly exists to support more tooling. Meanwhile, when something actually breaks, people still SSH into the box because it's the fastest way to understand reality, and out of habit. That doesn't mean monitoring is useless, it just means many setups drift past the point of practical value.
For a medium-sized environment, the most useful monitoring usually comes down to: knowing something is broken, having enough context to start quickly, and avoiding noise. Beyond that, it's easy to build a system that feels impressive but mostly creates maintenance work.
The important distinction is whether the monitoring helps you make decisions faster, or whether it became another thing you have to operate. A lot of homelabs and smaller infrastructures cross that line without noticing.
Attack-Chihuahua-85@reddit
I like Nagios. Call me an old man. And yeah, I messed around with log aggregators, but in my case I'm not troubleshooting a planet-scale distributed system, so it's now just an rsyslog instance that we send logs to, so I have them somewhere other than the box in question. The simplest solution is often the most elegant, which also makes me happy.
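From the application side, that off-box idea can be sketched with the standard library's SysLogHandler pointed at a central rsyslog host. The hostname is a placeholder, and this assumes the rsyslog box is actually listening on UDP 514 (imudp enabled):

```python
import logging
import logging.handlers

# Ship application logs to a central rsyslog box so they survive the host.
# "loghost.example.com" is a placeholder; UDP 514 only works if the
# receiving rsyslog has its imudp module enabled.
syslog = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
syslog.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))

logger = logging.getLogger("backup-job")
logger.addHandler(syslog)
logger.setLevel(logging.INFO)

logger.warning("nightly backup took longer than usual")
```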
naitsirt89@reddit
You're a dirty old man.
Feel better?
Attack-Chihuahua-85@reddit
My back hurts. God damned kids.
Hot-Cress7492@reddit
Here's how monitoring should work. Get your system into a known good state, scan it, and monitor everything that is important (CPU, memory, interfaces, whatever other parameters matter) to establish your baseline.
Have the alerts let you know when a metric is outside its established normal thresholds.
Tweak said alerts to minimize false positives and you’ll evolve into something that alerts you when shit is going sideways.
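A rough sketch of that baseline-then-deviate idea: alert when a metric drifts well outside its recent normal range. The window of known-good samples and the multiplier are arbitrary tuning knobs here, not recommendations:

```python
from statistics import mean, stdev

def out_of_baseline(history: list[float], current: float, k: float = 3.0) -> bool:
    """Flag `current` if it sits more than k standard deviations from the
    mean of the known-good history. Window size and k are the knobs you
    tweak until false positives die down."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline yet
    mu = mean(history)
    sigma = stdev(history)
    return sigma > 0 and abs(current - mu) > k * sigma

# Illustrative known-good CPU samples and a new reading.
baseline = [22.0, 25.1, 24.3, 23.8, 26.0, 24.9]
if out_of_baseline(baseline, 71.5):
    print("ALERT: metric is outside its established baseline")
```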
Sroni4967@reddit
What stack did you end up keeping?
Important-Bug-6709@reddit (OP)
We didn’t really land on a “clean” stack, it’s mostly basic metrics + logs, and then SSH when something looks off, we tried heavier setups before but it just wasn’t worth the maintenance for our size, still not perfect, just the least annoying setup so far.
Trip_Owen@reddit
Damn, that's a long ass sentence.
Important-Bug-6709@reddit (OP)
Sorry, I just want to be as specific and detailed as possible to learn about other people's experiences.
KStieers@reddit
About to do that with our security stack.
Important-Bug-6709@reddit (OP)
What set of security measures are you currently using?
KStieers@reddit
EDR, email, NDR, a Sysmon-like tool -> XDR & SIEM.
I hear about the individual detections way faster and end up digging in there
Kind_Ability3218@reddit
if you get an alert that tells you about a problem and then you investigate that problem in a console then the alert did its job.
you need to decide for yourself if all of the tools are worth it. if you're repeating tasks find a way to automate them.
tools should solve a problem. sometimes the problem is, "i don't know enough about this and would like to learn".