What metrics do you actually track for website/server monitoring?
Posted by nilkanth987@reddit | sysadmin | 28 comments
There are so many things you can monitor - uptime, response time, CPU, memory, error rates, logs, etc.
But in reality, I’m curious what people here actually rely on day-to-day.
If you had to keep it simple, what are the few metrics that genuinely helped you catch real issues early?
Also curious:
- What did you stop tracking because it was just noise?
- Any metrics that sounded important but never really helped?
Trying to avoid overcomplicating things and focus on what actually matters in production.
Maleficent_Proof6911@reddit
The best way to monitor something is to do it from a client's point of view. Monitoring an asset from within a data center or a specific network might show you that everything is working correctly, while some clients still can't access your page/application. Tools like websitepulse.com provide transaction monitoring for all possible scenarios from more than 40 locations around the world. If a specific backbone connection is misbehaving, some clients might experience higher response times, so monitoring your website/server from different geographic locations can give you a heads-up before you start losing money.
In my opinion, monitoring a transaction scenario is much more realistic and effective than monitoring a website or a server. If you can go through an ordering process from outside your network, for example, that means everything is working correctly on your end (scripts, forms, frames, images, databases, etc.). It is much better to monitor a single transaction than multiple servers and databases. In the end, response times and uptime are the metrics you want to track to make sure you are not losing clients and revenue.
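A bare-bones version of such a transaction check might look like the sketch below. Every URL, field name, and threshold here is a placeholder, and the Python `requests` library is assumed; real services like websitepulse.com do much more (multiple locations, browsers, etc.), this is just the idea:

```python
import time
import requests

# Hypothetical order flow - every URL and field name here is a placeholder.
BASE = "https://shop.example.com"

def check_order_flow(timeout=10):
    """Walk a minimal ordering transaction and time each step.
    Any non-2xx response or slow step counts as a failure."""
    session = requests.Session()
    steps = [
        ("load product page", "GET",  f"{BASE}/products/widget", None),
        ("add to cart",       "POST", f"{BASE}/cart/add", {"sku": "WIDGET-1", "qty": 1}),
        ("start checkout",    "GET",  f"{BASE}/checkout", None),
    ]
    for name, method, url, data in steps:
        start = time.monotonic()
        resp = session.request(method, url, data=data, timeout=timeout)
        elapsed = time.monotonic() - start
        if not resp.ok or elapsed > 3.0:  # example threshold: 3s per step
            return False, f"{name} failed: HTTP {resp.status_code} in {elapsed:.2f}s"
    return True, "order flow OK"

if __name__ == "__main__":
    ok, detail = check_order_flow()
    print(("PASS" if ok else "FAIL"), "-", detail)
```

Running it from outside your own network is the whole point: if the scripted transaction passes from the client's side, the stack behind it is working end to end.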
chickibumbum_byomde@reddit
started with a really long list of metrics, but over time trimmed it down to what's actually relevant and necessary to catch real issues.
In practice, it comes down to a few things: is the service reachable, is it responding fast enough, and is the system under stress? Uptime, response time, and basic resource usage like CPU, memory, and disk tend to cover most real-world problems early. best tip i received: just because you can monitor something doesn't mean it's useful, especially if it never results in a decision or alert.
using checkmk atm, for both work and homelab, can't complain. i've trimmed it down to "root cause" monitoring: thresholds on basic usage/resource monitoring, plus a few specific special agent checks for VMs, container health, clustering health, etc. In the end, you wanna monitor the few things that tell you "something is wrong" early, and ignore the rest until you actually need it.
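as a rough sketch of that "reachable / fast enough / under stress" idea (psutil and requests assumed available, the endpoint and all thresholds are just examples, checkmk does this properly):

```python
import psutil     # third-party; assumed available
import requests

URL = "https://example.com/health"  # placeholder endpoint
MAX_RESPONSE_S = 2.0                # example thresholds - tune to your environment
MAX_CPU_PCT = 90
MAX_MEM_PCT = 90
MAX_DISK_PCT = 85

def check():
    problems = []
    # Is the service reachable, and fast enough?
    try:
        resp = requests.get(URL, timeout=MAX_RESPONSE_S)
        if not resp.ok:
            problems.append(f"HTTP {resp.status_code}")
    except requests.RequestException as exc:
        problems.append(f"unreachable: {exc}")
    # Is the system under stress?
    if (cpu := psutil.cpu_percent(interval=1)) > MAX_CPU_PCT:
        problems.append(f"CPU {cpu:.0f}%")
    if (mem := psutil.virtual_memory().percent) > MAX_MEM_PCT:
        problems.append(f"memory {mem:.0f}%")
    if (disk := psutil.disk_usage("/").percent) > MAX_DISK_PCT:
        problems.append(f"disk {disk:.0f}%")
    return problems

if __name__ == "__main__":
    issues = check()
    print("ALERT: " + ", ".join(issues) if issues else "all good")
```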
nilkanth987@reddit (OP)
That’s a great way to put it - “just because you can monitor something doesn’t mean it’s useful.”
Feels like a lot of setups start with everything and then slowly converge to a few signals that actually drive decisions.
Interesting that you mentioned catching "something is wrong" early - do you rely more on response time for that, or on resource usage like CPU/memory?
SudoZenWizz@reddit
We monitor our clients' webapps (LAMP stacks) with checkmk (as clients and partners of checkmk).
We monitor TCP connections, Apache status and workers, php-fpm logs and status, PHP processes (the apps fork PHP CLI workers), and MySQL databases.
Many times (the last one yesterday) it was a user doing "stupid" things in the app and generating hundreds of requests per second. We caught it from the alerts for the HTTP active checks on the entry point and the number of PHP processes.
With the checkmk agent we also saw that disk I/O spiked, and we identified the culprit.
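Stripped down to just the process-count side, the idea looks roughly like this (psutil assumed; checkmk's agent does this natively, this is only an illustration and the threshold is made up):

```python
import psutil  # third-party; checkmk's agent does this natively

MAX_PHP_PROCS = 50  # example threshold; tune per app

def count_php_processes():
    """Count running php / php-fpm processes, forked CLI workers included."""
    count = 0
    for proc in psutil.process_iter(["name"]):
        name = (proc.info["name"] or "").lower()
        if name.startswith("php"):
            count += 1
    return count

if __name__ == "__main__":
    n = count_php_processes()
    state = "CRIT" if n > MAX_PHP_PROCS else "OK"
    print(f"{state} - {n} php processes (threshold {MAX_PHP_PROCS})")
```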
nilkanth987@reddit (OP)
That’s a pretty comprehensive setup.
Interesting how the combination of HTTP checks + process-level metrics helped catch that, especially with something like a sudden spike from user behavior.
Also nice that you could actually trace it back to the culprit with disk I/O + process visibility.
Do you find most issues come from user-triggered spikes like that, or more from system-level bottlenecks?
SudoZenWizz@reddit
From these types of apps: always the user, or other API integrations (which I consider users in the end).
SquashNo7817@reddit
This depends on the server type. If you run something like Amazon.com then do everything - read Google's SRE book.
How many users?
nilkanth987@reddit (OP)
Makes sense, scale changes everything.
At what point do you think it's worth moving from "simple signals" to full SRE-level monitoring?
SquashNo7817@reddit
Why don't you post some information or a verified link? It is as if you are building a Master's thesis for people's responses.
nilkanth987@reddit (OP)
Fair enough, didn’t mean to come across that way.
Was just trying to understand how people approach monitoring in real setups. Appreciate the input.
nilkanth987@reddit (OP)
I totally understand why you would think like this.
In fact, I have been working a lot in this area because I am developing an application for monitoring. So I was trying to figure out what works best in reality rather than on paper.
Your input has been extremely helpful to me; most of your answers emphasize simplicity and practicality rather than collecting everything.
NaturalIdiocy@reddit
nah, most of them are just building out their marketing campaign or their vibecode prompt
SquashNo7817@reddit
Shitty spamming your job. Gtfo
03263@reddit
DISK USAGE
have seen it happen many times: an out-of-control log file fills up the whole disk, and even getting an SSH session becomes impossible. Most programs stop working because they can't write a single byte to any cache or config files. Can cause quite a mess.
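the core of a disk check is tiny, something like this stdlib-only sketch (threshold is just an example):

```python
import shutil

THRESHOLD_PCT = 85  # example; many people alert somewhere in the 80-90% range

def disk_usage_pct(path="/"):
    """Percentage of the filesystem at `path` currently in use."""
    usage = shutil.disk_usage(path)  # stdlib, no dependencies
    return usage.used / usage.total * 100

if __name__ == "__main__":
    pct = disk_usage_pct("/")
    if pct >= THRESHOLD_PCT:
        print(f"ALERT: / is {pct:.1f}% full")
    else:
        print(f"OK: / is {pct:.1f}% full")
```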
nilkanth987@reddit (OP)
100% this.
Disk issues always seem “minor” until everything suddenly stops working 😅
Logs filling up has probably caused more downtime than most “complex” failures.
Do you usually set alerts at a fixed threshold (like 80–90%), or based on growth trends?
03263@reddit
Yeah, 80-90%, and hope it's not an out-of-control process but slow growth that can be stopped before it gets bad. I've had times where I'm deleting temp files and the disk just fills right back up almost instantly, and it can be hard to track down what's writing (in that case I think it was a web script running under Apache).
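for comparison, a crude sketch of the growth-trend approach: sample free space twice and extrapolate (stdlib only, numbers are examples):

```python
import shutil
import time

def time_to_full(path="/", interval_s=60):
    """Sample free space twice and extrapolate when the disk fills up.
    Crude linear estimate, but enough to tell slow growth from a runaway process."""
    free_before = shutil.disk_usage(path).free
    time.sleep(interval_s)
    free_after = shutil.disk_usage(path).free
    growth = free_before - free_after        # bytes consumed over the interval
    if growth <= 0:
        return None                          # not growing right now
    return free_after / growth * interval_s  # seconds until full at this rate

if __name__ == "__main__":
    eta = time_to_full("/", interval_s=60)
    if eta is None:
        print("disk not growing")
    elif eta < 3600:
        print(f"ALERT: / full in ~{eta/60:.0f} minutes at current rate")
    else:
        print(f"OK: ~{eta/3600:.1f} hours of headroom at current rate")
```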
nilkanth987@reddit (OP)
Yeah that’s chaos 😅 clean it → instantly full again.
At that point it’s like “okay what is spamming writes??”
How do you usually track that down?
JoeJ92@reddit
We use LogicMonitor for all servers, certs, networking kit, etc. Our red flags usually revolve around the usual suspects: uptime, response, disk space.
We use ControlUp for Endpoint/AVD monitoring, dashboards, remote control. We also use their Scout bees product to monitor our web apps, which is definitely a handy tool to have, as you can simulate basic user actions and monitor response times.
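On the cert side of that, the core of an expiry check is small enough to sketch in the Python stdlib (this is just the idea, not what LogicMonitor actually does; the warning window is an example):

```python
import socket
import ssl
import time

WARN_DAYS = 30  # example: flag certs expiring within a month

def days_until_expiry(hostname, port=443):
    """Fetch the server certificate and return whole days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)

if __name__ == "__main__":
    days = days_until_expiry("example.com")
    print(("WARN" if days < WARN_DAYS else "OK") + f" - cert expires in {days} days")
```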
nilkanth987@reddit (OP)
That’s a solid setup.
The part about simulating user actions is interesting; feels like that's where basic uptime checks fall short, especially for catching issues that only show up during real flows.
Do you find those synthetic checks catch problems earlier than standard uptime/response monitoring?
JoeJ92@reddit
Yeah, it's definitely helped catch things early. We have a webapp that our claims team uses; it's very SQL-query-heavy when using the search function. It has helped detect SQL Server issues in the past and helped us tweak our scale sets to alleviate the issue at peak times, then scale back down out of peak hours. We also have a generic IIS server that hosts a bunch of internal web apps, and it just generally helps detect websites running like shit.
We have it hooked into some SaaS apps as well, which we've used in tickets with vendors in the past as evidence of their systems running poorly. It's just a good visual tool to have, beyond relying on users saying "it's slow".
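A minimal sketch of that kind of evidence-gathering, assuming requests and a placeholder URL: sample the endpoint repeatedly and report percentiles rather than a single anecdote.

```python
import statistics
import time
import requests

URL = "https://vendor-app.example.com/login"  # placeholder
SAMPLES = 20

def sample_response_times(url, n=SAMPLES):
    """Hit the endpoint n times and report p50/p95 - concrete numbers
    to attach to a vendor ticket instead of 'users say it's slow'."""
    times = []
    for _ in range(n):
        start = time.monotonic()
        requests.get(url, timeout=30)
        times.append(time.monotonic() - start)
        time.sleep(1)  # don't hammer the target
    p50 = statistics.median(times)
    p95 = statistics.quantiles(times, n=20)[18]  # 95th percentile cut point
    return p50, p95

if __name__ == "__main__":
    p50, p95 = sample_response_times(URL)
    print(f"p50 {p50:.2f}s / p95 {p95:.2f}s over {SAMPLES} samples")
```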
AmazingHand9603@reddit
I keep it pretty basic and it covers most real issues. Uptime monitoring is non-negotiable, then response time for the site. Error rates for the main app endpoints. The only resource stats I watch are CPU and memory on the web server itself. I used to log more things like network throughput or disk IO, but it turned into a pile of graphs that were never useful for finding problems unless something was already melting. The biggest thing I stopped tracking was super granular logs and traffic metrics; they were just noise for day-to-day. If the site is up and fast, and errors aren’t spiking, I’m happy.
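A rough sketch of that error-rate check, assuming a combined-format access log at a hypothetical path (the alert threshold and window are just examples):

```python
import re

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path; combined log format assumed
ALERT_PCT = 5.0                         # example: alert if >5% of recent requests are 5xx

def recent_error_rate(path=LOG_PATH, window=1000):
    """Percentage of 5xx responses over the last `window` log lines."""
    with open(path) as f:
        lines = f.readlines()[-window:]
    # Status code sits right after the quoted request line in combined format.
    statuses = [m.group(1) for line in lines
                if (m := re.search(r'" (\d{3}) ', line))]
    if not statuses:
        return 0.0
    errors = sum(1 for s in statuses if s.startswith("5"))
    return errors / len(statuses) * 100

if __name__ == "__main__":
    rate = recent_error_rate()
    print(("ALERT" if rate > ALERT_PCT else "OK") + f" - {rate:.1f}% 5xx in recent traffic")
```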
nilkanth987@reddit (OP)
Really like the “if it’s up, fast, and errors are stable, it’s good” mindset.
Feels like a lot of people over-track and end up ignoring everything.
Curious - was there a specific incident that made you simplify your monitoring setup this much?
AmazingHand9603@reddit
Yeah. I had a slowdown once and spent too much time on graphs that were not helping. The real problem was clear from response time and error spikes, so I simplified my setup after that.
nilkanth987@reddit (OP)
That’s a great example of signal vs noise.
It’s interesting how the metrics that actually matter are usually the simplest ones, but they get buried under everything else.
Did you end up setting alerts specifically on response time + error spikes after that?
DrockByte@reddit
Had to double check this wasn't posted by the same guy who had three servers die in one year, and blamed the customer for his own lack of monitoring.
nilkanth987@reddit (OP)
lol yeah definitely trying to avoid being that guy 😅
Figured better to learn from others before something breaks.
DeifniteProfessional@reddit
Obvious ad bait is obvious
nilkanth987@reddit (OP)
Fair take, but this was genuinely just to understand what people actually monitor in real setups. Trying to avoid overengineering things.