
“Is the server up?” Is the Wrong Question. Here’s the Right One.

A ping response tells you a server is reachable. It tells you nothing about whether your website loads, your API returns valid data, or your checkout flow works. Real uptime monitoring checks what your users actually experience — and the gap between the two is where outages hide.


Aideworks Team

hello@aideworks.com

Key takeaways

  • Ping and ICMP checks confirm a server is reachable — not that your application works.
  • A real HTTP check verifies the full response: status code, response time, and that the application actually responded.
  • APIs can return HTTP 200 while silently delivering error payloads — only content-aware monitoring catches this.
  • Response time matters: a site that takes 8 seconds to load is effectively down for most users.
  • Two-check confirmation before alerting eliminates false alarms from transient network blips.

The ping problem — why ICMP is not uptime monitoring

For decades, network engineers have used ping — an ICMP echo request — as a quick sanity check that a host is alive. It's fast, simple, and available everywhere. And it is completely useless for telling you whether your users can access your service.

Ping operates at the network layer. It asks: “is this IP address reachable?” It says nothing about what's running on that IP, whether a web server is listening, whether an application has crashed, whether a database connection has been exhausted, or whether the response your users receive makes any sense.

A server can respond to ping flawlessly while your Apache or Nginx process has hung, your Node application has run out of memory and is returning 503s, or your PHP-FPM pool is saturated and every request is timing out after 30 seconds. Ping says: “all good.” Your users say otherwise.

Ping / ICMP check

  • Checks: is the IP reachable?
  • Does not verify web server is running
  • Does not check HTTP status code
  • Does not measure page response time
  • Misses application-level errors (500, 503)
  • Misses "silent failures" — server up, app broken

HTTP response check

  • Checks: did the app respond correctly?
  • Verifies HTTP status code (2xx / 3xx)
  • Measures full response time end-to-end
  • Follows redirect chains (301, 302)
  • Catches app crashes, 500 errors, timeouts
  • Reflects the actual user experience

What a real HTTP check actually verifies

A proper uptime check makes an HTTP GET request to a URL — typically https://yourdomain.com/ — and evaluates the response. At minimum, it verifies:

  • The TCP connection succeeds — the server accepted the connection on port 443.
  • The TLS handshake completes — the certificate is valid and the encrypted channel is established.
  • The HTTP response code is in the expected range — typically 2xx for operational endpoints, or 3xx if a redirect is expected.
  • The response arrives within a timeout — if the server takes 30 seconds to respond, that is a failure regardless of the status code.

Each of these steps can fail independently. A server that accepts TCP connections but stalls the TLS handshake is effectively down. A server that completes TLS but returns HTTP 503 is serving errors. A server that returns HTTP 200 but takes 12 seconds to do so is functionally broken for most users.

Ping checks none of this. It only validates step zero: network reachability.
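The four stages above map directly to code. Here is a minimal sketch using only Python's standard library; the host and timeout values are placeholders, and a production checker would add per-stage timing and certificate expiry checks:

```python
import socket
import ssl
import time
from http.client import HTTPSConnection

def http_check(host: str, path: str = "/", timeout: float = 10.0) -> dict:
    """Run the verification stages of a real HTTP check.

    Each stage can fail independently; the returned dict records how far
    the check got and how long the full round trip took.
    """
    start = time.monotonic()
    result = {"tcp": False, "tls": False, "status": None, "ok": False}
    try:
        # Stages 1 and 2: TCP connect plus TLS handshake. connect() performs
        # both, and the timeout also covers a stalled handshake.
        conn = HTTPSConnection(host, 443, timeout=timeout)
        conn.connect()
        result["tcp"] = True
        result["tls"] = True

        # Stage 3: send the request and read the full response, so the
        # measured time covers first byte to last byte.
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        result["status"] = resp.status
        elapsed = time.monotonic() - start

        # Stage 4: the status range and the timeout both matter.
        result["ok"] = 200 <= resp.status < 400 and elapsed < timeout
        result["elapsed_ms"] = round(elapsed * 1000)
        conn.close()
    except (TimeoutError, ssl.SSLError, OSError):
        # A refused connection, a stalled handshake, or a timeout all
        # land here: the check fails, and we still record elapsed time.
        result["elapsed_ms"] = round((time.monotonic() - start) * 1000)
    return result
```

A ping-style monitor would report success as soon as the host answered; this check only reports `ok` when every stage, including timing, passes.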

Real-world failures that ping never sees

The PHP-FPM death spiral

A high-traffic shared hosting server experiences a surge. PHP-FPM worker processes are exhausted. New requests queue, then time out. The web server returns 504 Gateway Timeout to every user. Meanwhile, the server itself is perfectly reachable — ping responds in 2 ms. An ICMP-based monitor shows green. Every single user sees an error page.

The silent deploy that broke routing

A developer deploys a new version of a Next.js application. A misconfigured rewrite rule sends all requests to a 404 handler. The server responds instantly with HTTP 404 to every request. Users cannot access any page of the site. Ping: 100% success rate. HTTP check: immediately detects the 404 and pages the on-call engineer.

The database connection pool exhaustion

A Node.js API service starts leaking database connections due to a missing finally block. Over four hours, the connection pool fills up. Requests begin returning HTTP 500 as the ORM throws connection errors. The API container is running. The network is fine. Ping is happy. Users are getting error responses on every single request.

The “it works on the home page” problem

Some teams monitor only the root URL of a domain. The root URL serves a static marketing page from a CDN — it almost never goes down. Meanwhile, the /api/checkout endpoint, which hits the payment service and database, has been returning 502 for 40 minutes. Revenue has been silently failing. The root URL monitor shows green throughout.

Why API monitoring is a completely different problem

APIs introduce a new failure mode that HTTP status codes alone cannot catch: the “200 OK but everything is wrong” problem.

Many APIs return HTTP 200 for all responses — including errors — and encode the actual outcome in the response body as a JSON field like {"status": "error", "message": "..."}. A naive HTTP check that only validates the status code will miss these failures entirely.

Even APIs that correctly use HTTP error codes have subtleties. An endpoint might return HTTP 200 with an empty array when it should return 50 records — indicating a silent data pipeline failure. It might return HTTP 200 with stale cached data from three days ago. It might return 200 in under 100 ms for simple queries but time out on any request that touches more than a few database rows.
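A content-aware check therefore has to parse the response body, not just the status line. A minimal sketch, assuming the `{"status": "error"}` envelope described above (the field names vary by API and are illustrative):

```python
import json

def content_aware_check(status_code: int, body: bytes) -> tuple[bool, str]:
    """Judge an API response by its payload as well as its status code.

    Catches the "200 OK but everything is wrong" failure mode: a 2xx
    response whose JSON body flags an error is still a failed check.
    """
    if not 200 <= status_code < 300:
        return False, f"HTTP {status_code}"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    # The envelope fields below are assumptions; adapt them to your API.
    if payload.get("status") == "error":
        return False, payload.get("message", "API reported error")
    return True, "ok"

# A naive status-code check would pass this response; content checking fails it:
ok, reason = content_aware_check(200, b'{"status": "error", "message": "db down"}')
```

The same pattern extends to the other silent failures mentioned above, for example asserting that an expected field is non-empty or that a timestamp in the payload is recent.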

Practical API monitoring checklist

  • Monitor the health endpoint (/health, /status, /ping) — most modern APIs expose one.
  • If no health endpoint exists, monitor a lightweight read-only endpoint that exercises the database.
  • Set a response time threshold — an API that responds in 8 seconds is as useful as one that is down.
  • Use a dedicated monitoring token with read-only permissions, not a shared API key.
  • Alert on 4xx responses from your health endpoint — they indicate configuration or auth breakage, not just load.
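The checklist's alerting rules condense into a small decision function. A sketch, with illustrative threshold defaults rather than recommendations:

```python
def evaluate_health_check(status_code: int, elapsed_ms: float,
                          slow_threshold_ms: float = 2000) -> str:
    """Classify a single health-endpoint response per the checklist above.

    Returns "ok", "slow", or "down". The 2000 ms default is an assumption;
    tune it to your API's normal baseline.
    """
    if status_code >= 500:
        return "down"   # server-side failure
    if 400 <= status_code < 500:
        return "down"   # 4xx on a health endpoint means broken config or auth
    if elapsed_ms > slow_threshold_ms:
        return "slow"   # responding, but too slow to count as healthy
    return "ok"
```

Treating 4xx as "down" rather than "degraded" is deliberate: a health endpoint should never demand credentials it does not have, so a 401 there signals breakage, not load.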

Response time: the invisible half of uptime

There is a well-known UX research finding from the 1990s that still holds: users perceive responses under 100 ms as instant, up to 1 second as acceptable, up to 10 seconds as the limit of attention, and above 10 seconds as a reason to navigate away. For e-commerce, the correlation with conversion rate is direct and measurable.

A page that loads in 6 seconds is “up” in the technical sense — the server is responding, the status code is 200, no errors are thrown. But the bounce rate climbs. Conversion drops. Users who abandon the session may never return. The monitoring dashboard shows green throughout.

Response time monitoring captures this. By recording how long each check takes from the first byte of the request to the last byte of the response, you build a baseline of normal performance. When response time doubles — from 400 ms to 800 ms — that is a signal worth investigating even if the site is technically “up”. It may indicate a slow database query, a CDN misconfiguration, a memory leak causing garbage collection pauses, or the early signs of a traffic surge that will become an outage in two hours.

# Response time trend — api.clienta.nl

12:00 GET /api/status → 200 142 ms

12:01 GET /api/status → 200 138 ms

12:02 GET /api/status → 200 151 ms

12:03 GET /api/status → 200 892 ms ⚠ slow

12:04 GET /api/status → 200 1 240 ms ⚠ slow

12:05 GET /api/status → 504 timeout ✕ DOWN

12:06 GET /api/status → 504 timeout ✕ DOWN → alert sent

Notice the pattern: response time degraded over three minutes before the outage occurred. If you are recording response time, you can see the trend coming. If you are only tracking up/down status, the dashboard shows green right up to the moment of failure, with no warning.
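Spotting that degradation automatically needs only a rolling baseline. A sketch, with an assumed 30-sample window and a 2x slowdown factor (both tunable, not fixed rules):

```python
from collections import deque
from statistics import median

class ResponseTimeTrend:
    """Track a rolling latency baseline and flag samples that double it."""

    def __init__(self, window: int = 30, factor: float = 2.0):
        self.samples: deque = deque(maxlen=window)
        self.factor = factor

    def record(self, elapsed_ms: float) -> bool:
        """Record one sample; return True if it is anomalously slow."""
        # Require a few samples before judging, so the baseline is meaningful.
        slow = (len(self.samples) >= 5 and
                elapsed_ms > self.factor * median(self.samples))
        self.samples.append(elapsed_ms)
        return slow

trend = ResponseTimeTrend()
for ms in [142, 138, 151, 140, 145]:
    trend.record(ms)   # builds the baseline; none of these are flagged
trend.record(892)      # flagged: well over 2x the roughly 142 ms median
```

The median rather than the mean keeps one outlier from dragging the baseline up, so a single slow check does not mask the next one.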

The false alarm problem — and how to solve it

Any monitoring system that runs frequent checks will occasionally see transient failures — a single TCP connection that didn't complete, a DNS resolution that timed out for one check before succeeding on the next, a CDN edge node that hiccupped for 200 ms. If every single one of these triggers an alert, the monitoring system becomes noise, and engineers learn to ignore it.

The standard solution is consecutive-failure confirmation. Rather than alert on the first failed check, require two (or three) consecutive failures before triggering a notification. This adds at most one check interval of latency to your alert time — typically one to two minutes — and eliminates nearly all false positives from transient network issues.

The flip side is recovery confirmation. When a site that was down starts responding again, you want to know immediately — but you also do not want to send a recovery notification for a single successful check that is immediately followed by another failure. In practice, a single successful response after a confirmed outage is sufficient for a recovery alert; waiting for two successes would needlessly delay the “all clear”.
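Both confirmation rules fit in a small state machine. A sketch using the two-failure threshold described above:

```python
class AlertState:
    """Two consecutive failures confirm an outage; one success confirms recovery.

    record_check() returns the alert to send ("down" or "up"), or None
    when nothing should fire.
    """

    def __init__(self, failures_to_confirm: int = 2):
        self.failures_to_confirm = failures_to_confirm
        self.consecutive_failures = 0
        self.confirmed_down = False

    def record_check(self, success: bool):
        if success:
            self.consecutive_failures = 0
            if self.confirmed_down:
                # One success after a confirmed outage is enough: send
                # the recovery alert immediately.
                self.confirmed_down = False
                return "up"
            return None
        self.consecutive_failures += 1
        # First failure: wait one more interval instead of alerting on a blip.
        if (not self.confirmed_down and
                self.consecutive_failures >= self.failures_to_confirm):
            self.confirmed_down = True
            return "down"
        return None

monitor = AlertState()
monitor.record_check(True)    # healthy: None
monitor.record_check(False)   # transient blip: None, no alert yet
monitor.record_check(False)   # second consecutive failure: "down"
monitor.record_check(True)    # back: "up" immediately
```

A lone failure sandwiched between successes produces no alert at all, which is exactly the false-alarm suppression described above.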

What to actually monitor in practice

The right set of endpoints to monitor depends on your application, but some principles apply broadly:

For websites and landing pages

Monitor the root URL and any page that represents a critical user journey. For an e-commerce site, that means / (home), /products (catalogue), and /checkout (conversion). Do not just monitor the home page and assume the rest is fine — CDN-cached pages and dynamically rendered pages have very different failure modes.

For web applications behind login

Expose a dedicated /health endpoint that does not require authentication but exercises the critical application path — at minimum a database read and a cache read. Return JSON with component-level status fields. Monitor this endpoint with a 2xx expectation and a tight response time threshold (e.g. under 500 ms).
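One way to structure such an endpoint's response, sketched as a framework-agnostic function (the component names and JSON shape are illustrative, not a standard):

```python
def build_health_payload(checks: dict) -> tuple:
    """Compose per-component checks into a /health response.

    `checks` maps component name to pass/fail, e.g. the result of a
    database read and a cache read. Returns (http_status, json_body).
    """
    healthy = all(checks.values())
    body = {
        "status": "ok" if healthy else "degraded",
        "components": {name: ("ok" if up else "fail")
                       for name, up in checks.items()},
    }
    # Returning 503 on any failed component lets a plain 2xx-expectation
    # monitor catch the problem without parsing the body.
    return (200 if healthy else 503), body

status, body = build_health_payload({"database": True, "cache": True})
```

The component-level fields are for humans debugging the outage; the status code is for the monitor.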

For REST and GraphQL APIs

Monitor your /health or /status endpoint. If it does not exist, create one — the operational overhead is negligible and the monitoring value is high. Additionally, monitor one representative read endpoint with a real query that exercises the database. A health endpoint that only checks “is the process running” will not catch a corrupted database or a broken connection pool.

For third-party integrations and webhooks

Many applications depend on third-party services — payment processors, email APIs, CRM webhooks. Monitor any endpoint in your own infrastructure that receives these integrations. A broken Stripe webhook handler that silently drops payment confirmations will not show up in any server metric — only in missing orders.

Monitor what your users actually experience

Aideworks checks every monitored URL every minute via real HTTP requests — not ping. Response time logged on every check. Instant alert after two consecutive failures. Recovery notification the moment your site comes back.

Start monitoring free