HostiServer
2026-06-30 12:54
Blackbox Monitoring in 2026: UptimeRobot, Gatus, Blackbox Exporter and When to Pick Which
External server availability monitoring: how to learn about the problem before your user does
The standard procedure for a serious production setup: an agent runs on the server (Prometheus node_exporter, Zabbix agent, Netdata, Datadog), collecting CPU, RAM, disk, and network metrics. Everything is logged, there are nice dashboards, alerts are configured. Full control, seemingly.
Until the moment the OS hangs, the network stack fails, or the server hits a kernel panic. The agent goes silent, because the whole server has gone silent. The dashboards show the last known state before the crash. There are no alerts. The user who is refreshing your homepage and seeing a browser timeout (or a plain connection refused) learns about the problem faster than you do.
This is the classic blind spot of whitebox monitoring: it sees a lot on the inside, but sees nothing when there's no one on the inside to ask. External monitoring (blackbox) closes this gap from the opposite side, checking the server's availability from the outside, independently of its state.
ℹ️ Article series context: in our article "How to set up server logging and monitoring" we've covered the whitebox side in detail: collecting metrics from inside, log aggregation, Prometheus + Grafana, ELK Stack. The current article is about the other half of the puzzle: an independent check from the outside, without which any monitoring setup is incomplete.
1. Blackbox vs whitebox — two points of view
Whitebox monitoring sees the server "from the inside": it knows about processes, kernel metrics, disk I/O, network connections, application logs. It's deep, precise, and lets you not just note a problem but understand its cause.
Blackbox monitoring sees the server the way a user sees it: does the page open, does the API respond, is the SSL certificate valid, how long does the TCP handshake take. It's shallow, says nothing about the causes, but has one unique property: it works even when whitebox is silent.
| Parameter | Whitebox (from inside) | Blackbox (from outside) |
|---|---|---|
| What it sees | CPU, RAM, disk, processes, logs | Endpoint availability, response time, SSL, DNS |
| Data depth | High: down to a specific process | Shallow: "works / doesn't work" |
| Dependence on the server | Full: server dies, metrics die with it | None: monitoring is independent |
| Sees network problems between user and server | No | Yes |
| Reveals the cause of a problem | Yes | Only the symptom |
The conclusion is obvious: these two approaches don't replace each other, they complement each other. Whitebox says "why something isn't working." Blackbox says "it isn't working." On its own, each leaves blind spots. Together they cover most real failure scenarios.
2. What exactly to check from the outside
Before discussing tools, you need to understand the set of basic checks that make up any external monitoring setup.
HTTP/HTTPS
The basic check for web services. The monitoring agent makes a GET request to your endpoint and looks at:
- The HTTP code of the final response (200 is normal; 3xx redirects before it are fine, that's not a problem. But 4xx or 5xx as the final code — or in the chain after 3xx — is a failure)
- Response time (a gradual slowdown often gives a warning before a full outage)
- Presence of a keyword in the response (e.g. a snippet of your homepage's text, to rule out the "serves the wrong thing" case)
- Validity of HTTP headers (e.g. redirect chains, cache-control)
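To make the list above concrete, here is a minimal sketch of such a check using only the Python standard library. The names `HttpResult`, `fetch`, and `evaluate`, as well as the 1-second "slow" threshold, are assumptions made for illustration and are not taken from any particular monitoring tool.

```python
import http.client
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class HttpResult:
    status: Optional[int]  # final status after redirects; None if no HTTP response arrived
    elapsed_s: float       # wall-clock time for the whole request
    body: str


def fetch(url: str, timeout_s: float = 5.0) -> HttpResult:
    """GET the URL. urllib follows 3xx redirects and reports the final status."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return HttpResult(resp.status, time.monotonic() - start, body)
    except urllib.error.HTTPError as exc:
        # 4xx or 5xx as the final code
        return HttpResult(exc.code, time.monotonic() - start, "")
    except (OSError, http.client.HTTPException):
        # connection refused, timeout, DNS or TLS failure, broken response
        return HttpResult(None, time.monotonic() - start, "")


def evaluate(result: HttpResult, keyword: str, slow_after_s: float = 1.0) -> str:
    """Map a probe result to 'ok', 'slow', or 'down'."""
    if result.status is None or result.status >= 400:
        return "down"
    if keyword not in result.body:
        return "down"  # the server answers, but serves the wrong thing
    if result.elapsed_s > slow_after_s:
        return "slow"
    return "ok"
```

A probe run would then look like `evaluate(fetch("https://example.com"), "Welcome to Example")`. Keeping the verdict logic separate from the network call makes the thresholds easy to tune and to test without touching the network.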
ℹ️ HTTP/3 (QUIC) is a separate topic: in 2026 a large share of traffic goes over HTTP/3 on UDP port 443. Classic blackbox checks send requests over TCP (HTTP/1.1 or HTTP/2), so they don't see it if it's specifically QUIC that has broken. Mobile clients then fall back to TCP and get hundreds of milliseconds of extra latency on every connection. If you serve HTTP/3, you should separately check UDP-port availability or have monitoring that supports HTTP/3 (Prometheus's Blackbox Exporter added such support in 2024, and many modern SaaS platforms monitor QUIC out of the box).
TCP port
For non-standard services: SSH (22), SMTP (25, 465, 587), IMAP (143, 993), databases (5432, 3306, 27017), custom APIs on unusual ports. The agent tries to open a TCP connection and checks whether the server responds with SYN-ACK.
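A TCP check of this kind is only a few lines. The sketch below assumes a hypothetical helper name, `tcp_port_open`, and a 5-second timeout; it reports only whether the handshake completed, not whether the service behind the port is healthy.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection completes the full three-way handshake
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # refused, timed out, unreachable, or the name did not resolve
        return False
```

For example, `tcp_port_open("db.example.com", 5432)` would tell you whether PostgreSQL's port accepts connections from the probe's location (the hostname here is a placeholder).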
ICMP (ping)
The simplest check: "is the host alive at the network level." Useful as a quick indicator, but limited: many providers and CDNs block ICMP by default, so the false positive "ping didn't go through = the server is down" is a classic beginner mistake.
SSL certificate
A separate, underrated check. The monitoring agent connects to port 443, retrieves the certificate, and looks at its expiration date. You want to know about certificate expiry 30 days ahead, not at the moment when users are seeing scary browser warnings. In 2026 this is particularly relevant: certificate lifetimes keep shrinking (the CA/Browser Forum voted for a phased reduction to 47 days by 2029), so manual tracking stops working.
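As a rough illustration of what such a check does under the hood, here is a standard-library sketch. The function names and the 30-day and 7-day thresholds are assumptions chosen to match the thresholds used later in this article.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Return the certificate's notAfter field, e.g. 'Jan  5 09:34:43 2018 GMT'.

    With the default context, an already expired or otherwise invalid
    certificate raises ssl.SSLCertVerificationError during the handshake,
    which a caller should treat as critical.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after: str, now_epoch: float) -> float:
    """Days between now_epoch (seconds since the epoch) and certificate expiry."""
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / 86400


def expiry_level(days: float, warn_days: float = 30, crit_days: float = 7) -> str:
    """Translate remaining days into 'ok', 'warning', or 'critical'."""
    if days <= crit_days:
        return "critical"
    if days <= warn_days:
        return "warning"
    return "ok"
```

In practice you would call `expiry_level(days_left(fetch_not_after("example.com"), time.time()))` once or twice a day; certificate expiry does not need a one-minute interval.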
DNS
A check of correct DNS resolution for your domain from different locations. Catches situations where your DNS provider has a problem in a specific region and you don't know about it because in your region everything is working.
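A simple version of this check can be sketched as follows. It uses whatever resolver the probe host is configured with, so "from different locations" means running it on each probe node. The helper names and the "every returned address must be expected" policy are assumptions for illustration.

```python
import socket


def resolved_ips(hostname: str) -> set:
    """Resolve hostname with the system resolver and return the set of IP strings."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def dns_status(hostname: str, expected_ips: set) -> str:
    """Return 'ok', 'mismatch', or 'unresolved' for the given hostname."""
    try:
        actual = resolved_ips(hostname)
    except socket.gaierror:
        return "unresolved"
    # Every address the resolver returned must be one we expect
    return "ok" if actual and actual <= set(expected_ips) else "mismatch"
```

A stricter setup would query specific authoritative or public resolvers directly, which needs a DNS library rather than the system resolver.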
ℹ️ Check interval: for most scenarios 1-5 minutes is the sweet spot. 30-second checks create unnecessary noise and load, but they do make sense for critical e-commerce during peak hours. 10+ minutes is already too rare — you'll learn about the outage later than your customers do.
3. Tools and approaches
The external monitoring market is large and fragmented. All solutions can be split into three categories: SaaS, self-hosted, or hybrid. Each has its own advantages and its own scenarios.
3.1 SaaS: a quick start without your own infrastructure
SaaS services take on all the infrastructure work: you only register, enter the URL to check, and configure alerting channels. They have probe points in dozens of countries, which automatically gives you a global availability picture.
| Service | Free tier | Strengths |
|---|---|---|
| UptimeRobot | 50 monitors, 5-min interval | Simplest setup, a time-tested standard since 2010 |
| Better Stack (formerly Better Uptime) | 10 monitors, 3-min interval | Modern UI, good integration with incident management and status pages |
| StatusCake | 10 monitors, 5-min interval | UK-based service, good for GDPR-sensitive companies |
| Freshping | Trial only | Integration with the Freshworks stack, but the free tier was discontinued in 2024 |
When to pick SaaS: you don't have a dedicated engineer for self-hosted solutions; you want to set up monitoring in 15 minutes; your project is small or medium and the free tier will be enough; you need checks from genuinely global locations (South America, Asia, Oceania) that are hard to emulate with your own infrastructure.
What to watch out for: on free tiers the check interval is 3-5 minutes, so you learn about an outage with up to a 5-minute delay. Paid tiers (from $7/mo) give you 30-second checks. Also consider that you're sharing information about your infrastructure with a third party — for some regulated industries this can be an issue.
3.2 Self-hosted: Prometheus Blackbox Exporter
If you already have Prometheus deployed for whitebox monitoring, Blackbox Exporter is a natural extension. It's a separate process that runs HTTP/TCP/ICMP/DNS checks and exposes the results in Prometheus format.
A basic blackbox.yml config:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4"
      valid_status_codes: [200, 301, 302]
      fail_if_body_not_matches_regexp:
        - "Welcome to Example"
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp_check:
    prober: icmp
    timeout: 5s
And the matching Prometheus section that scrapes this exporter for specific URLs:
scrape_configs:
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
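Then set up alerting via Alertmanager on the metrics probe_success (0 = down) and probe_ssl_earliest_cert_expiry (the Unix timestamp at which the earliest certificate in the chain expires; subtract time() to get the time remaining, e.g. probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600 for a 30-day warning). Visualization via Grafana with ready-made dashboards for Blackbox Exporter (there are plenty on grafana.com/dashboards).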
When to pick it: you already have a Prometheus stack, have the skills to maintain it, and want a single alerting system for whitebox and blackbox.
3.3 Self-hosted: Gatus — a lightweight YAML alternative
Gatus is a younger and simpler option for those who don't want to deploy a full Prometheus stack just for external monitoring. A single Go binary, YAML config, built-in web UI with a status page, alerting via Slack/Telegram/Discord/PagerDuty/email out of the box.
A config.yaml for typical monitoring:
endpoints:
  - name: main-website
    url: "https://example.com"
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 1000"
      - "[CERTIFICATE_EXPIRATION] > 720h"
    alerts:
      - type: telegram
        failure-threshold: 3
        success-threshold: 2
  - name: api-health
    url: "https://api.example.com/health"
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == up"
alerting:
  telegram:
    token: "${TELEGRAM_TOKEN}"
    id: "${TELEGRAM_CHAT_ID}"
That's it. It runs as one docker run or as a systemd service on a VPS, consumes 30-50 MB RAM, and gives you a public status page for clients. In 2026 Gatus became one of the most popular open-source solutions precisely because of its simplicity.
When to pick it: you don't need Prometheus, you want the simplest possible system with low overhead, and you're interested in a public status page for clients.
3.4 Hybrid: self-hosted + SaaS for critical endpoints
The most sensible approach for a serious production setup. The logic goes: you do the bulk of the checks self-hosted (cheap, controlled, gives you all the data), and the most critical endpoints are duplicated through SaaS (a fallback for when your monitoring infrastructure itself is unavailable).
What usually gets offloaded to SaaS in a hybrid setup:
- The public homepage: what clients see first
- Login/auth endpoint: without it clients can't get in, even if the site opens
- Payment page: the most expensive downtime, worth paying for 30-second checks
- Public API health endpoint: if you have a public API with an SLA
The remaining 50-200 endpoints run on self-hosted infrastructure, with your own dashboards and full control over the data.
3.5 Comparison of the four approaches
| Approach | Price | Reliability | Data control | Complexity |
|---|---|---|---|---|
| SaaS (UptimeRobot, Better Stack) | $0-50/mo | High (vendor-guaranteed) | Low — at a third party | Minimal |
| Self-hosted Prometheus + Blackbox | Cost of a VPS + engineer time | Depends on your DevOps | Full | High |
| Self-hosted Gatus | Cost of a single VPS | Depends on your DevOps | Full | Medium |
| Hybrid (self-hosted + SaaS) | $10-30/mo + VPS | Highest | Partial | Depends on the base |
There is no universal answer: for a small project SaaS is better, for a medium one Gatus, for a large one with its own infrastructure and DevOps team a hybrid with Prometheus as the base fits.
4. Where to physically host the monitoring agent
This is the part that often gets skipped when setting up monitoring, but it's exactly what determines how useful the monitoring will be in the moment of a real outage.
4.1 Not on the same server that's being monitored
An obvious but frequently broken rule: the monitoring agent shouldn't live on the server it's checking. If both your application and your monitoring run on the same VPS, when that VPS goes down you'll learn nothing: the monitoring died with the server. This is a classic mistake that even experienced admins make when they're "saving money."
4.2 Not at the same provider in the same data center
If your entire stack (production server, backups, monitoring) is at one provider in a single DC, your monitoring has a blind spot the size of that DC. If the provider has a network, power, or cooling problem, your server and your monitoring both go down at once — and nobody will tell you about it.
The fix: put the monitoring agent on a VPS at a different provider, or at least in a different geographic region. It can be the cheapest VPS at $3-5/mo, because its only job is to ping your production server periodically and alert if something's off. The small extra cost pays for itself the one time it fires for real.
4.3 One provider across several DCs
If your provider has multiple data centers and you rent servers in more than one, the situation is more interesting. Monitoring agents can (and should) be placed in each DC, so you cover not just the availability of individual servers but also the connectivity between DCs. Then, in case of a problem, you'll see not just "server X is unreachable" but "server X is unreachable from DC Y but reachable from DC Z," which immediately points to the nature of the problem.
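The reasoning "unreachable from DC Y but reachable from DC Z" can be expressed as a small aggregation step over per-location results. The sketch below is illustrative only: the `diagnose` function and the location names are hypothetical, and real tools encode this logic in their own ways.

```python
def diagnose(results: dict) -> str:
    """results maps a probe location name to True if the target was reachable from there."""
    if not results:
        raise ValueError("at least one probe location is required")
    failed = sorted(loc for loc, ok in results.items() if not ok)
    if not failed:
        return "ok"
    if len(failed) == len(results):
        # Every vantage point agrees: the target itself is likely down
        return "target down: unreachable from every location"
    # Only some vantage points fail: suspect the network path, not the server
    return "partial outage: unreachable from " + ", ".join(failed)
```

For instance, `diagnose({"dc-y": False, "dc-z": True})` points at connectivity between DC Y and the target rather than at the target itself.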
The topology details depend on the specific infrastructure. If you have two DCs, two probe nodes are enough. If you have four, it's worth covering each one.
4.4 SaaS monitoring as the last resort
Even with the most sophisticated infrastructure of your own, there remains a scenario in which all of it is unavailable at once (a disaster at the provider, a BGP problem for an entire region, or even a mass outage of your primary DNS provider). In that case only SaaS monitoring from external infrastructure works.
So even in a fully self-hosted setup it's worth having at least one free UptimeRobot monitor with email/Telegram alerting that only fires in the event of a full blackout. One free-tier account costs zero, sets up in 5 minutes, and one day will save you from the situation where you don't know for three hours that literally nothing is working.
5. Alerting: channels, thresholds, false positives
Setting up the checks is only half the job. The other half is making sure you learn about an outage exactly when you need to, and aren't bothered when you don't. Bad alerting creates alert fatigue, and real incidents get lost in the flood of false positives.
Delivery channels
- Email: slow (you might not see it for hours), but archival. Fits non-critical warnings like "SSL expires in 30 days."
- Telegram / Slack: fast, with the possibility of reactions from colleagues. The standard during business hours.
- SMS / phone call: for night-time incidents. Costs money, but often the only way to reliably "wake" an on-call engineer.
- PagerDuty / Opsgenie: escalation policy, on-call rotation, integration with everything. For teams of 5+ engineers.
The "three in a row" rule
The simplest way to reduce false positives: an alert fires only after N consecutive failed checks. The typical value is 3 consecutive failures at a 1-minute interval. You learn about a real outage roughly 2-3 minutes after it starts (the third failed check lands two intervals after the first, plus the probe timeout and notification delay), but you don't get alerts about random 30-second network hiccups.
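The rule boils down to a small state machine. The sketch below is a simplified illustration with a hypothetical class name, `FlapFilter`; the two thresholds mirror the failure-threshold and success-threshold settings from the Gatus config earlier, but this is not how any specific tool implements it.

```python
from typing import Optional


class FlapFilter:
    """Emit 'triggered' after N consecutive failures and 'resolved' after M consecutive successes."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.failures = 0      # current run of failed checks
        self.successes = 0     # current run of successful checks
        self.alerting = False  # True between 'triggered' and 'resolved'

    def observe(self, ok: bool) -> Optional[str]:
        """Feed one check result; return an event name or None."""
        if ok:
            self.failures = 0
            self.successes += 1
            if self.alerting and self.successes >= self.success_threshold:
                self.alerting = False
                return "resolved"
        else:
            self.successes = 0
            self.failures += 1
            if not self.alerting and self.failures >= self.failure_threshold:
                self.alerting = True
                return "triggered"
        return None
```

A single failed check surrounded by successes never produces an event, while the third failure in a row does. Requiring two successes before "resolved" avoids a trigger/resolve ping-pong when the service is flapping.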
"Warning" and "critical" thresholds
Not all problems are equally serious. It helps to have two levels:
- Warning: response time grew from 200ms to 1000ms, SSL expires in 30 days, 1-2 consecutive failed checks. Telegram notification, no nighttime alarm.
- Critical: 3+ consecutive failed checks, response time >5 seconds, SSL expires in 7 days. Telegram + email + SMS during off-hours.
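The two levels above can be written down as one classification function. This is a sketch with the article's example thresholds hard-coded; the function name, the `CHANNELS` mapping, and the exact boundary handling (for example, treating exactly 1000 ms as a warning) are assumptions.

```python
def classify(consecutive_failures: int, response_ms: float, cert_days_left: float) -> str:
    """Return 'critical', 'warning', or 'ok' using the example thresholds."""
    if consecutive_failures >= 3 or response_ms > 5000 or cert_days_left <= 7:
        return "critical"
    if consecutive_failures >= 1 or response_ms >= 1000 or cert_days_left <= 30:
        return "warning"
    return "ok"


# Delivery channels per level, as described in the list above
CHANNELS = {
    "ok": [],
    "warning": ["telegram"],
    "critical": ["telegram", "email", "sms"],
}
```

Checking the critical conditions first matters: a probe with three failures also satisfies the warning condition, and it should be reported at the higher level only.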
Escalation for prolonged incidents
If an alert hasn't been acknowledged within 15 minutes, escalate to the next on-call or to the team lead. This is basic hygiene for teams with night-time coverage. If one notification failed to wake the on-call, a second one 15 minutes later has a better chance.
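A time-based escalation policy like this one reduces to picking a position in an ordered chain. The following sketch assumes a hypothetical `escalation_target` helper and a three-step chain; tools such as PagerDuty or Opsgenie implement this for you with far more options.

```python
from typing import List, Optional


def escalation_target(minutes_since_alert: float, acknowledged: bool,
                      chain: List[str], step_minutes: float = 15) -> Optional[str]:
    """Return who should be notified now, or None once the alert is acknowledged."""
    if acknowledged:
        return None
    # Move one step down the chain for every full step_minutes without an ack,
    # and stay on the last entry once the chain is exhausted
    level = min(int(minutes_since_alert // step_minutes), len(chain) - 1)
    return chain[level]
```

With `chain = ["primary on-call", "secondary on-call", "team lead"]`, an unacknowledged alert reaches the secondary at 15 minutes and the team lead at 30.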
⚠️ The most common alerting mistake: single-level alerts for everything. When "response time >300ms" and "server unreachable for 5 minutes" arrive as the same kind of notification, within a month you'll be ignoring both. Splitting into warning and critical, with different channels and delivery times, isn't overkill, it's a necessity.
6. Minimum production setup
If you don't have time or the appetite to build the perfect system, here's the floor below which you shouldn't drop:
Must-have
- At least one external monitoring agent from a different location (not at the same provider hosting the primary service)
- An HTTP availability check on the homepage and one critical API endpoint
- An SSL certificate check with an alert 30 days before expiry
- Alerting through at least two channels (Telegram + email)
- The "3 consecutive failures" rule before alerting (wherever it's configurable), so you don't react to one-off hiccups
Nice to have
- 2-3 probe locations across different geographies to catch regional issues
- Two-level alerting (warning + critical) with different delivery channels
- A public status page for clients (Gatus does this out of the box)
- Integrating monitoring with incident management (PagerDuty, Opsgenie, or at least a shared inbox)
- Regular alert review: which fire often, which failed to fire when they should have
Common mistakes
- Probe on the same host. The first thing that needs fixing. Move monitoring off the production server.
- Overly aggressive thresholds. An alert on a 200ms latency will fire all day and solve nothing.
- No escalation. If the on-call doesn't answer, nobody will find out.
- A single delivery channel. If Telegram goes down along with your ISP, you won't see the alert.
- Alerts without a runbook. An alert "site unreachable" with no instructions on what to do is half the value of monitoring gone.
7. Summary
External monitoring isn't an alternative to internal, but a mandatory complement. Whitebox metrics give depth and help you understand the cause; blackbox checks guarantee that you'll learn about an outage regardless of the server's state. Without one of these halves your monitoring has a blind spot.
What to take away from this article:
- The probe shouldn't live on the same server it's monitoring. If it does, in the moment of a real outage there will be no alerts.
- A 1-5 minute check interval is the sweet spot for most scenarios. 30 seconds only makes sense for critical e-commerce; 10+ minutes is too rare.
- The "3 consecutive failures before alerting" rule kills most false positives and doesn't miss real incidents.
- Two-level alerting (warning + critical on different channels) is hygiene, not overkill.
- An SSL check with a 30-day threshold is the cheapest and most useful thing that will one day save you from scary browser warnings shown to clients.
The next layer of OS independence to move to after external monitoring is set up is out-of-band management via IPMI, iDRAC, or iLO. These are hardware interfaces on the motherboard that work independently of the main OS: they let you see the console even when the server has "died," reboot it, mount ISO images remotely. It's a big topic and deserves a separate article.
🚀 Monitoring infrastructure from Hostiserver
External monitoring needs separate infrastructure — a separate VPS in a different region, possibly several. Hostiserver provides exactly those: affordable KVM VPS for probe nodes and full dedicated servers for the primary production stack.
💻 Cloud (VPS) Hosting
- From $19.95/mo, KVM isolation, dedicated vCPU and RAM
- Ready OS images: Ubuntu 24.04, Debian 12, Rocky Linux 9 out of the box
- Separate locations, ideal for hosting probe nodes outside the primary DC
- API access for automating monitoring rollout via Terraform
- 24/7 DevOps support: engineers will help set up Prometheus, Gatus, Grafana
🖥️ Dedicated Servers
- From $90/mo, a physical server with full hardware control
- Separate DCs for distributed infrastructure
- Out-of-band access via IPMI/iDRAC/iLO for full OS independence
- 99.9% uptime SLA guaranteed in the contract
Frequently asked questions
- How many monitoring locations do you actually need for a typical business site?
For a typical business site with an audience in one country: 2-3 locations, at least one in your target region (to see the real latency for your customers) and at least one in a different region (to catch regional network problems). For global e-commerce or SaaS: 5-10 locations across different continents. Beyond that it's marketing bragging like "230 probe points" — in real operation the difference between 10 and 200 points is invisible.
- Is it worth monitoring from different ISPs?
Yes, especially if your audience is regionally concentrated. Big ISPs occasionally have routing issues that only affect their own customers. If all your monitoring points sit inside the network of one cloud provider (AWS, Hetzner, OVH), you'll see "everything's fine here" while real users on a specific ISP can't get in. Good SaaS services deliberately place their probe points in different autonomous systems (ASes).
- How do you set up monitoring for a private API behind VPN or firewall?
There are several options. First: put a probe node inside the private network (for example, on a bastion host) and give it egress for alerts. Second: create a separate health check endpoint that doesn't require auth but returns 200 OK only if the real business logic is working (for example, checks the database connection). Third: for SaaS monitoring, some services (Better Stack, Datadog) give you outbound IP addresses that you can add to your firewall's whitelist.
An important nuance about IP whitelisting: the probe-address lists of large SaaS platforms are constantly changing — vendors add new locations, retire old ones, do A/B routing. If you add these IPs to your firewall manually, your monitoring will sooner or later break when the vendor launches a new probe node with a new address. The correct way is dynamic import of the list via API or a regularly published JSON/TXT file (most vendors provide one) and updating the whitelist automatically.
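The automatic update can be sketched as a diff between the current whitelist and the vendor's published list. The code below assumes the vendor publishes a plain-text file with one IP or CIDR per line; the function names are hypothetical, and fetching the file and applying the changes to a firewall are deliberately left out because they depend on the vendor and the firewall.

```python
import ipaddress


def parse_networks(text: str) -> set:
    """Parse one IP or CIDR per line; blank lines and '#' comments are ignored."""
    networks = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            # A bare address becomes a /32 (IPv4) or /128 (IPv6) network
            networks.add(ipaddress.ip_network(entry, strict=False))
    return networks


def plan_update(current: set, published: set):
    """Return (to_add, to_remove) so the whitelist matches the published list."""
    to_add = sorted(published - current, key=str)
    to_remove = sorted(current - published, key=str)
    return to_add, to_remove
```

Run on a schedule, this yields the exact rules to add and to drop. It is worth alerting if the published list suddenly shrinks to nothing, so that a broken download does not wipe the whitelist.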
- What's better for a status page: build your own or buy a ready one?
If you already have Gatus or Uptime Kuma — the status page is free out of the box. If you're on the Prometheus stack — a Grafana public dashboard also works. Buying a separate product (Statuspage.io, Status.io, Better Stack Status Pages) makes sense when you need tight integration with incident management (client subscriptions, incident history, post-mortem publication), or you want full brand control. For most projects the self-hosted option is enough.
- Does it make sense to monitor DNS separately?
Yes, especially if you depend on a specific DNS provider (Route 53, NS1, your registrar). A DNS provider can have a problem in a particular region, and your site becomes unreachable for part of your users while monitoring from a different region says "all OK." Simple checks: does the domain resolve, does it resolve correctly (comparison to the expected IP). Tools like Better Stack and Gatus have this out of the box.
- How much does it cost to build a decent monitoring setup for a small business?
A hybrid setup for a small e-commerce site looks roughly like this: free-tier UptimeRobot (up to 50 monitors, $0) + one VPS with Gatus at a different provider ($5-10/mo) + a Telegram bot for alerts ($0) = $5-10/mo. If you want SaaS-only: Better Stack starter tier ($24/mo) or UptimeRobot Pro ($7/mo) covers the needs of a typical shop. For a medium business with 50+ endpoints and 24/7 on-call: $50-200/mo depending on depth. For enterprise with SLA and regular audits: $500+/mo and your own team.