Why Build Your Own Monitoring?
Commercial monitoring solutions are expensive. For our 400+ machine deployment, we needed:
- Infrastructure metrics (CPU, memory, disk, network)
- Application metrics (request rates, error rates, latencies)
- SSL certificate expiry monitoring
- Custom business metrics
- Alerting that actually works
The Architecture
bash
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Targets │────▶│ Prometheus │────▶│ Grafana │
│ (Exporters)│ │ (Metrics) │ │ (Dashboards)│
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Alertmanager│
│ (Alerts) │
└─────────────┘Step 1: Prometheus Configuration
yaml
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- "alerts/*.yml"
scrape_configs:
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://api.example.com
- https://app.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115Step 2: Essential Exporters
Node Exporter (System Metrics)
bash
docker run -d --name node-exporter \
--net="host" \
--pid="host" \
-v "/:/host:ro,rslave" \
prom/node-exporter:latest \
--path.rootfs=/hostBlackbox Exporter (Endpoint Monitoring)
yaml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [] # Defaults to 2xx
follow_redirects: true
preferred_ip_protocol: "ip4"
ssl_expiry:
prober: http
timeout: 5s
http:
valid_status_codes: [200, 301, 302]
fail_if_ssl: falseStep 3: Alerting Rules
yaml
# alerts/node.yml
groups:
- name: node
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space < 10% on {{ $labels.instance }}"
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
for: 1h
labels:
severity: warning
annotations:
summary: "SSL cert expires in < 14 days: {{ $labels.instance }}"The Dashboard
Key panels every DevOps dashboard needs:
- System Overview: CPU, Memory, Disk, Network
- Service Health: Up/Down status for all endpoints
- SSL Expiry: Days until certificate renewal
- Error Rates: 5xx errors over time
- Response Times: P50, P95, P99 latencies
Results
- 5-minute mean time to detection (MTTD)
- Proactive alerts before users notice issues
- Historical data for capacity planning
- $0/month monitoring costs (self-hosted)