Ayoub Zakaria | DevOps · Cloud

Why Build Your Own Monitoring?

Commercial monitoring solutions get expensive at scale, since many are priced per host. For our 400+ machine deployment, we needed:

Infrastructure metrics (CPU, memory, disk, network)
Application metrics (request rates, error rates, latencies)
SSL certificate expiry monitoring
Custom business metrics
Alerting that actually works

The Architecture

text

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Targets   │────▶│ Prometheus  │────▶│   Grafana   │
│  (Exporters)│     │  (Metrics)  │     │ (Dashboards)│
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │ Alertmanager│
                    │  (Alerts)   │
                    └─────────────┘
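
One way to wire these components together is a Docker Compose file. This is a minimal sketch, not necessarily the layout used in our deployment: the use of Compose and the local file paths under `./` are assumptions, while the service names match the hostnames that appear in the Prometheus configuration in Step 1.

```yaml
# docker-compose.yml (illustrative sketch; local paths are assumptions)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts:/etc/prometheus/alerts:ro   # matched by rule_files: "alerts/*.yml"
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml:ro

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

Node Exporter is left out of this sketch because it runs on each monitored host rather than next to Prometheus.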

Step 1: Prometheus Configuration

yaml

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://api.example.com
        - https://app.example.com
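    relabel_configs:
      # Pass the original target URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label so alerts name the endpoint
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the Blackbox Exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115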
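
A single static target is fine for a demo, but listing hundreds of machines inline gets unwieldy. One common option is Prometheus's file-based service discovery, sketched below. The `targets/node/*.yml` path, the example hostnames, and the `env` label are all hypothetical, and this job would replace the static `node` job rather than sit alongside it.

```yaml
# prometheus.yml (fragment): a file_sd variant of the 'node' job
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files:
          - 'targets/node/*.yml'   # hypothetical path, relative to prometheus.yml
        refresh_interval: 1m       # how often the files are re-read
```

```yaml
# targets/node/web.yml (hypothetical target file)
- targets:
    - 'web-01.example.com:9100'
    - 'web-02.example.com:9100'
  labels:
    env: production
```

With this layout, adding or removing a machine means editing a target file. Prometheus picks up the change without a restart.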

Step 2: Essential Exporters

Node Exporter (System Metrics)

bash

docker run -d --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host

Blackbox Exporter (Endpoint Monitoring)

yaml

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      follow_redirects: true
      preferred_ip_protocol: "ip4"
  
  ssl_expiry:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200, 301, 302]
      fail_if_not_ssl: true  # Fail the probe if the endpoint is not served over TLS
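
The `ssl_expiry` module only takes effect if a scrape job asks for it by name. A minimal sketch of such a job follows. The job name `blackbox-ssl` and the one-minute interval are assumptions, and the targets are the same example URLs used in Step 1.

```yaml
# prometheus.yml (fragment): probe certificates via the ssl_expiry module
scrape_configs:
  - job_name: 'blackbox-ssl'        # hypothetical job name
    metrics_path: /probe
    scrape_interval: 1m             # assumption: expiry changes slowly, so probe less often
    params:
      module: [ssl_expiry]
    static_configs:
      - targets:
          - https://api.example.com
          - https://app.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Any HTTPS probe made with the `http` prober reports `probe_ssl_earliest_cert_expiry`, so a dedicated job like this mainly helps when certificate checks need different targets or timing than the availability checks.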

Step 3: Alerting Rules

yaml

# alerts/node.yml
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          
      - alert: DiskSpaceLow
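        # Exclude pseudo-filesystems so tmpfs and overlay mounts do not trigger the alert
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space < 10% on {{ $labels.instance }} ({{ $labels.mountpoint }})"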
          
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expires in < 14 days: {{ $labels.instance }}"
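
These rules only produce notifications once Alertmanager knows where to send them. The following is a minimal routing sketch, assuming a generic webhook receiver. The receiver names and webhook URLs are hypothetical placeholders, and the timing values are starting points rather than tuned settings. Routing is based on the `severity` label set by the rules above.

```yaml
# alertmanager.yml (illustrative sketch; receivers and URLs are hypothetical)
route:
  receiver: default              # fallback for anything not matched below
  group_by: ['alertname', 'instance']
  group_wait: 30s                # wait briefly to batch alerts that fire together
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall
      repeat_interval: 1h

receivers:
  - name: default
    webhook_configs:
      - url: 'http://alert-hook.example.com/default'   # hypothetical endpoint
  - name: oncall
    webhook_configs:
      - url: 'http://alert-hook.example.com/oncall'    # hypothetical endpoint

inhibit_rules:
  # Suppress warnings for an instance that already has a critical alert
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']
```

Grouping and inhibition go a long way toward keeping alerts actionable: one incident produces one notification instead of a flood.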

The Dashboard

Key panels every DevOps dashboard needs:

System Overview: CPU, Memory, Disk, Network
Service Health: Up/Down status for all endpoints
SSL Expiry: Days until certificate renewal
Error Rates: 5xx errors over time
Response Times: P50, P95, P99 latencies
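
The error-rate and latency panels depend on how the applications are instrumented. As a sketch, assume services expose a counter named `http_requests_total` with a `status` label and a histogram named `http_request_duration_seconds`. Both names are hypothetical, chosen to follow common Prometheus conventions. Recording rules like these could then precompute the series those panels plot:

```yaml
# alerts/app_recording.yml (hypothetical file; metric names are assumptions)
groups:
  - name: app_dashboard
    rules:
      # Fraction of requests answered with a 5xx status, per job
      - record: job:http_requests_5xx:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # P50 / P95 / P99 request latency in seconds, per job
      - record: job:http_request_duration_seconds:p50_5m
        expr: histogram_quantile(0.50, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

The SSL Expiry panel needs no recording rule: `(probe_ssl_earliest_cert_expiry - time()) / 86400` gives the days remaining directly.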

Results

5-minute mean time to detection (MTTD)
Proactive alerts before users notice issues
Historical data for capacity planning
$0/month in monitoring license fees (self-hosted)

Building a Monitoring Stack: Prometheus + Grafana from Scratch
