⚙️DEVOPS
2024-09-18 · 12 min

Building a Monitoring Stack: Prometheus + Grafana from Scratch

A complete guide to setting up production-grade monitoring with Prometheus, Grafana, and custom exporters for real infrastructure visibility.

#Prometheus #Grafana #Monitoring #Alerting #Observability

Why Build Your Own Monitoring?

Commercial monitoring solutions get expensive at scale. For our deployment of 400+ machines, we needed:

  • Infrastructure metrics (CPU, memory, disk, network)
  • Application metrics (request rates, error rates, latencies)
  • SSL certificate expiry monitoring
  • Custom business metrics
  • Alerting that actually works

The Architecture

text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Targets   │────▶│ Prometheus  │────▶│   Grafana   │
│  (Exporters)│     │  (Metrics)  │     │ (Dashboards)│
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │ Alertmanager│
                    │  (Alerts)   │
                    └─────────────┘
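One way to run the central components in the diagram is Docker Compose. The following is a minimal sketch, not a hardened deployment: the image tags, published ports, and local file names (`alertmanager.yml`, `blackbox.yml`) are assumptions. The service names are chosen to match the hostnames used in `prometheus.yml` in Step 1, and Node Exporter is left out because it runs on each monitored host (Step 2).

```yaml
# docker-compose.yml (illustrative sketch; tags, ports and local paths are assumptions)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      # Default config location inside the prom/prometheus image
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # rule_files paths are resolved relative to the config file's directory
      - ./alerts:/etc/prometheus/alerts:ro
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml:ro

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

Because all four services share the default Compose network, Prometheus can reach `alertmanager:9093` and `blackbox-exporter:9115` by service name, and Grafana can use `http://prometheus:9090` as its data source URL.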

Step 1: Prometheus Configuration

yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://api.example.com
        - https://app.example.com
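    relabel_configs:
      # Pass the original target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label so alerts name the endpoint, not the exporter
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115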

Step 2: Essential Exporters

Node Exporter (System Metrics)

bash
docker run -d --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host
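Note that with `--net="host"` the exporter listens on the host's own address, port 9100, so Prometheus has to scrape each machine by hostname or IP rather than by container name. For a larger fleet, one option is file-based service discovery instead of a long `static_configs` list. The sketch below assumes a hypothetical `targets/node/` directory, made-up hostnames, and an illustrative `role` label; none of these come from the original setup.

```yaml
# prometheus.yml (fragment): discover node exporters from target files
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files:
          - 'targets/node/*.yml'   # hypothetical path, relative to prometheus.yml
        refresh_interval: 1m       # Prometheus also reloads these files when they change
---
# targets/node/web.yml (hypothetical target file, one per group of hosts)
- targets:
    - 'web-01.example.internal:9100'
    - 'web-02.example.internal:9100'
  labels:
    role: web
```

Target files can be generated by configuration management or an inventory script, and adding a machine does not require restarting Prometheus.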

Blackbox Exporter (Endpoint Monitoring)

yaml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      follow_redirects: true
      preferred_ip_protocol: "ip4"
  
  ssl_expiry:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200, 301, 302]
      fail_if_not_ssl: true  # fail the probe when the endpoint is not served over TLS
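The `ssl_expiry` module only takes effect once a scrape job selects it. A possible job for `prometheus.yml` is sketched below; the job name `blackbox-ssl` and the longer scrape interval are assumptions. The HTTP prober exposes `probe_ssl_earliest_cert_expiry` for any HTTPS target, so the `http_2xx` job already reports it too; a separate job is mainly useful when certificate checks should tolerate redirects or run less often.

```yaml
# prometheus.yml (fragment): probe certificates with the ssl_expiry module
scrape_configs:
  - job_name: 'blackbox-ssl'       # hypothetical job name
    scrape_interval: 5m            # assumption: certificates change rarely
    metrics_path: /probe
    params:
      module: [ssl_expiry]
    static_configs:
      - targets:
          - https://api.example.com
          - https://app.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```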

Step 3: Alerting Rules

yaml
# alerts/node.yml
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
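        # rate() averages over the whole window; irate() uses only the last two samples and is too spiky for alerting
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80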
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          
      - alert: DiskSpaceLow
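        # Exclude pseudo-filesystems, which tend to produce noisy or duplicate alerts
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space < 10% on {{ $labels.instance }} ({{ $labels.mountpoint }})"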
          
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expires in < 14 days: {{ $labels.instance }}"

The Dashboard

Key panels every DevOps dashboard needs:

  • System Overview: CPU, Memory, Disk, Network
  • Service Health: Up/Down status for all endpoints
  • SSL Expiry: Days until certificate renewal
  • Error Rates: 5xx errors over time
  • Response Times: P50, P95, P99 latencies
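The panels above map to a handful of PromQL expressions. One option is to precompute them as recording rules so dashboards stay fast; the sketch below is an assumption about how that could look. The rule names are illustrative, and `http_requests_total` (with a `status` label) and `http_request_duration_seconds_bucket` are hypothetical application metrics that depend on how your services are instrumented. Service health needs no rule: the `probe_success` metric from the blackbox exporter can be graphed directly.

```yaml
# alerts/dashboard.yml (illustrative recording rules; names are assumptions)
groups:
  - name: dashboard
    rules:
      # System Overview: CPU and memory utilisation per instance
      - record: instance:node_cpu_utilisation:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_utilisation:percent
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # SSL Expiry: days until the earliest certificate in the chain expires
      - record: instance:probe_ssl_cert_expiry:days
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400

      # Error Rates: share of 5xx responses (hypothetical application metric)
      - record: job:http_requests_5xx:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))

      # Response Times: P95 latency (hypothetical histogram metric)
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

P50 and P99 follow the same pattern with `0.5` and `0.99` as the quantile argument.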

Results

  • 5-minute mean time to detection (MTTD)
  • Proactive alerts before users notice issues
  • Historical data for capacity planning
  • $0/month monitoring costs (self-hosted)

$ echo "Thanks for reading!"


Written by

Ayoub Zakaria

DevOps / Cloud / MLOps Engineer

Building reliable infrastructure for 400+ machines. Sharing real-world DevOps challenges, solutions, and lessons learned.
