2024-10-22 · 10 min read

Docker Swarm in Production: Lessons from 400+ Nodes

Real-world insights from running Docker Swarm at scale, including secrets management, rolling updates, and debugging strategies.

#Docker #Docker Swarm #Containers #Orchestration #CI/CD

Why Docker Swarm?

While Kubernetes gets all the hype, Docker Swarm remains an excellent choice for:

  • Teams already familiar with Docker Compose
  • Simpler deployments without the K8s complexity
  • Resource-constrained environments
  • Quick setup and low operational overhead

Here are the lessons learned from running Swarm across 400+ nodes in production.

Lesson 1: Secrets Management Done Right

Never put secrets in your compose files. Use Docker secrets:

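bash
# Create a secret from stdin; printf avoids the trailing newline that echo appends
printf '%s' "super_secret_password" | docker secret create db_password -

yaml
# Reference the secret in the stack file
services:
  api:
    image: myapp:latest
    secrets:
      - db_password
    environment:
      # The application must read the password from this file
      DB_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    external: true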
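
The compose file above only mounts the secret at /run/secrets/db_password; the application still has to read that file. If the app only understands a plain DB_PASSWORD variable, a small entrypoint helper can bridge the gap. The following is a minimal sketch and not part of the original setup: the file_env name is borrowed from a convention seen in several official Docker image entrypoints, and the temporary file merely stands in for the mounted secret so the snippet runs anywhere.

```shell
#!/bin/sh
# file_env VAR: if VAR_FILE names a readable file, export its contents as VAR.
# (Helper name and behavior are illustrative assumptions, not from the article.)
file_env() {
    fe_var="$1"
    # Look up the value of ${VAR}_FILE indirectly (POSIX sh has no ${!name}).
    eval "fe_path=\${${fe_var}_FILE:-}"
    if [ -n "$fe_path" ] && [ -r "$fe_path" ]; then
        # Command substitution strips any trailing newline from the file.
        fe_value="$(cat "$fe_path")"
        export "$fe_var=$fe_value"
    fi
}

# Demo: a temp file stands in for /run/secrets/db_password.
secret_file="$(mktemp)"
printf 'super_secret_password\n' > "$secret_file"
DB_PASSWORD_FILE="$secret_file"

file_env DB_PASSWORD
echo "DB_PASSWORD loaded (${#DB_PASSWORD} characters)"
rm -f "$secret_file"
```

In a real image, the entrypoint would call file_env DB_PASSWORD and then finish with exec "$@" so the application starts with the variable already set.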

Lesson 2: Health Checks Are Critical

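Without a health check, Swarm treats any running container as ready, so it can route traffic to a task that has not finished starting and cannot replace one that has hung. The example below assumes curl is installed in the image: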

yaml
services:
  api:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
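
These settings also determine how long a broken container can keep receiving traffic. Docker marks a container unhealthy only after the configured number of consecutive failed checks, so a rough upper bound can be worked out. The arithmetic below is a back-of-the-envelope sketch that ignores start_period and scheduling jitter:

```shell
#!/bin/sh
# Values copied from the healthcheck block above (in seconds).
interval=30
timeout=10
retries=3

# Each failing check can start up to `interval` after the previous one ends
# and may run for up to `timeout` before it counts as a failure.
worst_case=$(( retries * (interval + timeout) ))
echo "Unhealthy after at most ~${worst_case}s of continuous failure"
```

Lowering interval or retries shortens this window, at the cost of more probe traffic and a higher chance of restarting a container over a brief hiccup.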

Lesson 3: Rolling Updates Strategy

Control how Swarm replaces tasks so that an update never takes every replica down at once and a failed release rolls back automatically:

yaml
services:
  api:
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 30s
        failure_action: rollback
        order: start-first
      rollback_config:
        parallelism: 1
        delay: 10s
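
To get a feel for what this configuration implies, the following sketch estimates the number of update batches, the minimum time spent waiting between them, and the peak task count. It is simple arithmetic on the values above, assumes the delay applies only between batches, and leaves out task startup and health check time, so a real rollout will take longer:

```shell
#!/bin/sh
# Values copied from the deploy block above.
replicas=3
parallelism=1
delay=30   # seconds between batches

# Number of batches, rounding up.
batches=$(( (replicas + parallelism - 1) / parallelism ))
# The delay applies between batches, so there is one fewer wait than batches.
min_wait=$(( (batches - 1) * delay ))
# With order: start-first, new tasks start before the old ones stop.
peak_tasks=$(( replicas + parallelism ))

echo "batches=$batches min_wait=${min_wait}s peak_tasks=$peak_tasks"
```

The peak task count matters for capacity planning: with start-first, the cluster needs enough headroom to run one extra replica for the duration of the update.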

Lesson 4: Debugging in Swarm

When things go wrong:

bash
# Check service status
docker service ps myapp --no-trunc

# View logs across all replicas
docker service logs -f myapp

# Inspect why tasks are failing
docker inspect $(docker service ps -q myapp | head -1)
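
On a service with many replicas, the task list gets long, and it helps to filter it down to tasks that actually report an error. The sketch below assumes the task list was produced with a custom --format string using the Name, CurrentState, and Error placeholders of docker service ps; the sample lines are made up for illustration so the filter can be run without a cluster, and failed_tasks is a hypothetical helper name.

```shell
#!/bin/sh
# Made-up sample in the shape produced by:
#   docker service ps myapp --no-trunc \
#     --format '{{.Name}}|{{.CurrentState}}|{{.Error}}'
sample='myapp.1|Running 2 hours ago|
myapp.2|Failed 5 minutes ago|task: non-zero exit (137)
myapp.3|Running 2 hours ago|'

# Print "name: error" for every task whose error field is non-empty.
failed_tasks() {
    awk -F'|' '$3 != "" { print $1 ": " $3 }'
}

failed="$(printf '%s\n' "$sample" | failed_tasks)"
echo "$failed"
```

Against a live cluster, the same filter would be fed by piping the docker service ps command from the comment into failed_tasks instead of the sample text.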

Lesson 5: Resource Limits

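Always set limits so that one runaway container cannot starve its neighbors, and set reservations so the scheduler knows how much capacity each task needs: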

yaml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
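
Because the scheduler places tasks based on reservations rather than limits, the reservation values decide how many replicas fit on a node. The sketch below works this out for a hypothetical node with 4 CPUs and 8 GiB of RAM; the node size is an assumption chosen for illustration, and the calculation ignores other services and system overhead.

```shell
#!/bin/sh
# Reservations from the deploy block above, in millicores and MiB.
reserve_mcpu=250    # cpus: '0.25'
reserve_mem=256     # memory: 256M

# Hypothetical node size (an assumption for illustration): 4 CPUs, 8 GiB RAM.
node_mcpu=4000
node_mem=8192

by_cpu=$(( node_mcpu / reserve_mcpu ))
by_mem=$(( node_mem / reserve_mem ))

# The scarcer resource decides how many tasks fit.
if [ "$by_cpu" -lt "$by_mem" ]; then
    max_tasks=$by_cpu
else
    max_tasks=$by_mem
fi

echo "by_cpu=$by_cpu by_mem=$by_mem max_tasks=$max_tasks"
```

Here CPU is the binding constraint, so raising the memory reservation up to 512M would not reduce how many tasks this node can hold.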

The Results

  • Zero-downtime deployments with rolling updates
  • Automatic recovery from node failures
  • Simplified operations compared to K8s
  • Happy developers with familiar tooling

Written by

Ayoub Zakaria

DevOps / Cloud / MLOps Engineer

Building reliable infrastructure for 400+ machines. Sharing real-world DevOps challenges, solutions, and lessons learned.