Chapter 7: Container Monitoring and Logging

Haiyue

Learning Objectives
  • Master container performance monitoring methods and tools
  • Learn to collect and analyze container logs
  • Understand container health check mechanisms
  • Proficiently use monitoring tools for troubleshooting

Knowledge Points

Importance of Monitoring

Container monitoring is a critical component for ensuring stable application operations. An effective monitoring system should include:

Monitoring Dimension    | Monitoring Metrics                         | Tool Examples
------------------------|--------------------------------------------|---------------------------
Resource Usage          | CPU, Memory, Disk, Network                 | docker stats, cAdvisor
Application Performance | Response Time, Throughput, Error Rate      | APM Tools, Custom Metrics
Log Analysis            | Application Logs, System Logs, Error Logs  | ELK Stack, Fluentd
Health Status           | Service Availability, Dependency Checks    | Health Check, Probe
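
For a quick first pass over several of these dimensions on a single host, the Docker CLI alone already goes a long way. A minimal sketch, assuming a running container named web-server (the name is illustrative):

# Resource usage snapshot
docker stats --no-stream web-server

# Recent application log output
docker logs --tail 50 web-server

# Health status (only works if the container defines a HEALTHCHECK)
docker inspect --format '{{.State.Health.Status}}' web-server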

Monitoring Architecture Patterns

Container Monitoring Architecture:
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ App Container │    │ App Container │    │ App Container │
└───────────────┘    └───────────────┘    └───────────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             │
                    ┌──────────────────┐
                    │ Monitoring Agent │  (cAdvisor, Node Exporter)
                    └──────────────────┘
                             │
                   ┌───────────────────┐
                   │ Monitoring System │  (Prometheus, Grafana)
                   └───────────────────┘
                             │
                    ┌─────────────────┐
                    │ Alerting System │  (AlertManager, PagerDuty)
                    └─────────────────┘

Docker Built-in Monitoring

docker stats Command

# Monitor all containers in real-time
docker stats

# Monitor specific containers
docker stats web-server api-server

# Formatted output
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"

# One-time output (no continuous refresh)
docker stats --no-stream

# Display all containers (including stopped ones)
docker stats --all

# Output to file
docker stats --format "{{.Name}},{{.CPUPerc}},{{.MemUsage}}" --no-stream > container_stats.csv
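
Because --format produces machine-readable output, docker stats can feed simple shell-based checks without any extra tooling. A rough sketch that flags containers above a CPU threshold (the 80% cutoff and the awk parsing are illustrative choices, not part of Docker itself):

# Flag containers whose CPU usage exceeds 80% (single snapshot)
docker stats --no-stream --format "{{.Name}} {{.CPUPerc}}" | \
    awk '{ gsub(/%/, "", $2); if ($2 + 0 > 80) print "HIGH CPU:", $1, $2 "%" }'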

System Event Monitoring

# Monitor Docker events in real-time
docker events

# Filter events for specific container
docker events --filter container=web-server

# Filter by event type
docker events --filter event=start
docker events --filter event=die
docker events --filter event=restart

# Filter by time range
docker events --since="2023-01-01"
docker events --until="2023-01-02"

# Format event output
docker events --format 'Time={{.Time}} Action={{.Action}} Container={{.Actor.Attributes.name}}'

# Save events from the last hour to a file (streams until interrupted with Ctrl+C)
docker events --since="1h" > docker_events.log
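
The event stream can also drive lightweight automation. A hedged sketch that records every container exit together with its exit code (the log file path is an arbitrary choice; the command streams until interrupted):

# Append every container exit to a log file, including the exit code
docker events --filter event=die \
    --format '{{.Time}} {{.Actor.Attributes.name}} exitCode={{.Actor.Attributes.exitCode}}' \
    >> /var/log/container-exits.log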

Container Resource Limit Monitoring

# View container resource limits
docker inspect container-name | jq '.[0].HostConfig | {Memory, CpuShares, CpuQuota, CpuPeriod}'

# Run container with resource limits
docker run -d --name limited-container \
    --memory=512m \
    --cpus=0.5 \
    --memory-swap=1g \
    nginx:alpine

# Monitor resource usage against limits
docker stats limited-container
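
To see how close a container is to its memory limit without a full monitoring stack, the configured limit and the live usage can be read side by side. A small sketch using the limited-container started above:

# Configured memory limit in bytes
docker inspect --format '{{.HostConfig.Memory}}' limited-container

# Current usage and percentage of the limit
docker stats --no-stream --format '{{.MemUsage}} ({{.MemPerc}})' limited-container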

Advanced Monitoring Solutions

cAdvisor Monitoring

# Run cAdvisor container
docker run -d \
    --name=cadvisor \
    --restart=unless-stopped \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=8080:8080 \
    --detach=true \
    gcr.io/cadvisor/cadvisor:latest

# Access cAdvisor Web interface
# http://localhost:8080
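
Besides the web UI, cAdvisor exposes Prometheus-format metrics on the same port, which is what the Prometheus setup later in this chapter scrapes. A quick smoke test:

# Confirm cAdvisor is serving Prometheus metrics
curl -s http://localhost:8080/metrics | grep -m 5 '^container_cpu_usage_seconds_total'

# Liveness endpoint (should return "ok")
curl -s http://localhost:8080/healthz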

Prometheus + Grafana Monitoring Stack

# docker-compose.monitoring.yml
version: '3.8'

services:
  # Prometheus monitoring system
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  # Grafana visualization
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - monitoring

  # Node Exporter (host metrics)
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  # cAdvisor (container metrics)
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  # AlertManager (alert management)
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge

Prometheus configuration file:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        port: 8080
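
After starting the stack, it is worth confirming that Prometheus actually discovers and scrapes its targets. A minimal check using the Prometheus HTTP API (assumes jq is installed on the host):

# Start the monitoring stack
docker compose -f docker-compose.monitoring.yml up -d

# List scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | \
    jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)"'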

Custom Application Metrics

# Python Flask application integrated with Prometheus
from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

@app.before_request
def before_request():
    # Store the start time on flask.g so concurrent requests don't interfere
    g.start_time = time.time()

@app.after_request
def after_request(response):
    request_latency = time.time() - g.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.observe(request_latency)
    return response

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/')
def home():
    return {'message': 'Hello World', 'timestamp': time.time()}

@app.route('/health')
def health():
    return {'status': 'healthy'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
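
To try the instrumented application outside a container, it can be run directly and exercised with curl. A quick sketch, assuming the code above is saved as app.py:

# Install dependencies and start the app
pip install flask prometheus_client
python app.py &

# Generate a request, then read the exposed metrics
curl -s http://localhost:5000/
curl -s http://localhost:5000/metrics | grep http_requests_total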

Log Management

Docker Logging Drivers

# Check current logging driver
docker info | grep "Logging Driver"

# Configure global logging driver
sudo tee /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
EOF

# Restart Docker service
sudo systemctl restart docker

# Set logging driver for specific container
docker run -d --name app \
    --log-driver json-file \
    --log-opt max-size=10m \
    --log-opt max-file=3 \
    myapp:latest

# Use syslog driver
docker run -d --name app \
    --log-driver syslog \
    --log-opt syslog-address=tcp://192.168.1.100:514 \
    --log-opt tag="{{.Name}}" \
    myapp:latest
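
Whichever driver is configured, day-to-day inspection usually starts with docker logs (which reads from the json-file and local drivers; remote drivers such as syslog ship logs away from the host). Common invocations:

# Show which logging driver a container is using
docker inspect --format '{{.HostConfig.LogConfig.Type}}' app

# Tail and follow the log stream
docker logs --tail 100 -f app

# Only entries from the last 10 minutes, with timestamps
docker logs --since 10m --timestamps app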

ELK Stack Log System

# docker-compose.elk.yml
version: '3.8'

services:
  # Elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    networks:
      - elk

  # Logstash
  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    container_name: logstash
    ports:
      - "5044:5044"
      - "9600:9600"
    volumes:
      - ./logstash/config:/usr/share/logstash/pipeline:ro
      - ./logstash/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
    depends_on:
      - elasticsearch
    networks:
      - elk

  # Kibana
  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
    networks:
      - elk

  # Filebeat (log collector)
  filebeat:
    image: docker.elastic.co/beats/filebeat:7.15.0
    container_name: filebeat
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - logstash
    networks:
      - elk

  # Sample application
  web-app:
    image: nginx:alpine
    container_name: web-app
    ports:
      - "8080:80"
    labels:
      - "co.elastic.logs/enabled=true"
      - "co.elastic.logs/module=nginx"
    networks:
      - elk

volumes:
  elasticsearch_data:

networks:
  elk:
    driver: bridge

Filebeat configuration:

# filebeat/filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  processors:
    - add_docker_metadata:
        host: "unix:///var/run/docker.sock"

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
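
Once the stack is running, a quick end-to-end check is to generate some nginx traffic and confirm that indices appear in Elasticsearch (this assumes the mounted Logstash pipeline forwards events to Elasticsearch):

# Start the logging stack
docker compose -f docker-compose.elk.yml up -d

# Generate a few access-log entries
for i in $(seq 1 5); do curl -s -o /dev/null http://localhost:8080/; done

# Check that indices are being created
curl -s 'http://localhost:9200/_cat/indices?v'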

Structured Application Logging

# Python application structured logging
import logging
import json
import sys
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'level': record.levelname,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }

        # Add extra fields
        if hasattr(record, 'user_id'):
            log_entry['user_id'] = record.user_id
        if hasattr(record, 'request_id'):
            log_entry['request_id'] = record.request_id

        return json.dumps(log_entry)

# Configure logging: send JSON-formatted records to stdout
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())

logging.basicConfig(level=logging.INFO, handlers=[handler])

logger = logging.getLogger(__name__)

# Usage example
logger.info("User login successful", extra={'user_id': 123, 'request_id': 'req-456'})
logger.error("Database connection failed", extra={'error_code': 'DB001'})
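
Because every log line is a single JSON object on stdout, the output pairs well with docker logs and jq. An illustrative sketch, assuming the application runs in a container named myapp and emits only JSON lines:

# Pretty-print structured log entries
docker logs myapp 2>&1 | jq .

# Show only error-level entries
docker logs myapp 2>&1 | jq -r 'select(.level == "ERROR") | "\(.timestamp) \(.message)"'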

Health Checks

Docker Health Checks

# Define health check in Dockerfile
FROM nginx:alpine

# Install curl for health checks
RUN apk add --no-cache curl

# Define health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost/ || exit 1

# Copy a custom health check script
# Note: only the last HEALTHCHECK instruction in a Dockerfile takes effect,
# so this script-based check replaces the simple curl check above
COPY health-check.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/health-check.sh

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD /usr/local/bin/health-check.sh

The referenced health-check.sh script (pg_isready and redis-cli must be present in the image for the database and Redis checks to work):

#!/bin/bash
# health-check.sh
set -e

# Check web service
if ! curl -f http://localhost/ >/dev/null 2>&1; then
    echo "Web service health check failed"
    exit 1
fi

# Check database connection
if ! pg_isready -h db -p 5432 >/dev/null 2>&1; then
    echo "Database health check failed"
    exit 1
fi

# Check Redis connection
if ! redis-cli -h redis ping | grep -q "PONG"; then
    echo "Redis health check failed"
    exit 1
fi

echo "All health checks passed"
exit 0

Health Checks in Compose

version: '3.8'

services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 1m30s
      timeout: 10s
      retries: 3
      start_period: 40s

  api:
    build: ./api
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      database:
        condition: service_healthy

  database:
    image: postgres:13
    environment:
      POSTGRES_DB: mydb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
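
Compose surfaces these health checks in its own status output, and recent Compose v2 releases can block until services report healthy. A brief sketch:

# Show service status, including health ("healthy", "unhealthy", "starting")
docker compose ps

# Start the stack and wait until health checks pass (Compose v2 --wait flag)
docker compose up -d --wait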

Health Check Monitoring

# View container health status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Check detailed health status
docker inspect --format='{{.State.Health.Status}}' container-name
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' container-name

# Display only unhealthy containers
docker ps --filter health=unhealthy
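
The health filter also lends itself to simple self-healing scripts, for example run from cron. An illustrative sketch (restarting blindly is not always the right response, so treat this as a starting point):

# Restart every container currently reported as unhealthy
for c in $(docker ps --filter health=unhealthy --format '{{.Names}}'); do
    echo "Restarting unhealthy container: $c"
    docker restart "$c"
done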

Practical Monitoring Case Study

Complete Monitoring Solution

# Create monitoring project
mkdir docker-monitoring
cd docker-monitoring

# Create directory structure
mkdir -p {prometheus/{rules,targets},grafana/{dashboards,provisioning/{dashboards,datasources}},alertmanager}

# docker-compose.monitoring.yml
version: '3.8'

services:
  # Application service
  web-app:
    build: ./app
    container_name: web-app
    ports:
      - "8000:8000"
    environment:
      - PROMETHEUS_METRICS=true
    labels:
      - "prometheus.scrape=true"
      - "prometheus.port=8000"
      - "prometheus.path=/metrics"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - app
      - monitoring

  # Database
  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: apppass
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - app

  # Monitoring stack
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.route-prefix=/'
      - '--web.external-url=http://localhost:9090'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

  # Exporters
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    networks:
      - monitoring

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  app:
    driver: bridge
  monitoring:
    driver: bridge

Alert rules configuration:

# prometheus/rules/alerts.yml
groups:
- name: container.rules
  rules:
  - alert: ContainerDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.instance }} is down"
      description: "Container {{ $labels.instance }} has been down for more than 1 minute."

  - alert: HighCPUUsage
    expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.name }}"
      description: "Container {{ $labels.name }} is using more than 80% of one CPU core (current: {{ $value | humanizePercentage }})."

  - alert: HighMemoryUsage
    expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.name }}"
      description: "Container {{ $labels.name }} is using more than 80% of its memory limit (current: {{ $value | humanizePercentage }})."

  - alert: ContainerUnhealthy
    # Note: this metric is not exported by cAdvisor or Node Exporter out of the box;
    # it assumes an exporter that exposes Docker health-check results.
    expr: container_healthcheck_failures_total > 3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} is unhealthy"
      description: "Container {{ $labels.name }} has failed health checks {{ $value }} times."
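
Before reloading Prometheus, rule files can be validated offline with promtool, which ships inside the Prometheus image. A minimal sketch run from the project root:

# Validate the alert rules without starting Prometheus
docker run --rm \
    -v "$(pwd)/prometheus/rules:/rules:ro" \
    --entrypoint promtool \
    prom/prometheus:latest check rules /rules/alerts.yml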

Monitoring Best Practices
  1. Four Golden Signals: Monitor latency, traffic, errors, and saturation
  2. Layered Monitoring: Infrastructure, platform, application, and business levels
  3. Alerting Strategy: Avoid alert fatigue, set reasonable thresholds and suppression rules
  4. Structured Logs: Use structured logging for easier analysis and querying
  5. Monitoring as Code: Include monitoring configuration in version control

Important Notes
  • The monitoring system itself also needs monitoring to avoid single points of failure
  • Log storage needs to account for disk space and retention policies
  • Sensitive information should never appear in logs
  • Monitoring data should be transmitted securely and protected by access controls

Summary

Through this chapter, you should have mastered:

  • Basic Monitoring: Using Docker built-in tools for container monitoring
  • Advanced Monitoring: Deploying Prometheus+Grafana monitoring stack
  • Log Management: Using ELK Stack for log collection and analysis
  • Health Checks: Configuring container health check mechanisms
  • Monitoring Practices: Establishing a complete monitoring and alerting system

In the next chapter, we will learn about Docker security and best practices to ensure the secure operation of containerized applications.