Chapter 7: Container Monitoring and Logging

Haiyue

Learning Objectives
  • Master container performance monitoring methods and tools
  • Learn to collect and analyze container logs
  • Understand container health check mechanisms
  • Proficiently use monitoring tools for troubleshooting

Knowledge Points

Importance of Monitoring

Container monitoring is a critical component for ensuring stable application operations. An effective monitoring system should include:

Monitoring Dimension    | Monitoring Metrics                         | Tool Examples
------------------------|--------------------------------------------|---------------------------
Resource Usage          | CPU, Memory, Disk, Network                 | docker stats, cAdvisor
Application Performance | Response Time, Throughput, Error Rate      | APM Tools, Custom Metrics
Log Analysis            | Application Logs, System Logs, Error Logs  | ELK Stack, Fluentd
Health Status           | Service Availability, Dependency Checks    | Health Check, Probe
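
For a quick first pass over several of these dimensions on a single host, the Docker CLI alone already goes a long way. A minimal sketch, assuming a running container named web-server (the name is illustrative):

# Resource usage snapshot
docker stats --no-stream web-server

# Recent application log output
docker logs --tail 50 web-server

# Health status (only works if the container defines a HEALTHCHECK)
docker inspect --format '{{.State.Health.Status}}' web-server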

Monitoring Architecture Patterns

Container Monitoring Architecture:
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ App Container │    │ App Container │    │ App Container │
└───────────────┘    └───────────────┘    └───────────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             │
                    ┌──────────────────┐
                    │ Monitoring Agent │  (cAdvisor, Node Exporter)
                    └──────────────────┘
                             │
                   ┌───────────────────┐
                   │ Monitoring System │  (Prometheus, Grafana)
                   └───────────────────┘
                             │
                    ┌─────────────────┐
                    │ Alerting System │  (AlertManager, PagerDuty)
                    └─────────────────┘

Docker Built-in Monitoring

docker stats Command

# Monitor all containers in real-time
docker stats

# Monitor specific containers
docker stats web-server api-server

# Formatted output
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"

# One-time output (no continuous refresh)
docker stats --no-stream

# Display all containers (including stopped ones)
docker stats --all

# Output to file
docker stats --format "{{.Name}},{{.CPUPerc}},{{.MemUsage}}" --no-stream > container_stats.csv
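
Because --format produces machine-readable output, docker stats can feed simple shell-based checks without any extra tooling. A rough sketch that flags containers above a CPU threshold (the 80% cutoff and the awk parsing are illustrative choices, not part of Docker itself):

# Flag containers whose CPU usage exceeds 80% (single snapshot)
docker stats --no-stream --format "{{.Name}} {{.CPUPerc}}" | \
    awk '{ gsub(/%/, "", $2); if ($2 + 0 > 80) print "HIGH CPU:", $1, $2 "%" }'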

System Event Monitoring

# Monitor Docker events in real-time
docker events

# Filter events for specific container
docker events --filter container=web-server

# Filter by event type
docker events --filter event=start
docker events --filter event=die
docker events --filter event=restart

# Filter by time range
docker events --since="2023-01-01"
docker events --until="2023-01-02"

# Format event output
docker events --format 'Time={{.Time}} Action={{.Action}} Container={{.Actor.Attributes.name}}'

# Save events from the last hour to a file (streams until interrupted with Ctrl+C)
docker events --since="1h" > docker_events.log
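
The event stream can also drive lightweight automation. A hedged sketch that records every container exit together with its exit code (the log file path is an arbitrary choice; the command streams until interrupted):

# Append every container exit to a log file, including the exit code
docker events --filter event=die \
    --format '{{.Time}} {{.Actor.Attributes.name}} exitCode={{.Actor.Attributes.exitCode}}' \
    >> /var/log/container-exits.log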

Container Resource Limit Monitoring

# View container resource limits
docker inspect container-name | jq '.[0].HostConfig | {Memory, CpuShares, CpuQuota, CpuPeriod}'

# Run container with resource limits
docker run -d --name limited-container \
    --memory=512m \
    --cpus=0.5 \
    --memory-swap=1g \
    nginx:alpine

# Monitor resource usage against limits
docker stats limited-container
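
To see how close a container is to its memory limit without a full monitoring stack, the configured limit and the live usage can be read side by side. A small sketch using the limited-container started above:

# Configured memory limit in bytes
docker inspect --format '{{.HostConfig.Memory}}' limited-container

# Current usage and percentage of the limit
docker stats --no-stream --format '{{.MemUsage}} ({{.MemPerc}})' limited-container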

Advanced Monitoring Solutions

cAdvisor Monitoring

# Run cAdvisor container
docker run -d \
    --name=cadvisor \
    --restart=unless-stopped \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=8080:8080 \
    --detach=true \
    gcr.io/cadvisor/cadvisor:latest

# Access cAdvisor Web interface
# http://localhost:8080
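
Besides the web UI, cAdvisor exposes Prometheus-format metrics on the same port, which is what the Prometheus setup later in this chapter scrapes. A quick smoke test:

# Confirm cAdvisor is serving Prometheus metrics
curl -s http://localhost:8080/metrics | grep -m 5 '^container_cpu_usage_seconds_total'

# Liveness endpoint (should return "ok")
curl -s http://localhost:8080/healthz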

Prometheus + Grafana Monitoring Stack

# docker-compose.monitoring.yml
version: '3.8'

services:
  # Prometheus monitoring system
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  # Grafana visualization
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - monitoring

  # Node Exporter (host metrics)
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  # cAdvisor (container metrics)
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  # AlertManager (alert management)
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge

Prometheus configuration file:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        port: 8080
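
After starting the stack, it is worth confirming that Prometheus actually discovers and scrapes its targets. A minimal check using the Prometheus HTTP API (assumes jq is installed on the host):

# Start the monitoring stack
docker compose -f docker-compose.monitoring.yml up -d

# List scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | \
    jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)"'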

Custom Application Metrics

# Python Flask application integrated with Prometheus
from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

@app.before_request
def before_request():
    # Store the start time on flask.g so concurrent requests don't interfere
    g.start_time = time.time()

@app.after_request
def after_request(response):
    request_latency = time.time() - g.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.observe(request_latency)
    return response

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/')
def home():
    return {'message': 'Hello World', 'timestamp': time.time()}

@app.route('/health')
def health():
    return {'status': 'healthy'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
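
To try the instrumented application outside a container, it can be run directly and exercised with curl. A quick sketch, assuming the code above is saved as app.py:

# Install dependencies and start the app
pip install flask prometheus_client
python app.py &

# Generate a request, then read the exposed metrics
curl -s http://localhost:5000/
curl -s http://localhost:5000/metrics | grep http_requests_total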

Log Management

Docker Logging Drivers

# Check current logging driver
docker info | grep "Logging Driver"

# Configure global logging driver
sudo tee /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
EOF

# Restart Docker service
sudo systemctl restart docker

# Set logging driver for specific container
docker run -d --name app \
    --log-driver json-file \
    --log-opt max-size=10m \
    --log-opt max-file=3 \
    myapp:latest

# Use syslog driver
docker run -d --name app \
    --log-driver syslog \
    --log-opt syslog-address=tcp://192.168.1.100:514 \
    --log-opt tag="{{.Name}}" \
    myapp:latest
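
Whichever driver is configured, day-to-day inspection usually starts with docker logs (which reads from the json-file and local drivers; remote drivers such as syslog ship logs away from the host). Common invocations:

# Show which logging driver a container is using
docker inspect --format '{{.HostConfig.LogConfig.Type}}' app

# Tail and follow the log stream
docker logs --tail 100 -f app

# Only entries from the last 10 minutes, with timestamps
docker logs --since 10m --timestamps app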

ELK Stack Log System

# docker-compose.elk.yml
version: '3.8'

services:
  # Elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    networks:
      - elk

  # Logstash
  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    container_name: logstash
    ports:
      - "5044:5044"
      - "9600:9600"
    volumes:
      - ./logstash/config:/usr/share/logstash/pipeline:ro
      - ./logstash/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
    depends_on:
      - elasticsearch
    networks:
      - elk

  # Kibana
  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
    networks:
      - elk

  # Filebeat (log collector)
  filebeat:
    image: docker.elastic.co/beats/filebeat:7.15.0
    container_name: filebeat
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - logstash
    networks:
      - elk

  # Sample application
  web-app:
    image: nginx:alpine
    container_name: web-app
    ports:
      - "8080:80"
    labels:
      - "co.elastic.logs/enabled=true"
      - "co.elastic.logs/module=nginx"
    networks:
      - elk

volumes:
  elasticsearch_data:

networks:
  elk:
    driver: bridge

Filebeat configuration:

# filebeat/filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  processors:
    - add_docker_metadata:
        host: "unix:///var/run/docker.sock"

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
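
Once the stack is running, a quick end-to-end check is to generate some nginx traffic and confirm that indices appear in Elasticsearch (this assumes the mounted Logstash pipeline forwards events to Elasticsearch):

# Start the logging stack
docker compose -f docker-compose.elk.yml up -d

# Generate a few access-log entries
for i in $(seq 1 5); do curl -s -o /dev/null http://localhost:8080/; done

# Check that indices are being created
curl -s 'http://localhost:9200/_cat/indices?v'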

Structured Application Logging

# Python application structured logging
import logging
import json
import sys
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'level': record.levelname,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }

        # Add extra fields
        if hasattr(record, 'user_id'):
            log_entry['user_id'] = record.user_id
        if hasattr(record, 'request_id'):
            log_entry['request_id'] = record.request_id

        return json.dumps(log_entry)

# Configure logging: send JSON-formatted records to stdout
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())

logging.basicConfig(level=logging.INFO, handlers=[handler])

logger = logging.getLogger(__name__)

# Usage example
logger.info("User login successful", extra={'user_id': 123, 'request_id': 'req-456'})
logger.error("Database connection failed", extra={'error_code': 'DB001'})
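
Because every log line is a single JSON object on stdout, the output pairs well with docker logs and jq. An illustrative sketch, assuming the application runs in a container named myapp and emits only JSON lines:

# Pretty-print structured log entries
docker logs myapp 2>&1 | jq .

# Show only error-level entries
docker logs myapp 2>&1 | jq -r 'select(.level == "ERROR") | "\(.timestamp) \(.message)"'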

Health Checks

Docker Health Checks

# Define health check in Dockerfile
FROM nginx:alpine

# Install curl for health checks
RUN apk add --no-cache curl

# Define health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost/ || exit 1

# Copy a custom health check script
# Note: only the last HEALTHCHECK instruction in a Dockerfile takes effect,
# so this script-based check replaces the simple curl check above
COPY health-check.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/health-check.sh

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD /usr/local/bin/health-check.sh

The referenced health-check.sh script (pg_isready and redis-cli must be present in the image for the database and Redis checks to work):

#!/bin/bash
# health-check.sh
set -e

# Check web service
if ! curl -f http://localhost/ >/dev/null 2>&1; then
    echo "Web service health check failed"
    exit 1
fi

# Check database connection
if ! pg_isready -h db -p 5432 >/dev/null 2>&1; then
    echo "Database health check failed"
    exit 1
fi

# Check Redis connection
if ! redis-cli -h redis ping | grep -q "PONG"; then
    echo "Redis health check failed"
    exit 1
fi

echo "All health checks passed"
exit 0

Health Checks in Compose

version: '3.8'

services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 1m30s
      timeout: 10s
      retries: 3
      start_period: 40s

  api:
    build: ./api
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      database:
        condition: service_healthy

  database:
    image: postgres:13
    environment:
      POSTGRES_DB: mydb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
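
Compose surfaces these health checks in its own status output, and recent Compose v2 releases can block until services report healthy. A brief sketch:

# Show service status, including health ("healthy", "unhealthy", "starting")
docker compose ps

# Start the stack and wait until health checks pass (Compose v2 --wait flag)
docker compose up -d --wait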

Health Check Monitoring

# View container health status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Check detailed health status
docker inspect --format='{{.State.Health.Status}}' container-name
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' container-name

# Display only unhealthy containers
docker ps --filter health=unhealthy
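
The health filter also lends itself to simple self-healing scripts, for example run from cron. An illustrative sketch (restarting blindly is not always the right response, so treat this as a starting point):

# Restart every container currently reported as unhealthy
for c in $(docker ps --filter health=unhealthy --format '{{.Names}}'); do
    echo "Restarting unhealthy container: $c"
    docker restart "$c"
done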

Practical Monitoring Case Study

Complete Monitoring Solution

# Create monitoring project
mkdir docker-monitoring
cd docker-monitoring

# Create directory structure
mkdir -p {prometheus/{rules,targets},grafana/{dashboards,provisioning/{dashboards,datasources}},alertmanager}

# docker-compose.monitoring.yml
version: '3.8'

services:
  # Application service
  web-app:
    build: ./app
    container_name: web-app
    ports:
      - "8000:8000"
    environment:
      - PROMETHEUS_METRICS=true
    labels:
      - "prometheus.scrape=true"
      - "prometheus.port=8000"
      - "prometheus.path=/metrics"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - app
      - monitoring

  # Database
  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: apppass
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - app

  # Monitoring stack
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.route-prefix=/'
      - '--web.external-url=http://localhost:9090'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

  # Exporters
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    networks:
      - monitoring

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  app:
    driver: bridge
  monitoring:
    driver: bridge

Alert rules configuration:

# prometheus/rules/alerts.yml
groups:
- name: container.rules
  rules:
  - alert: ContainerDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.instance }} is down"
      description: "Container {{ $labels.instance }} has been down for more than 1 minute."

  - alert: HighCPUUsage
    expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.name }}"
      description: "Container {{ $labels.name }} is using more than 80% of one CPU core (current: {{ $value | humanizePercentage }})."

  - alert: HighMemoryUsage
    expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.name }}"
      description: "Container {{ $labels.name }} is using more than 80% of its memory limit (current: {{ $value | humanizePercentage }})."

  - alert: ContainerUnhealthy
    # Note: this metric is not exported by cAdvisor or Node Exporter out of the box;
    # it assumes an exporter that exposes Docker health-check results.
    expr: container_healthcheck_failures_total > 3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} is unhealthy"
      description: "Container {{ $labels.name }} has failed health checks {{ $value }} times."
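
Before reloading Prometheus, rule files can be validated offline with promtool, which ships inside the Prometheus image. A minimal sketch run from the project root:

# Validate the alert rules without starting Prometheus
docker run --rm \
    -v "$(pwd)/prometheus/rules:/rules:ro" \
    --entrypoint promtool \
    prom/prometheus:latest check rules /rules/alerts.yml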

Monitoring Best Practices
  1. Four Golden Signals: Monitor latency, traffic, errors, and saturation
  2. Layered Monitoring: Infrastructure, platform, application, and business levels
  3. Alerting Strategy: Avoid alert fatigue, set reasonable thresholds and suppression rules
  4. Structured Logs: Use structured logging for easier analysis and querying
  5. Monitoring as Code: Include monitoring configuration in version control

Important Notes
  • The monitoring system itself also needs monitoring to avoid single points of failure
  • Log storage needs to account for disk space and retention policies
  • Sensitive information should never appear in logs
  • Monitoring data should be transmitted securely and protected by access controls

Summary

Through this chapter, you should have mastered:

  • Basic Monitoring: Using Docker built-in tools for container monitoring
  • Advanced Monitoring: Deploying Prometheus+Grafana monitoring stack
  • Log Management: Using ELK Stack for log collection and analysis
  • Health Checks: Configuring container health check mechanisms
  • Monitoring Practices: Establishing a complete monitoring and alerting system

In the next chapter, we will learn about Docker security and best practices to ensure the secure operation of containerized applications.