Chapter 7: Container Monitoring and Logging
Learning Objectives
- Master methods and tools for container performance monitoring
- Learn to collect and analyze container logs
- Understand the container health check mechanism
- Become proficient in using monitoring tools for troubleshooting
Knowledge Points
The Importance of Monitoring
Container monitoring is a key part of ensuring application stability. An effective monitoring system should include:
| Monitoring Dimension | Monitoring Metrics | Tool Examples |
|---|---|---|
| Resource Usage | CPU, memory, disk, network | docker stats, cAdvisor |
| Application Performance | Response time, throughput, error rate | APM tools, custom metrics |
| Log Analysis | Application logs, system logs, error logs | ELK Stack, Fluentd |
| Health Status | Service availability, dependency checks | Health checks, probes |
Monitoring Architecture Patterns
Container Monitoring Architecture:
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ App Container │   │ App Container │   │ App Container │
└───────────────┘   └───────────────┘   └───────────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            │
                 ┌────────────────────┐
                 │  Monitoring Agent  │  (cAdvisor, Node Exporter)
                 └────────────────────┘
                            │
                 ┌────────────────────┐
                 │ Monitoring System  │  (Prometheus, Grafana)
                 └────────────────────┘
                            │
                 ┌────────────────────┐
                 │  Alerting System   │  (AlertManager, PagerDuty)
                 └────────────────────┘
Docker Built-in Monitoring
The docker stats Command
# Real-time monitoring of all containers
docker stats
# Monitor specific containers
docker stats web-server api-server
# Formatted output
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"
# One-time output (no continuous refresh)
docker stats --no-stream
# Show all containers (including stopped ones)
docker stats --all
# Output to a file
docker stats --format "{{.Name}},{{.CPUPerc}},{{.MemUsage}}" --no-stream > container_stats.csv
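For a lightweight usage history without a full monitoring stack, you can sample docker stats on a schedule. A minimal sketch (the file name and interval are arbitrary choices):
#!/bin/bash
# Append a timestamped docker stats sample to a CSV file every 30 seconds
OUTPUT=container_stats.csv
while true; do
  docker stats --no-stream --format "{{.Name}},{{.CPUPerc}},{{.MemUsage}}" |
  while IFS= read -r line; do
    echo "$(date -Iseconds),$line" >> "$OUTPUT"
  done
  sleep 30
done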
System Event Monitoring
# Real-time monitoring of Docker events
docker events
# Filter events for a specific container
docker events --filter container=web-server
# Filter by event type
docker events --filter event=start
docker events --filter event=die
docker events --filter event=restart
# Filter by time range
docker events --since="2023-01-01"
docker events --until="2023-01-02"
# Format event output
docker events --format 'Time={{.Time}} Action={{.Action}} Container={{.Actor.Attributes.name}}'
# Save the last hour of events to a file (streams until stopped with Ctrl+C)
docker events --since="1h" > docker_events.log
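Because docker events streams until interrupted, it pairs well with a small watcher process. A rough sketch that records every container die event together with its exit code (the exitCode attribute is part of the die event's metadata):
#!/bin/bash
# Record the name and exit code of every container that dies
docker events --filter event=die \
  --format '{{.Time}} {{.Actor.Attributes.name}} exitCode={{.Actor.Attributes.exitCode}}' |
while IFS= read -r line; do
  echo "[container died] $line" >> container_deaths.log
done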
Container Resource Limit Monitoring
# View container resource limits
docker inspect container-name | jq '.[0].HostConfig | {Memory, CpuShares, CpuQuota, CpuPeriod}'
# Run a container with resource limits
docker run -d --name limited-container \
--memory=512m \
--cpus=0.5 \
--memory-swap=1g \
nginx:alpine
# Monitor if resource usage exceeds limits
docker stats limited-container
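To turn those numbers into an actual check, compare the memory percentage reported by docker stats against a threshold. A minimal sketch (the container name and threshold are placeholders; assumes bc is installed):
#!/bin/bash
# Warn when a container uses more than 80% of its memory limit
CONTAINER=limited-container
THRESHOLD=80
MEM_PERC=$(docker stats --no-stream --format "{{.MemPerc}}" "$CONTAINER" | tr -d '%')
if (( $(echo "$MEM_PERC > $THRESHOLD" | bc -l) )); then
  echo "WARNING: $CONTAINER memory usage is ${MEM_PERC}% of its limit"
fi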
Advanced Monitoring Solutions
cAdvisor Monitoring
# Run the cAdvisor container
docker run -d \
--name=cadvisor \
--restart=unless-stopped \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
gcr.io/cadvisor/cadvisor:latest
# Access the cAdvisor web interface
# http://localhost:8080
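Besides the web UI, cAdvisor exposes its data on a Prometheus-format /metrics endpoint, which is what Prometheus scrapes later in this chapter. A quick way to check it by hand:
# Confirm that cAdvisor is exporting metrics
curl -s http://localhost:8080/metrics | head -n 20
# Per-container memory usage series
curl -s http://localhost:8080/metrics | grep '^container_memory_usage_bytes' | head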
Prometheus + Grafana Monitoring Stack
# docker-compose.monitoring.yml
version: '3.8'
services:
# Prometheus monitoring system
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/rules:/etc/prometheus/rules:ro
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
networks:
- monitoring
# Grafana visualization
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_INSTALL_PLUGINS=grafana-piechart-panel
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
depends_on:
- prometheus
networks:
- monitoring
# Node Exporter (host metrics)
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- monitoring
# cAdvisor (container metrics)
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
privileged: true
devices:
- /dev/kmsg
networks:
- monitoring
# AlertManager (alert management)
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
- alertmanager_data:/alertmanager
networks:
- monitoring
volumes:
prometheus_data: {}
grafana_data: {}
alertmanager_data: {}
networks:
monitoring:
driver: bridge
Prometheus configuration file:
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Docker service discovery needs the Docker socket mounted into the
  # Prometheus container (e.g. /var/run/docker.sock:/var/run/docker.sock:ro)
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        port: 8080
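Once Prometheus is running, you can verify the scrape targets and run ad-hoc queries through its HTTP API; for example (both endpoints are part of the standard Prometheus API):
# List scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Ad-hoc query: per-container CPU usage rate over the last 5 minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total{name!=""}[5m])'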
Custom Application Metrics
# Python Flask app integration with Prometheus
from flask import Flask, Response, request, g
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

@app.before_request
def before_request():
    # Store the start time on the per-request context so concurrent requests don't interfere
    g.start_time = time.time()

@app.after_request
def after_request(response):
    request_latency = time.time() - g.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.observe(request_latency)
    return response

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/')
def home():
    return {'message': 'Hello World', 'timestamp': time.time()}

@app.route('/health')
def health():
    return {'status': 'healthy'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
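To spot-check the exporter before wiring it into Prometheus, build and run the app and hit its /metrics endpoint (the image name is illustrative, and a Dockerfile for the app is assumed but not shown):
# Build and run the instrumented app (image name is an example)
docker build -t flask-metrics-demo .
docker run -d --name flask-metrics-demo -p 5000:5000 flask-metrics-demo
# Generate a little traffic, then inspect the counters
curl -s http://localhost:5000/ > /dev/null
curl -s http://localhost:5000/metrics | grep '^http_requests_total'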
Log Management
Docker Log Drivers
# View the current log driver
docker info | grep "Logging Driver"
# Configure the global log driver
sudo tee /etc/docker/daemon.json << 'EOF'
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3",
"compress": "true"
}
}
EOF
# Restart the Docker service
sudo systemctl restart docker
# Set the log driver for a specific container
docker run -d --name app \
--log-driver json-file \
--log-opt max-size=10m \
--log-opt max-file=3 \
myapp:latest
# Use the syslog driver
docker run -d --name app \
--log-driver syslog \
--log-opt syslog-address=tcp://192.168.1.100:514 \
--log-opt tag="{{.Name}}" \
myapp:latest
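Whichever driver you configure (as long as it supports reading, such as json-file or local), docker logs remains the first tool for inspecting a container's output:
# Follow a container's logs in real time
docker logs -f app
# Show only the most recent lines, with timestamps
docker logs --tail 100 --timestamps app
# Show logs since a point in time (relative or absolute)
docker logs --since 30m app
docker logs --since "2023-01-01T00:00:00" app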
ELK Stack Logging System
# docker-compose.elk.yml
version: '3.8'
services:
# Elasticsearch
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
container_name: elasticsearch
environment:
- discovery.type=single-node
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
- xpack.security.enabled=false
ports:
- "9200:9200"
volumes:
- elasticsearch_data:/usr/share/elasticsearch/data
networks:
- elk
# Logstash
logstash:
image: docker.elastic.co/logstash/logstash:7.15.0
container_name: logstash
ports:
- "5044:5044"
- "9600:9600"
volumes:
- ./logstash/config:/usr/share/logstash/pipeline:ro
- ./logstash/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
depends_on:
- elasticsearch
networks:
- elk
# Kibana
kibana:
image: docker.elastic.co/kibana/kibana:7.15.0
container_name: kibana
ports:
- "5601:5601"
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
depends_on:
- elasticsearch
networks:
- elk
# Filebeat (log collector)
filebeat:
image: docker.elastic.co/beats/filebeat:7.15.0
container_name: filebeat
user: root
volumes:
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
depends_on:
- logstash
networks:
- elk
# Example application
web-app:
image: nginx:alpine
container_name: web-app
ports:
- "8080:80"
labels:
- "co.elastic.logs/enabled=true"
- "co.elastic.logs/module=nginx"
networks:
- elk
volumes:
elasticsearch_data: {}
networks:
elk:
driver: bridge
Filebeat configuration:
# filebeat/filebeat.yml
filebeat.inputs:
- type: container
paths:
- '/var/lib/docker/containers/*/*.log'
processors:
- add_docker_metadata:
host: "unix:///var/run/docker.sock"
output.logstash:
hosts: ["logstash:5044"]
logging.level: info
logging.to_files: true
logging.files:
path: /var/log/filebeat
name: filebeat
keepfiles: 7
permissions: 0644
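After the stack is up, a quick way to confirm that logs are flowing end to end is to query Elasticsearch directly. The index pattern depends on the Logstash pipeline configuration (not shown here), so treat logstash-* as an assumption:
# Check cluster health and list indices
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cat/indices?v'
# Peek at the most recent documents (index pattern is an assumption)
curl -s 'http://localhost:9200/logstash-*/_search?size=5&sort=@timestamp:desc&pretty'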
Application Log Structuring
# Python application structured logging
import logging
import json
import sys
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'level': record.levelname,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }
        # Add extra fields
        if hasattr(record, 'user_id'):
            log_entry['user_id'] = record.user_id
        if hasattr(record, 'request_id'):
            log_entry['request_id'] = record.request_id
        return json.dumps(log_entry)

# Configure logging: write JSON lines to stdout so the container log driver picks them up
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logger = logging.getLogger(__name__)

# Usage example
logger.info("User login successful", extra={'user_id': 123, 'request_id': 'req-456'})
logger.error("Database connection failed", extra={'error_code': 'DB001'})
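One immediate payoff of JSON logs is that you can filter them straight from docker logs with jq (the container name is illustrative; fromjson? silently skips any non-JSON lines):
# Show only ERROR-level entries from a container that emits JSON logs
docker logs my-python-app 2>&1 | jq -R 'fromjson? | select(.level == "ERROR")'
# Count log entries per level
docker logs my-python-app 2>&1 | jq -R 'fromjson? | .level' | sort | uniq -c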
Health Checks
Docker Health Checks
# Define a health check in a Dockerfile
FROM nginx:alpine
# Install curl for health checks
RUN apk add --no-cache curl
# Define the health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost/ || exit 1
# Alternative: a custom health check script (only the last HEALTHCHECK instruction in a Dockerfile takes effect)
COPY health-check.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/health-check.sh
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD /usr/local/bin/health-check.sh
#!/bin/bash
# health-check.sh
set -e
# Check the web service
if ! curl -f http://localhost/ >/dev/null 2>&1; then
echo "Web service health check failed"
exit 1
fi
# Check the database connection
if ! pg_isready -h db -p 5432 >/dev/null 2>&1; then
echo "Database health check failed"
exit 1
fi
# Check the Redis connection
if ! redis-cli -h redis ping | grep -q "PONG"; then
echo "Redis health check failed"
exit 1
fi
echo "All health checks passed"
exit 0
Health Checks in Compose
version: '3.8'
services:
web:
image: nginx:alpine
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost/health"]
interval: 1m30s
timeout: 10s
retries: 3
start_period: 40s
api:
build: ./api
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
interval: 30s
timeout: 5s
retries: 3
depends_on:
database:
condition: service_healthy
database:
image: postgres:13
environment:
POSTGRES_DB: mydb
POSTGRES_USER: user
POSTGRES_PASSWORD: password
healthcheck:
test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
Health Check Monitoring
# View container health status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Check health status details
docker inspect --format='{{.State.Health.Status}}' container-name
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' container-name
# Show only unhealthy containers
docker ps --filter health=unhealthy
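Health status becomes more useful when something reacts to it. A deliberately blunt sketch that restarts any unhealthy container once a minute (a real setup would usually alert instead of blindly restarting):
#!/bin/bash
# Every 60 seconds, restart containers whose health check reports "unhealthy"
while true; do
  for name in $(docker ps --filter health=unhealthy --format '{{.Names}}'); do
    echo "$(date -Iseconds) restarting unhealthy container: $name"
    docker restart "$name"
  done
  sleep 60
done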
Practical Monitoring Case
Complete Monitoring Solution
# Create a monitoring project
mkdir docker-monitoring
cd docker-monitoring
# Create directory structure
mkdir -p {prometheus/{rules,targets},grafana/{dashboards,provisioning/{dashboards,datasources}},alertmanager}
# docker-compose.monitoring.yml
version: '3.8'
services:
# Application service
web-app:
build: ./app
container_name: web-app
ports:
- "8000:8000"
environment:
- PROMETHEUS_METRICS=true
labels:
- "prometheus.scrape=true"
- "prometheus.port=8000"
- "prometheus.path=/metrics"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
networks:
- app
- monitoring
# Database
postgres:
image: postgres:13
container_name: postgres
environment:
POSTGRES_DB: appdb
POSTGRES_USER: appuser
POSTGRES_PASSWORD: apppass
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
interval: 10s
timeout: 5s
retries: 5
networks:
- app
# Monitoring stack
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/rules:/etc/prometheus/rules
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
- '--web.route-prefix=/'
- '--web.external-url=http://localhost:9090'
networks:
- monitoring
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
- ./grafana/dashboards:/var/lib/grafana/dashboards
networks:
- monitoring
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager_data:/alertmanager
networks:
- monitoring
# Exporters
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- monitoring
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
privileged: true
networks:
- monitoring
volumes:
postgres_data: {}
prometheus_data: {}
grafana_data: {}
alertmanager_data: {}
networks:
app:
driver: bridge
monitoring:
driver: bridge
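The compose file mounts ./grafana/provisioning, but no provisioning files are shown. A minimal datasource definition that points Grafana at the Prometheus service could look like this (the file name is arbitrary; the layout follows Grafana's provisioning conventions):
mkdir -p grafana/provisioning/datasources
cat > grafana/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF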
Alerting rules configuration:
# prometheus/rules/alerts.yml
groups:
  - name: container.rules
    rules:
      - alert: ContainerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.instance }} is down"
          description: "Container {{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.name }}"
          description: "Container {{ $labels.name }} CPU usage is {{ $value }}% (threshold: 80%)."

      - alert: HighMemoryUsage
        expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.name }}"
          description: "Container {{ $labels.name }} memory usage is {{ $value }}% of its limit (threshold: 80%)."

      - alert: ContainerUnhealthy
        expr: container_healthcheck_failures_total > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is unhealthy"
          description: "Container {{ $labels.name }} has failed health checks {{ $value }} times."
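Both compose files also mount ./alertmanager/alertmanager.yml, which is not shown above. A minimal configuration that sends every alert to a single webhook receiver might look like this (the webhook URL is a placeholder for your own endpoint, e.g. a chat integration):
cat > alertmanager/alertmanager.yml << 'EOF'
route:
  receiver: default
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: default
    webhook_configs:
      - url: 'http://alert-webhook.example.com/notify'
EOF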
Monitoring Best Practices
- The Four Golden Signals: Monitor latency, traffic, errors, and saturation (see the example queries after this list).
- Layered Monitoring: Cover the infrastructure, platform, application, and business layers.
- Alerting Strategy: Avoid alert fatigue by setting sensible thresholds and suppression rules.
- Structured Logging: Emit structured logs so they are easy to query and analyze.
- Monitoring as Code: Keep monitoring configuration under version control.
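As a concrete starting point, the four golden signals can be expressed as PromQL queries over the metrics collected earlier in this chapter (HTTP metrics from the Flask example, container metrics from cAdvisor); treat the exact expressions as a sketch:
# Latency: 95th percentile request duration
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
# Traffic: requests per second
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))'
# Errors: share of requests answered with a 5xx status
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# Saturation: container memory usage relative to its limit
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}'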
Important Notes
- The monitoring system itself also needs to be monitored so it does not become a single point of failure.
- Plan log storage around disk space and retention policies.
- Sensitive information must never appear in logs.
- Monitoring data needs secure transmission and access control.
Summary
By completing this chapter, you should have mastered:
- Basic Monitoring: Using Docker's built-in tools for container monitoring.
- Advanced Monitoring: Deploying a Prometheus+Grafana monitoring stack.
- Log Management: Using the ELK Stack for log collection and analysis.
- Health Checks: Configuring container health check mechanisms.
- Monitoring Practices: Establishing a complete monitoring and alerting system.
In the next chapter, we will learn about Docker security and best practices to ensure the secure operation of containerized applications.