Chapter 7: Monitoring and Logging

Learning Objectives
  • Master Prometheus + Grafana monitoring solution
  • Learn EFK/ELK log collection and analysis
  • Understand Kubernetes native monitoring metrics
  • Become proficient in alert configuration and troubleshooting

Key Concepts

Three Pillars of Observability

(Diagram: the three pillars of observability are metrics, logs, and traces)

Monitoring Architecture Overview

(Diagram: monitoring architecture overview)

Prometheus Monitoring

Prometheus Architecture

(Diagram: Prometheus architecture)

Installing Prometheus Stack with Helm

# Add Prometheus community repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Prometheus, Grafana, AlertManager, etc.)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Check deployment status
kubectl get pods -n monitoring

# View services
kubectl get svc -n monitoring

Accessing Prometheus and Grafana

# Method 1: Port forwarding
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring

# Method 2: Create Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
  - host: prometheus.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-kube-prometheus-prometheus
            port:
              number: 9090
  - host: grafana.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-grafana
            port:
              number: 80
# Get Grafana default password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Default username: admin

PromQL Query Language

# Basic queries
# List every time series (can be very expensive on large installations; use with care)
{__name__=~".+"}

# CPU usage
# Node CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod)

# Memory usage
# Node memory usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Pod memory usage
sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod)

# Network traffic
# Receive bytes rate
sum(rate(container_network_receive_bytes_total[5m])) by (namespace, pod)

# Transmit bytes rate
sum(rate(container_network_transmit_bytes_total[5m])) by (namespace, pod)

# HTTP requests
# Request rate
sum(rate(http_requests_total[5m])) by (service)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# 95th percentile response time (P95)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Kubernetes-specific metrics
# Pod count
count(kube_pod_info) by (namespace)

# Unready Pods
sum(kube_pod_status_ready{condition="false"}) by (namespace)

# Deployment replica status
kube_deployment_status_replicas_available / kube_deployment_spec_replicas

# PVC utilization
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
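
The same expressions can be run outside the UI through the Prometheus HTTP API, which is useful for scripting. A minimal sketch, assuming the port-forward from the access section above and GNU date:

# Instant query via the HTTP API (URL-encode the PromQL expression)
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)'

# Range query over the last hour at 1-minute resolution
curl -s -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (service)' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60'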

Custom Monitoring Metrics

# Create ServiceMonitor to monitor custom applications
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    release: prometheus  # Must match Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames:
    - default
    - production
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    scheme: http
# Application needs to expose metrics port
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"   # must match the metrics containerPort below
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        ports:
        - name: http
          containerPort: 8080
        - name: metrics
          containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
  - name: http
    port: 8080
  - name: metrics
    port: 9090
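
After applying the ServiceMonitor, Deployment, and Service, it is worth confirming that Prometheus actually discovered the new target. A rough check, assuming the Prometheus port-forward from earlier (the port-forward to deploy/myapp is only for a direct spot check):

# List the scrape jobs Prometheus currently knows about
curl -s http://localhost:9090/api/v1/targets | grep -o '"job":"[^"]*"' | sort -u

# Hit the application's metrics endpoint directly to confirm it responds
kubectl port-forward deploy/myapp 9091:9090 &
curl -s http://localhost:9091/metrics | head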

Alert Rules Configuration

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: pod-alerts
    rules:
    # Pod restarting too much
    - alert: PodRestartingTooMuch
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting too much"
        description: "Pod has restarted {{ $value }} times in the past hour"

    # Pod not ready
    - alert: PodNotReady
      expr: kube_pod_status_ready{condition="false"} == 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
        description: "Pod has been in non-Ready state for 5 minutes"

  - name: resource-alerts
    rules:
    # High CPU usage
    - alert: HighCPUUsage
      expr: |
        sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod)
        / sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace, pod) * 100 > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} high CPU usage"
        description: "CPU usage is at {{ $value | printf \"%.1f\" }}%"

    # High memory usage
    - alert: HighMemoryUsage
      expr: |
        sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod)
        / sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod) * 100 > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} high memory usage"
        description: "Memory usage is at {{ $value | printf \"%.1f\" }}%"

    # PVC storage almost full
    - alert: PVCStorageAlmostFull
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} storage almost full"
        description: "Storage usage is at {{ $value | printf \"%.1f\" }}%"

  - name: cluster-alerts
    rules:
    # Node not ready
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} not ready"
        description: "Node has been in NotReady state for 5 minutes"

    # Node disk pressure
    - alert: NodeDiskPressure
      expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} disk pressure"
        description: "Node has insufficient disk space"

AlertManager Configuration

# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'password'

    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default-receiver'
      routes:
      - match:
          severity: critical
        receiver: 'critical-receiver'
        continue: true
      - match:
          severity: warning
        receiver: 'warning-receiver'

    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'ops@example.com'
        send_resolved: true

    - name: 'critical-receiver'
      email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

    - name: 'warning-receiver'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts-warning'
        send_resolved: true

    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'namespace']
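
Before applying the Secret, the configuration can be validated and then exercised with a test alert. A sketch, assuming amtool is installed locally and that the Alertmanager Service is named prometheus-kube-prometheus-alertmanager (the default for this release name):

# Validate the alertmanager.yaml content (extract it from the Secret manifest first)
amtool check-config alertmanager.yaml

# Apply the Secret; the operator reloads Alertmanager automatically
kubectl apply -f alertmanager-config.yaml

# Send a test alert through the Alertmanager v2 API
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n monitoring &
curl -XPOST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning", "namespace": "monitoring"}}]'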

Grafana Visualization

Importing Dashboards

# Common Dashboard IDs
# Kubernetes Cluster Monitoring: 315
# Node Exporter: 1860
# Kubernetes Pod Monitoring: 6417
# NGINX Ingress: 9614

# In Grafana UI:
# 1. Click + -> Import
# 2. Enter Dashboard ID
# 3. Select Prometheus data source
# 4. Click Import

Custom Dashboard

{
  "dashboard": {
    "title": "My Application Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ]
      },
      {
        "title": "Response Time P95",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{ service }}"
          }
        ]
      }
    ]
  }
}
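
A dashboard like the JSON above can also be pushed through Grafana's HTTP API instead of the UI. A minimal sketch, assuming the JSON is saved as my-dashboard.json and using the port-forward and admin password retrieved earlier:

kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring &

# POST the dashboard JSON; add "overwrite": true to the payload to update an existing dashboard
curl -s -X POST "http://admin:<admin-password>@localhost:3000/api/dashboards/db" \
  -H 'Content-Type: application/json' \
  -d @my-dashboard.json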

EFK Logging System

EFK Architecture

(Diagram: EFK architecture)

Installing Elasticsearch

# Add Elastic repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace \
  --set replicas=3 \
  --set minimumMasterNodes=2 \
  --set resources.requests.memory=2Gi \
  --set resources.limits.memory=4Gi \
  --set volumeClaimTemplate.resources.requests.storage=50Gi

# Wait for Elasticsearch to be ready
kubectl get pods -n logging -w
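
Once the pods are Running, a quick health check confirms the cluster formed correctly (elasticsearch-master is the Service name created by this chart's default node group):

kubectl port-forward svc/elasticsearch-master 9200:9200 -n logging &
curl -s "http://localhost:9200/_cluster/health?pretty"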

Installing Kibana

helm install kibana elastic/kibana \
  --namespace logging \
  --set elasticsearchHosts="http://elasticsearch-master:9200"

# Create Ingress to access Kibana
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kibana-ingress
  namespace: logging
spec:
  ingressClassName: nginx
  rules:
  - host: kibana.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kibana-kibana
            port:
              number: 5601

Installing Fluent Bit

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --set config.outputs="[OUTPUT]\n    Name es\n    Match *\n    Host elasticsearch-master\n    Port 9200\n    Logstash_Format On\n    Retry_Limit False"

Fluent Bit Configuration Details

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
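        # Note: on containerd or CRI-O nodes container logs are not JSON; use a CRI parser instead of docker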
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name          modify
        Match         *
        Add           cluster production
        Add           environment prod

    [OUTPUT]
        Name            es
        Match           *
        Host            elasticsearch-master
        Port            9200
        Logstash_Format On
        Logstash_Prefix kubernetes
        Retry_Limit     False
        Replace_Dots    On

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name        syslog
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S

Application Log Configuration

# Configure application to output JSON format logs
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # Specify log parser
        fluentbit.io/parser: json
        # Exclude from log collection
        # fluentbit.io/exclude: "true"
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        env:
        - name: LOG_FORMAT
          value: "json"
        - name: LOG_LEVEL
          value: "info"

Kibana Log Queries

# Basic query
kubernetes.namespace_name: "production"

# Error logs
level: "error" AND kubernetes.namespace_name: "production"

# Specific Pod logs
kubernetes.pod_name: "myapp-*"

# Time range
@timestamp:[2024-01-01 TO 2024-01-02]

# Combined query
kubernetes.namespace_name: "production" AND level: "error" AND message: *timeout*

# Exclude query
kubernetes.namespace_name: "production" AND NOT kubernetes.container_name: "sidecar"

Native Monitoring Metrics

Metrics Server

# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify installation
kubectl get deployment metrics-server -n kube-system

# Use kubectl top
kubectl top nodes
kubectl top pods
kubectl top pods -n kube-system --sort-by=memory
kubectl top pods --containers  # Show container-level metrics
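
kubectl top reads from the Metrics API, which can also be queried directly when scripting:

# Raw Metrics API queries (the same data kubectl top consumes)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods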

Resource Usage Queries

# Node resources
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

# Pod resource usage
kubectl get pods -o custom-columns=\
"NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory"

# Resource quota usage
kubectl describe resourcequota -n <namespace>

Practical Exercise

Complete Monitoring Solution Deployment

# monitoring-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
# prometheus-values.yaml (values file for the Helm installation below)
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    additionalScrapeConfigs:
    - job_name: 'custom-apps'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

grafana:
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "secure-password"
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      kubernetes-cluster:
        gnetId: 315
        revision: 3
        datasource: Prometheus
      node-exporter:
        gnetId: 1860
        revision: 27
        datasource: Prometheus
# Deploy complete monitoring stack
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --create-namespace \
  -f prometheus-values.yaml

# Verify deployment
kubectl get pods -n monitoring
kubectl get svc -n monitoring

Application Monitoring Example

# Example application: Web service with monitoring
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9113"   # the exporter sidecar's metrics port
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: web-app
        image: nginx:1.20
        ports:
        - name: http
          containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /          # stock nginx serves /; adjust if your image exposes /health
            port: http
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: http
          initialDelaySeconds: 5
          periodSeconds: 5
      # Prometheus exporter sidecar (nginx must expose stub_status at the URI below)
      - name: nginx-exporter
        image: nginx/nginx-prometheus-exporter:0.11
        args:
        - -nginx.scrape-uri=http://localhost:80/stub_status
        ports:
        - name: metrics
          containerPort: 9113
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
          limits:
            cpu: 50m
            memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  selector:
    app: web-app
  ports:
  - name: http
    port: 80
  - name: metrics
    port: 9113   # targets the nginx-exporter sidecar
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: web-app
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  - port: metrics
    interval: 15s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-app-alerts
  namespace: monitoring
spec:
  groups:
  - name: web-app
    rules:
    - alert: WebAppDown
      expr: up{job="web-app"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Web application unavailable"
        description: "{{ $labels.instance }} has been unreachable for 1 minute"

    - alert: WebAppHighErrorRate
      expr: |
        sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
        / sum(rate(nginx_http_requests_total[5m])) * 100 > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Web application error rate too high"
        description: "Error rate is at {{ $value | printf \"%.1f\" }}%"

Troubleshooting

Common Troubleshooting Commands

# View Pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -c <container-name>  # Multiple containers
kubectl logs <pod-name> --previous  # View previous container logs
kubectl logs -f <pod-name>  # Follow logs in real-time
kubectl logs --since=1h <pod-name>  # Last 1 hour
kubectl logs --tail=100 <pod-name>  # Last 100 lines

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning

# View Pod details
kubectl describe pod <pod-name> -n <namespace>

# Enter Pod for debugging
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec -it <pod-name> -c <container> -- /bin/bash

# Attach an ephemeral debug container (built in since Kubernetes 1.25)
kubectl debug <pod-name> -it --image=busybox

# Resource usage
kubectl top pods -n <namespace>
kubectl top nodes

Monitoring System Troubleshooting

# Prometheus troubleshooting
# Check targets status
curl http://prometheus:9090/api/v1/targets

# Check alert status
curl http://prometheus:9090/api/v1/alerts

# Check configuration
kubectl get prometheusrules -A
kubectl describe prometheus prometheus-kube-prometheus-prometheus -n monitoring

# AlertManager troubleshooting
# Check alerts
curl http://alertmanager:9093/api/v2/alerts

# Check silence rules
curl http://alertmanager:9093/api/v2/silences

# Grafana troubleshooting
# Check data sources
kubectl logs deployment/prometheus-grafana -n monitoring

# Logging system troubleshooting
# Elasticsearch health status
curl http://elasticsearch:9200/_cluster/health?pretty

# Index status
curl http://elasticsearch:9200/_cat/indices?v

# Fluent Bit status
kubectl logs daemonset/fluent-bit -n logging

Monitoring Best Practices
  1. Layered monitoring: Infrastructure → Platform → Application monitoring layers
  2. Reasonable alerts: Avoid alert fatigue, set appropriate thresholds
  3. Standardized logging: Unified log format for easy querying and analysis
  4. Retention policy: Set appropriate data retention period based on needs
  5. Resource planning: Monitoring system itself needs sufficient resources
  6. Regular drills: Periodically test if alert notifications work properly

Summary

Through this chapter, you should have mastered:

  • Prometheus monitoring: Metric collection, PromQL queries, alert configuration
  • Grafana visualization: Dashboard creation and usage
  • EFK logging system: Log collection, storage, and querying
  • Native monitoring: Metrics Server and kubectl top
  • Troubleshooting: Common troubleshooting commands and techniques

In the next chapter, we will learn about Security and RBAC, mastering Kubernetes permission control and security policy configuration.