Chapter 7: Monitoring and Logging
Haiyue
Learning Objectives
- Master Prometheus + Grafana monitoring solution
- Learn EFK/ELK log collection and analysis
- Understand Kubernetes native monitoring metrics
- Become proficient in alert configuration and troubleshooting
Key Concepts
Three Pillars of Observability
(Diagram: the three pillars of observability are metrics, logs, and traces.)
Monitoring Architecture Overview
(Diagram: overall monitoring architecture, with metrics flowing from cluster components and application exporters into Prometheus, dashboards served by Grafana, and alert routing handled by Alertmanager.)
Prometheus Monitoring
Prometheus Architecture
(Diagram: Prometheus architecture. The Prometheus server discovers and scrapes targets, stores samples in its local TSDB, evaluates recording and alerting rules, and forwards alerts to Alertmanager, while Grafana queries Prometheus for visualization.)
Installing Prometheus Stack with Helm
# Add Prometheus community repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes Prometheus, Grafana, AlertManager, etc.)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
# Check deployment status
kubectl get pods -n monitoring
# View services
kubectl get svc -n monitoring
Accessing Prometheus and Grafana
# Method 1: Port forwarding
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
# Method 2: Create Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: monitoring-ingress
namespace: monitoring
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx
rules:
- host: prometheus.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-kube-prometheus-prometheus
port:
number: 9090
- host: grafana.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-grafana
port:
number: 80
# Get Grafana default password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Default username: admin
PromQL Query Language
# Basic queries
# View all available metrics
{__name__=~".+"}
# CPU usage
# Node CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod)
# Memory usage
# Node memory usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Pod memory usage
sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod)
# Network traffic
# Receive bytes rate
sum(rate(container_network_receive_bytes_total[5m])) by (namespace, pod)
# Transmit bytes rate
sum(rate(container_network_transmit_bytes_total[5m])) by (namespace, pod)
# HTTP requests
# Request rate
sum(rate(http_requests_total[5m])) by (service)
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# P95 response time (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Kubernetes-specific metrics
# Pod count
count(kube_pod_info) by (namespace)
# Unready Pods
sum(kube_pod_status_ready{condition="false"}) by (namespace)
# Deployment replica status
kube_deployment_status_replicas_available / kube_deployment_spec_replicas
# PVC utilization
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
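These expressions can also be run against the Prometheus HTTP API, which is handy for scripts and quick checks. The sketch below assumes the port-forward to localhost:9090 shown earlier; jq is only used for formatting.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod)' \
  | jq '.data.result[0:3]'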
Custom Monitoring Metrics
# Create ServiceMonitor to monitor custom applications
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: myapp-monitor
namespace: monitoring
labels:
release: prometheus # Must match Prometheus serviceMonitorSelector
spec:
selector:
matchLabels:
app: myapp
namespaceSelector:
matchNames:
- default
- production
endpoints:
- port: metrics
interval: 30s
path: /metrics
scheme: http
# Application needs to expose metrics port
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 3
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
annotations:
prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
containers:
- name: myapp
image: myapp:latest
ports:
- name: http
containerPort: 8080
- name: metrics
containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
name: myapp
labels:
app: myapp
spec:
selector:
app: myapp
ports:
- name: http
port: 8080
- name: metrics
port: 9090
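With the Service and ServiceMonitor in place, a quick way to confirm scraping works is sketched below; the service name, namespace, and port are taken from the example above, and the local port 18080 is arbitrary.
# Confirm the application actually serves metrics on the port the ServiceMonitor targets
kubectl port-forward svc/myapp 18080:9090 -n default &
curl -s http://localhost:18080/metrics | head
# Then check the Prometheus UI under Status -> Targets: the myapp targets should show as UP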
Alert Rules Configuration
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: custom-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: pod-alerts
rules:
# Pod restarting too much
- alert: PodRestartingTooMuch
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting too much"
description: "Pod has restarted {{ $value }} times in the past hour"
# Pod not ready
- alert: PodNotReady
expr: kube_pod_status_ready{condition="false"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
description: "Pod has been in non-Ready state for 5 minutes"
- name: resource-alerts
rules:
# High CPU usage
- alert: HighCPUUsage
expr: |
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace, pod) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} high CPU usage"
description: "CPU usage is at {{ $value | printf \"%.1f\" }}%"
# High memory usage
- alert: HighMemoryUsage
expr: |
sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} high memory usage"
description: "Memory usage is at {{ $value | printf \"%.1f\" }}%"
# PVC storage almost full
- alert: PVCStorageAlmostFull
expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "PVC {{ $labels.persistentvolumeclaim }} storage almost full"
description: "Storage usage is at {{ $value | printf \"%.1f\" }}%"
- name: cluster-alerts
rules:
# Node not ready
- alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} not ready"
description: "Node has been in NotReady state for 5 minutes"
# Node disk pressure
- alert: NodeDiskPressure
expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Node {{ $labels.node }} disk pressure"
description: "Node has insufficient disk space"
AlertManager Configuration
# alertmanager-config.yaml
# (this overrides the Secret managed by kube-prometheus-stack; a helm upgrade may recreate it,
#  so the same content can alternatively be set through the chart's alertmanager.config values)
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-prometheus-kube-prometheus-alertmanager
namespace: monitoring
type: Opaque
stringData:
alertmanager.yaml: |
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager@example.com'
smtp_auth_password: 'password'
route:
group_by: ['alertname', 'namespace']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-receiver'
continue: true
- match:
severity: warning
receiver: 'warning-receiver'
receivers:
- name: 'default-receiver'
email_configs:
- to: 'ops@example.com'
send_resolved: true
- name: 'critical-receiver'
email_configs:
- to: 'oncall@example.com'
send_resolved: true
slack_configs:
- api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
channel: '#alerts-critical'
send_resolved: true
title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'warning-receiver'
slack_configs:
- api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
channel: '#alerts-warning'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'namespace']
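Assuming the manifest above is saved as alertmanager-config.yaml, apply it and verify that Alertmanager loads it cleanly. The pod and Service names below follow the kube-prometheus-stack defaults for a release named prometheus and may differ in your cluster.
kubectl apply -f alertmanager-config.yaml
kubectl logs alertmanager-prometheus-kube-prometheus-alertmanager-0 -n monitoring -c alertmanager | grep -i config
# Send a hand-crafted test alert to check routing and receivers end to end
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n monitoring &
curl -XPOST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","namespace":"default"}}]'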
Grafana Visualization
Importing Dashboards
# Common Dashboard IDs
# Kubernetes Cluster Monitoring: 315
# Node Exporter: 1860
# Kubernetes Pod Monitoring: 6417
# NGINX Ingress: 9614
# In Grafana UI:
# 1. Click + -> Import
# 2. Enter Dashboard ID
# 3. Select Prometheus data source
# 4. Click Import
Custom Dashboard
{
"dashboard": {
"title": "My Application Dashboard",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{ service }}"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
}
]
},
{
"title": "Response Time P95",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
"legendFormat": "{{ service }}"
}
]
}
]
}
}
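The same dashboard JSON can also be pushed through Grafana's HTTP API instead of the UI. This assumes the port-forward to localhost:3000 from earlier and that the JSON above is saved as my-dashboard.json (a name chosen for illustration); add "overwrite": true next to "dashboard" when re-importing.
curl -s -X POST http://localhost:3000/api/dashboards/db \
  -u admin:"$(kubectl get secret prometheus-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 -d)" \
  -H 'Content-Type: application/json' \
  -d @my-dashboard.json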
EFK Logging System
EFK Architecture
(Diagram: EFK architecture. Fluent Bit runs as a DaemonSet on every node, tails container log files, enriches them with Kubernetes metadata, and ships them to Elasticsearch, where Kibana provides search and visualization.)
Installing Elasticsearch
# Add Elastic repository
helm repo add elastic https://helm.elastic.co
helm repo update
# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch \
--namespace logging \
--create-namespace \
--set replicas=3 \
--set minimumMasterNodes=2 \
--set resources.requests.memory=2Gi \
--set resources.limits.memory=4Gi \
--set volumeClaimTemplate.resources.requests.storage=50Gi
# Wait for Elasticsearch to be ready
kubectl get pods -n logging -w
Installing Kibana
helm install kibana elastic/kibana \
--namespace logging \
--set elasticsearchHosts="http://elasticsearch-master:9200"
# Create Ingress to access Kibana
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: kibana-ingress
namespace: logging
spec:
ingressClassName: nginx
rules:
- host: kibana.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kibana-kibana
port:
number: 5601
Installing Fluent Bit
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  -f fluent-bit-values.yaml   # values file shown below; embedding newlines in a --set string is error-prone
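A minimal fluent-bit-values.yaml for the command above might look like this. The config.outputs key follows the fluent/fluent-bit chart's values layout; adjust the host if your Elasticsearch service has a different name, and Suppress_Type_Name is only needed when writing to Elasticsearch 8.x.
# fluent-bit-values.yaml
config:
  outputs: |
    [OUTPUT]
        Name               es
        Match              *
        Host               elasticsearch-master
        Port               9200
        Logstash_Format    On
        Replace_Dots       On
        Suppress_Type_Name On
        Retry_Limit        False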
Fluent Bit Configuration Details
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Log_Level info
Daemon off
Parsers_File parsers.conf
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name modify
Match *
Add cluster production
Add environment prod
[OUTPUT]
Name es
Match *
Host elasticsearch-master
Port 9200
Logstash_Format On
Logstash_Prefix kubernetes
Retry_Limit False
Replace_Dots On
parsers.conf: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name syslog
Format regex
Regex ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
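If the configuration is managed through this ConfigMap (the Fluent Bit DaemonSet must mount it; with the Helm chart the same content usually goes under the chart's config values instead), apply it and restart the DaemonSet so the change takes effect. The file name fluent-bit-config.yaml is assumed.
kubectl apply -f fluent-bit-config.yaml
kubectl rollout restart daemonset/fluent-bit -n logging
# Watch for es output errors or retries in the collector logs
kubectl logs daemonset/fluent-bit -n logging --tail=20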
Application Log Configuration
# Configure application to output JSON format logs
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
template:
metadata:
annotations:
# Specify log parser
fluentbit.io/parser: json
# Exclude from log collection
# fluentbit.io/exclude: "true"
spec:
containers:
- name: myapp
image: myapp:latest
env:
- name: LOG_FORMAT
value: "json"
- name: LOG_LEVEL
value: "info"
Kibana Log Queries
# Basic query
kubernetes.namespace_name: "production"
# Error logs
level: "error" AND kubernetes.namespace_name: "production"
# Specific Pod logs
kubernetes.pod_name: myapp-*
# Time range
@timestamp:[2024-01-01 TO 2024-01-02]
# Combined query
kubernetes.namespace_name: "production" AND level: "error" AND message: *timeout*
# Exclude query
kubernetes.namespace_name: "production" AND NOT kubernetes.container_name: "sidecar"
Native Monitoring Metrics
Metrics Server
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify installation
kubectl get deployment metrics-server -n kube-system
# Use kubectl top
kubectl top nodes
kubectl top pods
kubectl top pods -n kube-system --sort-by=memory
kubectl top pods --containers # Show container-level metrics
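kubectl top reads from the Metrics API, which can also be queried directly; this helps when kubectl top returns no data and you want to see what the API is actually serving (jq is only used for formatting).
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes \
  | jq '.items[] | {node: .metadata.name, cpu: .usage.cpu, memory: .usage.memory}'
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods \
  | jq '.items[].metadata.name'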
Resource Usage Queries
# Node resources
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
# Pod resource usage
kubectl get pods -o custom-columns=\
"NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory"
# Resource quota usage
kubectl describe resourcequota -n <namespace>
Practical Exercise
Complete Monitoring Solution Deployment
# monitoring-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
# prometheus-values.yaml (Helm values for kube-prometheus-stack, used by the helm command below)
prometheus:
prometheusSpec:
retention: 30d
storageSpec:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
additionalScrapeConfigs:
- job_name: 'custom-apps'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
alertmanager:
alertmanagerSpec:
storage:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
grafana:
persistence:
enabled: true
size: 10Gi
adminPassword: "secure-password"
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/default
dashboards:
default:
kubernetes-cluster:
gnetId: 315
revision: 3
datasource: Prometheus
node-exporter:
gnetId: 1860
revision: 27
datasource: Prometheus
# Deploy complete monitoring stack
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring \
--create-namespace \
-f prometheus-values.yaml
# Verify deployment
kubectl get pods -n monitoring
kubectl get svc -n monitoring
Application Monitoring Example
# Example application: Web service with monitoring
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
containers:
- name: web-app
image: nginx:1.20
ports:
- name: http
containerPort: 80
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
livenessProbe:
httpGet:
            path: /  # nginx:1.20 serves / by default; adjust if the app exposes a health endpoint
port: http
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
            path: /
port: http
initialDelaySeconds: 5
periodSeconds: 5
# Prometheus Exporter Sidecar
- name: nginx-exporter
image: nginx/nginx-prometheus-exporter:0.11
args:
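        # stub_status must be enabled in the nginx configuration for this endpoint to exist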
- -nginx.scrape-uri=http://localhost:80/stub_status
ports:
- name: metrics
containerPort: 9113
resources:
requests:
cpu: 10m
memory: 32Mi
limits:
cpu: 50m
memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
name: web-app
labels:
app: web-app
spec:
selector:
app: web-app
ports:
- name: http
port: 80
  - name: metrics
    port: 9113
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: web-app-monitor
  namespace: monitoring
  labels:
    release: prometheus  # must match the Prometheus serviceMonitorSelector (see above)
spec:
selector:
matchLabels:
app: web-app
namespaceSelector:
matchNames:
- default
endpoints:
- port: metrics
interval: 15s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: web-app-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
groups:
- name: web-app
rules:
- alert: WebAppDown
expr: up{job="web-app"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Web application unavailable"
description: "{{ $labels.instance }} has been unreachable for 1 minute"
- alert: WebAppHighErrorRate
expr: |
sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(nginx_http_requests_total[5m])) * 100 > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Web application error rate too high"
description: "Error rate is at {{ $value | printf \"%.1f\" }}%"
Troubleshooting
Common Troubleshooting Commands
# View Pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -c <container-name> # Multiple containers
kubectl logs <pod-name> --previous # View previous container logs
kubectl logs -f <pod-name> # Follow logs in real-time
kubectl logs --since=1h <pod-name> # Last 1 hour
kubectl logs --tail=100 <pod-name> # Last 100 lines
# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning
# View Pod details
kubectl describe pod <pod-name> -n <namespace>
# Enter Pod for debugging
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec -it <pod-name> -c <container> -- /bin/bash
# Create an ephemeral debug container (ephemeral containers are GA since Kubernetes 1.25)
kubectl debug <pod-name> -it --image=busybox
# Resource usage
kubectl top pods -n <namespace>
kubectl top nodes
Monitoring System Troubleshooting
# NOTE: the hostnames below (prometheus, alertmanager, elasticsearch) are placeholders;
# port-forward the corresponding Services or run the curl commands from inside the cluster
# Prometheus troubleshooting
# Check targets status
curl http://prometheus:9090/api/v1/targets
# Check alert status
curl http://prometheus:9090/api/v1/alerts
# Check configuration
kubectl get prometheusrules -A
kubectl describe prometheus prometheus-kube-prometheus-prometheus -n monitoring
# AlertManager troubleshooting
# Check alerts
curl http://alertmanager:9093/api/v2/alerts
# Check silence rules
curl http://alertmanager:9093/api/v2/silences
# Grafana troubleshooting
# Check data sources
kubectl logs deployment/prometheus-grafana -n monitoring
# Logging system troubleshooting
# Elasticsearch health status
curl http://elasticsearch:9200/_cluster/health?pretty
# Index status
curl http://elasticsearch:9200/_cat/indices?v
# Fluent Bit status
kubectl logs daemonset/fluent-bit -n logging
Monitoring Best Practices
- Layered monitoring: cover infrastructure, platform, and application layers separately
- Reasonable alerts: set thresholds that matter and avoid alert fatigue
- Standardized logging: emit a unified, structured log format so logs are easy to query and analyze (see the sample log line after this list)
- Retention policy: size data retention to actual needs and available storage
- Resource planning: the monitoring stack itself needs sufficient CPU, memory, and storage
- Regular drills: periodically verify that alert notifications actually reach the right people
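A structured JSON log line along these lines (field names are illustrative, not required by any particular tool) lets the Fluent Bit json parser and Kibana filter on fields directly instead of matching raw text:
{"@timestamp":"2024-01-01T12:00:00Z","level":"error","service":"web-app","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","message":"upstream timed out while connecting to payment-service"}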
Summary
Through this chapter, you should have mastered:
- Prometheus monitoring: Metric collection, PromQL queries, alert configuration
- Grafana visualization: Dashboard creation and usage
- EFK logging system: Log collection, storage, and querying
- Native monitoring: Metrics Server and kubectl top
- Troubleshooting: Common troubleshooting commands and techniques
In the next chapter, we will learn about Security and RBAC, mastering Kubernetes permission control and security policy configuration.