Chapter 10: Production Environment Best Practices

Summary of Argo ecosystem best practices in production environments, including high availability deployment, performance optimization, troubleshooting, and operations guide


This chapter summarizes best practices for running the Argo ecosystem in production, helping you build a stable, efficient, and maintainable cloud-native CI/CD platform.

10.1 High Availability Deployment

10.1.1 Argo CD High Availability Architecture

(Diagram: Argo CD high-availability architecture — Mermaid chart not rendered in this export.)

10.1.2 High Availability Deployment Configuration

# argocd-ha-install.yaml
# High Availability Argo CD Deployment
# kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# Custom high availability configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: argocd-server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: argocd-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2Gi
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 30
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 30
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-repo-server
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: argocd-repo-server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: argocd-repo-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2Gi
---
# Application Controller uses StatefulSet for sharding
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argo
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-application-controller
  template:
    spec:
      containers:
      - name: argocd-application-controller
        env:
        # Enable controller sharding
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "2"
        resources:
          requests:
            cpu: 1
            memory: 1Gi
          limits:
            cpu: 4
            memory: 4Gi
---
# Redis high availability
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-redis-ha
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-redis-ha
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: argocd-redis-ha
            topologyKey: kubernetes.io/hostname
      containers:
      - name: redis
        image: redis:7-alpine
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

10.1.3 Argo Workflows High Availability

# workflows-ha.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: workflow-controller
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: workflow-controller
              topologyKey: kubernetes.io/hostname
      containers:
      - name: workflow-controller
        args:
        - --configmap
        - workflow-controller-configmap
        - --executor-image
        - quay.io/argoproj/argoexec:latest  # pin to the controller's release version in production
        # Leader election
        - --leader-elect
        env:
        - name: LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-server
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: argo-server
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: argo-server
              topologyKey: kubernetes.io/hostname
      containers:
      - name: argo-server
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 1
            memory: 1Gi

10.1.4 EventBus High Availability

# eventbus-ha.yaml
apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata:
  name: default
  namespace: argo-events
spec:
  jetstream:
    version: "2.9.15"
    replicas: 3
    persistence:
      storageClassName: fast-ssd
      accessMode: ReadWriteOnce
      volumeSize: 20Gi
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              controller: eventbus-controller
          topologyKey: kubernetes.io/hostname
    containerTemplate:
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 2
          memory: 2Gi
    streamConfig: |
      max_msgs: 1000000
      max_bytes: 10737418240
      max_age: 168h
      replicas: 3
      retention: limits
      storage: file

10.2 Performance Optimization

10.2.1 Argo CD Performance Tuning

# argocd-performance-tuning.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argo
data:
  # NOTE: in recent Argo CD releases the controller.* and reposerver.*
  # tuning keys are read from argocd-cmd-params-cm (configured further
  # below), not from argocd-cm
  # Parallel sync limit
  controller.sync.concurrency.limit: "50"

  # Status cache
  controller.status.processors: "20"
  controller.operation.processors: "10"

  # Repo server parallelism limit
  reposerver.parallelism.limit: "200"

  # Timeout settings
  timeout.reconciliation: "180s"
  timeout.hard.reconciliation: "0"

  # Resource health check interval
  resource.healthcheck.interval: "10s"

  # Resource comparison options
  resource.compareoptions: |
    ignoreAggregatedRoles: true
    ignoreResourceStatusField: all

  # Exclude resources that don't need tracking
  resource.exclusions: |
    - apiGroups:
      - "cilium.io"
      kinds:
      - CiliumIdentity
      clusters:
      - "*"
    - apiGroups:
      - ""
      kinds:
      - "Event"
      clusters:
      - "*"

  # Repository credentials template
  repository.credentials: |
    - url: https://github.com/myorg
      usernameSecret:
        name: github-creds
        key: username
      passwordSecret:
        name: github-creds
        key: password
---
# Application Controller tuning
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argo
data:
  # Controller settings
  controller.status.processors: "50"
  controller.operation.processors: "25"
  controller.self.heal.timeout.seconds: "5"
  controller.repo.server.timeout.seconds: "180"

  # Repo Server settings
  reposerver.parallelism.limit: "100"
  reposerver.enable.git.submodule: "false"

  # Server settings
  server.insecure: "false"
  server.enable.gzip: "true"
  server.x.frame.options: "sameorigin"

10.2.2 Argo Workflows Performance Tuning

# workflows-performance.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    # Parallelism limit
    parallelism: 100

    # Namespace parallelism limit
    namespaceParallelism: 50

    # Resource rate limit
    resourceRateLimit:
      limit: 500
      burst: 1000

    # Pod GC
    podGCGracePeriodSeconds: 30
    podGCDeleteDelayDuration: 5s

    # Workflow archive
    persistence:
      archive: true
      archiveTTL: 7d
      postgresql:
        host: postgres
        port: 5432
        database: argo
        tableName: argo_workflows
        userNameSecret:
          name: argo-postgres-config
          key: username
        passwordSecret:
          name: argo-postgres-config
          key: password

    # Node status offloading
    nodeStatusOffLoad: true

    # Executor (wait sidecar) resources
    executorResources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 512Mi

    # Workflow defaults
    workflowDefaults:
      spec:
        ttlStrategy:
          secondsAfterCompletion: 3600
          secondsAfterSuccess: 1800
          secondsAfterFailure: 7200
        podGC:
          strategy: OnPodSuccess
          deleteDelayDuration: 60s
        # Default container settings applied to all templates
        templateDefaults:
          container:
            imagePullPolicy: IfNotPresent

    # Metrics
    metricsConfig:
      enabled: true
      path: /metrics
      port: 9090
      metricsTTL: 10m

    # Link configuration
    links:
    - name: Workflow Logs
      scope: workflow
      url: https://grafana.example.com/d/workflow-logs?var-workflow=${metadata.name}

10.2.3 Resource Limit Best Practices

# resource-limits.yaml
# Argo CD resource configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: argocd-server
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 2
            memory: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: argocd-repo-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 4
            memory: 4Gi
---
# Workflow Pod resource templates
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: resource-template
  namespace: argo
spec:
  templates:
  - name: small-task
    container:
      image: busybox
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi

  - name: medium-task
    container:
      image: busybox
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 2
          memory: 2Gi

  - name: large-task
    container:
      image: busybox
      resources:
        requests:
          cpu: 2
          memory: 2Gi
        limits:
          cpu: 4
          memory: 8Gi

  - name: gpu-task
    container:
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # pin a concrete CUDA tag; avoid "latest"
      resources:
        requests:
          cpu: 2
          memory: 4Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 4
          memory: 8Gi
          nvidia.com/gpu: 1

10.3 Monitoring and Alerting

10.3.1 Prometheus Monitoring Configuration

# argo-monitoring.yaml
# ServiceMonitor for Argo CD
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-metrics
  namespaceSelector:
    matchNames:
    - argo
  endpoints:
  - port: metrics
    interval: 30s
---
# ServiceMonitor for Argo Workflows
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-workflows-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: workflow-controller
  namespaceSelector:
    matchNames:
    - argo
  endpoints:
  - port: metrics
    interval: 30s
---
# ServiceMonitor for Argo Rollouts
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-rollouts-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argo-rollouts-metrics
  namespaceSelector:
    matchNames:
    - argo-rollouts
  endpoints:
  - port: metrics
    interval: 30s

10.3.2 Grafana Dashboard

{
  "dashboard": {
    "title": "Argo Overview",
    "panels": [
      {
        "title": "Argo CD - Application Sync Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(argocd_app_info{sync_status=\"Synced\"})",
            "legendFormat": "Synced"
          },
          {
            "expr": "sum(argocd_app_info{sync_status=\"OutOfSync\"})",
            "legendFormat": "Out of Sync"
          }
        ]
      },
      {
        "title": "Argo Workflows - Active Workflows",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(argo_workflows_count{status=\"Running\"})",
            "legendFormat": "Running"
          },
          {
            "expr": "sum(argo_workflows_count{status=\"Pending\"})",
            "legendFormat": "Pending"
          }
        ]
      },
      {
        "title": "Argo Rollouts - Rollout Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rollout_info{phase=\"Healthy\"})",
            "legendFormat": "Healthy"
          },
          {
            "expr": "sum(rollout_info{phase=\"Progressing\"})",
            "legendFormat": "Progressing"
          },
          {
            "expr": "sum(rollout_info{phase=\"Degraded\"})",
            "legendFormat": "Degraded"
          }
        ]
      }
    ]
  }
}

10.3.3 Alert Rules

# argo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argo-alerts
  namespace: monitoring
spec:
  groups:
  - name: argo-cd
    rules:
    - alert: ArgoCDAppOutOfSync
      expr: |
        argocd_app_info{sync_status="OutOfSync"} == 1
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Application {{ $labels.name }} is out of sync"
        description: "Application {{ $labels.name }} in project {{ $labels.project }} has been out of sync for more than 30 minutes."

    - alert: ArgoCDAppHealthDegraded
      expr: |
        argocd_app_info{health_status="Degraded"} == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Application {{ $labels.name }} health is degraded"
        description: "Application {{ $labels.name }} health status is Degraded."

    - alert: ArgoCDSyncFailed
      expr: |
        argocd_app_sync_total{phase="Failed"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Application {{ $labels.name }} sync failed"
        description: "Application {{ $labels.name }} sync operation failed."

  - name: argo-workflows
    rules:
    - alert: WorkflowFailed
      expr: |
        increase(argo_workflows_count{status="Failed"}[5m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Workflow failed in namespace {{ $labels.namespace }}"
        description: "One or more workflows have failed in the last 5 minutes."

    - alert: WorkflowStuck
      expr: |
        argo_workflows_count{status="Running"} > 0
        and
        (time() - argo_workflow_start_time) > 7200
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Workflow running for too long"
        description: "Workflow has been running for more than 2 hours."

    - alert: WorkflowQueueBacklog
      expr: |
        argo_workflows_count{status="Pending"} > 50
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Workflow queue backlog is high"
        description: "More than 50 workflows are pending execution."

  - name: argo-rollouts
    rules:
    - alert: RolloutFailed
      expr: |
        rollout_info{phase="Failed"} == 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Rollout {{ $labels.name }} has failed"
        description: "Rollout {{ $labels.name }} in namespace {{ $labels.namespace }} has failed."

    - alert: RolloutPaused
      expr: |
        rollout_info{phase="Paused"} == 1
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Rollout {{ $labels.name }} is paused"
        description: "Rollout {{ $labels.name }} has been paused for more than 1 hour."

    - alert: AnalysisRunFailed
      expr: |
        analysis_run_info{phase="Failed"} == 1
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Analysis run failed for rollout {{ $labels.rollout }}"
        description: "Analysis run has failed, rollback may be triggered."

10.4 Troubleshooting

10.4.1 Common Issue Diagnostics

#!/bin/bash
# argo-diagnose.sh - Argo diagnostic script

echo "=== Argo CD Diagnostics ==="

# Check Argo CD component status
echo "Checking Argo CD components..."
kubectl get pods -n argo -l app.kubernetes.io/part-of=argocd

# Check application sync status
echo "Checking application sync status..."
kubectl get applications -n argo -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status

# Check applications with sync errors
echo "Applications with sync errors..."
kubectl get applications -n argo -o json | jq -r '.items[] | select(.status.sync.status != "Synced") | "\(.metadata.name): \(.status.conditions[0].message)"'

echo "=== Argo Workflows Diagnostics ==="

# Check Workflow Controller
echo "Checking Workflow Controller..."
kubectl get pods -n argo -l app=workflow-controller
kubectl logs -n argo -l app=workflow-controller --tail=50

# Check failed workflows
echo "Failed workflows..."
kubectl get workflows -n argo --field-selector status.phase=Failed

# Check pending workflows
echo "Pending workflows..."
kubectl get workflows -n argo --field-selector status.phase=Pending

echo "=== Argo Rollouts Diagnostics ==="

# Check Rollout status
echo "Checking Rollouts..."
kubectl get rollouts -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase

# Check failed AnalysisRuns
echo "Failed AnalysisRuns..."
kubectl get analysisruns -A --field-selector status.phase=Failed

echo "=== Argo Events Diagnostics ==="

# Check EventBus
echo "Checking EventBus..."
kubectl get eventbus -n argo-events

# Check EventSource
echo "Checking EventSources..."
kubectl get eventsources -n argo-events

# Check Sensor
echo "Checking Sensors..."
kubectl get sensors -n argo-events
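The jq filter used for sync errors in the diagnostic script can be exercised offline against a fabricated Application list before running it on a live cluster; the sample JSON and the `demo` app name below are illustrative only.

```shell
# Offline test of the sync-error jq filter from the diagnostic script,
# using a hand-made sample instead of live cluster output.
SAMPLE='{"items":[{"metadata":{"name":"demo"},"status":{"sync":{"status":"OutOfSync"},"conditions":[{"message":"manifest error"}]}}]}'
RESULT=$(echo "$SAMPLE" | jq -r '.items[] | select(.status.sync.status != "Synced") | "\(.metadata.name): \(.status.conditions[0].message)"')
echo "$RESULT"
```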

10.4.2 Log Analysis

# Log collection configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Tag               argo.*
        Path              /var/log/containers/argo*.log
        Parser            docker
        DB                /var/log/flb_argo.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               argo.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On

    [OUTPUT]
        Name            es
        Match           argo.*
        Host            elasticsearch
        Port            9200
        Index           argo-logs
        Type            _doc

10.4.3 Common Problem Solutions

# Problem 1: Application stuck in "Progressing"
# Solution: Check resource health status

# Force refresh application
# argocd app get <app-name> --refresh

# Problem 2: Workflow Pod OOMKilled
# Solution: Increase resource limits
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memory-intensive-
spec:
  entrypoint: memory-intensive
  templates:
  - name: memory-intensive
    container:
      image: busybox
      resources:
        requests:
          memory: 2Gi
        limits:
          memory: 4Gi

# Problem 3: Rollout stuck at Paused
# Solution: Manually promote or cancel
# kubectl argo rollouts promote <rollout-name>
# kubectl argo rollouts abort <rollout-name>

# Problem 4: EventSource not receiving events
# Solution: Check network and port configuration
apiVersion: v1
kind: Service
metadata:
  name: webhook-eventsource-svc
  namespace: argo-events
spec:
  type: LoadBalancer  # or use Ingress
  ports:
  - port: 12000
    targetPort: 12000
  selector:
    eventsource-name: webhook

10.5 Backup and Recovery

10.5.1 Argo CD Backup

#!/bin/bash
# argocd-backup.sh

BACKUP_DIR="/backup/argocd/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR

# Backup all Applications
echo "Backing up Applications..."
kubectl get applications -n argo -o yaml > $BACKUP_DIR/applications.yaml

# Backup all AppProjects
echo "Backing up AppProjects..."
kubectl get appprojects -n argo -o yaml > $BACKUP_DIR/appprojects.yaml

# Backup ConfigMaps
echo "Backing up ConfigMaps..."
kubectl get configmaps -n argo -o yaml > $BACKUP_DIR/configmaps.yaml

# Backup Secrets (note security)
echo "Backing up Secrets..."
kubectl get secrets -n argo -o yaml > $BACKUP_DIR/secrets.yaml

# Backup RBAC configuration
echo "Backing up RBAC..."
kubectl get configmap argocd-rbac-cm -n argo -o yaml > $BACKUP_DIR/rbac.yaml

# Backup repository credentials
echo "Backing up repository credentials..."
kubectl get secrets -n argo -l argocd.argoproj.io/secret-type=repository -o yaml > $BACKUP_DIR/repo-creds.yaml

# Compress backup
tar -czf $BACKUP_DIR.tar.gz -C /backup/argocd $(date +%Y%m%d)

echo "Backup completed: $BACKUP_DIR.tar.gz"
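A matching restore path is worth keeping next to the backup script. The sketch below only echoes the commands it would run (a dry run); the date is illustrative, and the order matters: AppProjects must exist before the Applications that reference them.

```shell
# Dry-run restore sketch mirroring the backup script above.
BACKUP_DATE=20240101                    # illustrative; pick a real backup
BACKUP_DIR="/backup/argocd/$BACKUP_DATE"

restore() {
  # Echo instead of executing, so the plan can be reviewed first
  echo "kubectl apply -f $BACKUP_DIR/$1"
}

echo "tar -xzf $BACKUP_DIR.tar.gz -C /backup/argocd"
restore appprojects.yaml    # projects first: Applications reference them
restore applications.yaml
restore repo-creds.yaml
```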

10.5.2 Argo Workflows Backup

#!/bin/bash
# workflows-backup.sh

BACKUP_DIR="/backup/workflows/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR

# Backup WorkflowTemplates
echo "Backing up WorkflowTemplates..."
kubectl get workflowtemplates -A -o yaml > $BACKUP_DIR/workflowtemplates.yaml

# Backup ClusterWorkflowTemplates
echo "Backing up ClusterWorkflowTemplates..."
kubectl get clusterworkflowtemplates -o yaml > $BACKUP_DIR/clusterworkflowtemplates.yaml

# Backup CronWorkflows
echo "Backing up CronWorkflows..."
kubectl get cronworkflows -A -o yaml > $BACKUP_DIR/cronworkflows.yaml

# Backup ConfigMap
echo "Backing up ConfigMap..."
kubectl get configmap workflow-controller-configmap -n argo -o yaml > $BACKUP_DIR/controller-config.yaml

echo "Backup completed: $BACKUP_DIR"

10.5.3 Disaster Recovery Process

# disaster-recovery-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: disaster-recovery
  namespace: argo
spec:
  entrypoint: recovery-pipeline
  arguments:
    parameters:
    - name: backup-location
      value: "s3://backups/argo/latest"
  templates:
  - name: recovery-pipeline
    dag:
      tasks:
      - name: verify-backup
        template: verify-backup
      - name: restore-argocd
        dependencies: [verify-backup]
        template: restore-argocd
      - name: restore-workflows
        dependencies: [verify-backup]
        template: restore-workflows
      - name: restore-events
        dependencies: [verify-backup]
        template: restore-events
      - name: verify-recovery
        dependencies: [restore-argocd, restore-workflows, restore-events]
        template: verify-recovery

  - name: verify-backup
    container:
      image: amazon/aws-cli:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        aws s3 ls {{workflow.parameters.backup-location}} || exit 1
        echo "Backup verified"

  - name: restore-argocd
    container:
      # NOTE: bitnami/kubectl does not include the AWS CLI used below;
      # use an image that ships both kubectl and aws, or download the
      # backup in a separate step (the same applies to the other
      # restore templates)
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        # Download backup
        aws s3 cp {{workflow.parameters.backup-location}}/argocd/ /tmp/argocd/ --recursive

        # Restore AppProjects
        kubectl apply -f /tmp/argocd/appprojects.yaml

        # Restore Applications
        kubectl apply -f /tmp/argocd/applications.yaml

        # Restore repository credentials
        kubectl apply -f /tmp/argocd/repo-creds.yaml

        echo "Argo CD restored"

  - name: restore-workflows
    container:
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        aws s3 cp {{workflow.parameters.backup-location}}/workflows/ /tmp/workflows/ --recursive

        kubectl apply -f /tmp/workflows/workflowtemplates.yaml
        kubectl apply -f /tmp/workflows/clusterworkflowtemplates.yaml
        kubectl apply -f /tmp/workflows/cronworkflows.yaml

        echo "Argo Workflows restored"

  - name: restore-events
    container:
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        aws s3 cp {{workflow.parameters.backup-location}}/events/ /tmp/events/ --recursive

        kubectl apply -f /tmp/events/

        echo "Argo Events restored"

  - name: verify-recovery
    container:
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        # Verify Argo CD
        kubectl get applications -n argo
        # argocd app list  # requires the argocd CLI, which bitnami/kubectl does not ship

        # Verify Workflows
        kubectl get workflowtemplates -A
        kubectl get cronworkflows -A

        # Verify Events
        kubectl get eventsources -n argo-events
        kubectl get sensors -n argo-events

        echo "Recovery verified"

10.6 Operations Checklist

10.6.1 Daily Maintenance Tasks

# daily-maintenance-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: daily-maintenance
  namespace: argo
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  workflowSpec:
    entrypoint: maintenance
    templates:
    - name: maintenance
      dag:
        tasks:
        - name: cleanup-workflows
          template: cleanup-workflows
        - name: check-health
          template: health-check
        - name: backup
          template: backup
          dependencies: [check-health]
        - name: report
          template: generate-report
          dependencies: [cleanup-workflows, backup]

    - name: cleanup-workflows
      container:
        image: bitnami/kubectl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Delete succeeded workflows that the controller has labeled
          # complete (the original "!workflows.argoproj.io/completed"
          # selector would match only workflows that are NOT yet complete)
          kubectl delete workflows -n argo \
            --selector "workflows.argoproj.io/phase=Succeeded,workflows.argoproj.io/completed=true"

          # Clean failed workflows older than 30 days
          kubectl get workflows -n argo -o json | \
            jq -r '.items[] | select(.status.phase=="Failed") | select(.status.finishedAt | fromdateiso8601 < (now - 2592000)) | .metadata.name' | \
            xargs -r kubectl delete workflow -n argo

    - name: health-check
      container:
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Check Argo CD
          curl -f http://argocd-server.argo:80/healthz || exit 1

          # Check Argo Server
          curl -f http://argo-server.argo:2746/api/v1/info || exit 1

          echo "All services healthy"

    - name: backup
      container:
        # NOTE: amazon/aws-cli does not include kubectl, which the
        # script below relies on; use an image providing both tools
        image: amazon/aws-cli:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          DATE=$(date +%Y%m%d)
          kubectl get applications,appprojects -n argo -o yaml | \
            aws s3 cp - s3://backups/argo/$DATE/argocd.yaml

          kubectl get workflowtemplates,cronworkflows -A -o yaml | \
            aws s3 cp - s3://backups/argo/$DATE/workflows.yaml

    - name: generate-report
      container:
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Send daily report to Slack (SLACK_WEBHOOK is not defined in
          # this template; inject it via env from a Secret)
          curl -X POST "$SLACK_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d '{
              "text": "Daily Maintenance Report",
              "attachments": [{
                "color": "good",
                "text": "All maintenance tasks completed successfully"
              }]
            }'
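The 2592000-second threshold in the cleanup task's jq filter is a magic number; a quick check confirms it is exactly the 30-day retention window the comment describes.

```shell
# Confirm that the jq age threshold used in the cleanup task
# (2592000 seconds) is exactly 30 days.
THIRTY_DAYS=$((30 * 24 * 3600))
echo "$THIRTY_DAYS"
```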

10.6.2 Upgrade Checklist

# Argo Upgrade Checklist

## Pre-Upgrade Preparation
- [ ] Backup all configurations and data
- [ ] Check version compatibility and changelog
- [ ] Validate upgrade in test environment
- [ ] Notify relevant teams

## Upgrade Steps
- [ ] Pause auto-sync
- [ ] Upgrade CRDs
- [ ] Upgrade controller
- [ ] Upgrade UI/Server
- [ ] Verify functionality

## Post-Upgrade Verification
- [ ] Check all component status
- [ ] Verify Application sync
- [ ] Verify Workflow execution
- [ ] Verify Rollout functionality
- [ ] Check monitoring and alerting
- [ ] Resume auto-sync
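The "pause auto-sync" and "resume auto-sync" steps can be scripted with the argocd CLI's `--sync-policy` flag. The sketch below is a dry run that only echoes the commands; it assumes a logged-in argocd CLI, and "my-app" is an illustrative name.

```shell
# Dry-run sketch for pausing/resuming auto-sync around an upgrade.
pause_autosync()  { echo "argocd app set $1 --sync-policy none"; }
resume_autosync() { echo "argocd app set $1 --sync-policy automated"; }

pause_autosync my-app
# ... perform the upgrade steps from the checklist above ...
resume_autosync my-app
```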

10.7 Chapter Summary

This chapter summarized the best practices for the Argo ecosystem in production environments:

(Diagram: production best-practices overview — Mermaid chart not rendered in this export.)

Key Points:

  1. High Availability: Use multi-replicas, anti-affinity, and leader election to ensure availability
  2. Performance Optimization: Properly configure parallelism limits, resource limits, and caching
  3. Monitoring & Alerting: Establish comprehensive monitoring and alerting system
  4. Troubleshooting: Prepare diagnostic tools and common problem solutions
  5. Backup & Recovery: Regular backups and test recovery processes
  6. Daily Operations: Automate daily maintenance tasks

Course Summary

Through this course, you have mastered the core components and best practices of the Argo ecosystem:

  1. Argo CD: GitOps continuous deployment
  2. Argo Workflows: Cloud-native workflow engine
  3. Argo Rollouts: Progressive delivery
  4. Argo Events: Event-driven automation

By combining these components, you can build a complete cloud-native CI/CD platform, enabling fully automated processes from code commit to production deployment.

Best wishes on your cloud-native journey!