Chapter 10: Production Environment Best Practices

Summary of Argo ecosystem best practices in production environments, including high availability deployment, performance optimization, troubleshooting, and operations guide


This chapter summarizes best practices for running the Argo ecosystem in production, helping you build a stable, efficient, and maintainable cloud-native CI/CD platform.

10.1 High Availability Deployment

10.1.1 Argo CD High Availability Architecture

(Diagram: Argo CD high-availability architecture — Mermaid chart not rendered in this export.)

10.1.2 High Availability Deployment Configuration

# argocd-ha-install.yaml
# High Availability Argo CD Deployment
# kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# Custom high availability configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: argocd-server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: argocd-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2Gi
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 30
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 30
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-repo-server
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: argocd-repo-server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: argocd-repo-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2Gi
---
# Application Controller uses StatefulSet for sharding
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argo
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-application-controller
  template:
    spec:
      containers:
      - name: argocd-application-controller
        env:
        # Enable controller sharding
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "2"
        resources:
          requests:
            cpu: 1
            memory: 1Gi
          limits:
            cpu: 4
            memory: 4Gi
---
# Redis high availability
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-redis-ha
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-redis-ha
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: argocd-redis-ha
            topologyKey: kubernetes.io/hostname
      containers:
      - name: redis
        image: redis:7-alpine
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

10.1.3 Argo Workflows High Availability

# workflows-ha.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: workflow-controller
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: workflow-controller
              topologyKey: kubernetes.io/hostname
      containers:
      - name: workflow-controller
        args:
        - --configmap
        - workflow-controller-configmap
        - --executor-image
        - quay.io/argoproj/argoexec:latest  # pin to the controller's release version in production
        # Leader election
        - --leader-elect
        env:
        - name: LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-server
  namespace: argo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: argo-server
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: argo-server
              topologyKey: kubernetes.io/hostname
      containers:
      - name: argo-server
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 1
            memory: 1Gi

10.1.4 EventBus High Availability

# eventbus-ha.yaml
apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata:
  name: default
  namespace: argo-events
spec:
  jetstream:
    version: "2.9.15"
    replicas: 3
    persistence:
      storageClassName: fast-ssd
      accessMode: ReadWriteOnce
      volumeSize: 20Gi
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              controller: eventbus-controller
          topologyKey: kubernetes.io/hostname
    containerTemplate:
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 2
          memory: 2Gi
    streamConfig: |
      max_msgs: 1000000
      max_bytes: 10737418240
      max_age: 168h
      replicas: 3
      retention: limits
      storage: file

10.2 Performance Optimization

10.2.1 Argo CD Performance Tuning

# argocd-performance-tuning.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argo
data:
  # NOTE: in recent Argo CD releases the controller.* and reposerver.*
  # tuning keys are read from argocd-cmd-params-cm (configured further
  # below), not from argocd-cm
  # Parallel sync limit
  controller.sync.concurrency.limit: "50"

  # Status cache
  controller.status.processors: "20"
  controller.operation.processors: "10"

  # Repo server parallelism limit
  reposerver.parallelism.limit: "200"

  # Timeout settings
  timeout.reconciliation: "180s"
  timeout.hard.reconciliation: "0"

  # Resource health check interval
  resource.healthcheck.interval: "10s"

  # Resource comparison options
  resource.compareoptions: |
    ignoreAggregatedRoles: true
    ignoreResourceStatusField: all

  # Exclude resources that don't need tracking
  resource.exclusions: |
    - apiGroups:
      - "cilium.io"
      kinds:
      - CiliumIdentity
      clusters:
      - "*"
    - apiGroups:
      - ""
      kinds:
      - "Event"
      clusters:
      - "*"

  # Repository credentials template
  repository.credentials: |
    - url: https://github.com/myorg
      usernameSecret:
        name: github-creds
        key: username
      passwordSecret:
        name: github-creds
        key: password
---
# Application Controller tuning
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argo
data:
  # Controller settings
  controller.status.processors: "50"
  controller.operation.processors: "25"
  controller.self.heal.timeout.seconds: "5"
  controller.repo.server.timeout.seconds: "180"

  # Repo Server settings
  reposerver.parallelism.limit: "100"
  reposerver.enable.git.submodule: "false"

  # Server settings
  server.insecure: "false"
  server.enable.gzip: "true"
  server.x.frame.options: "sameorigin"

10.2.2 Argo Workflows Performance Tuning

# workflows-performance.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    # Parallelism limit
    parallelism: 100

    # Namespace parallelism limit
    namespaceParallelism: 50

    # Resource rate limit
    resourceRateLimit:
      limit: 500
      burst: 1000

    # Pod GC
    podGCGracePeriodSeconds: 30
    podGCDeleteDelayDuration: 5s

    # Workflow archive
    persistence:
      archive: true
      archiveTTL: 7d
      postgresql:
        host: postgres
        port: 5432
        database: argo
        tableName: argo_workflows
        userNameSecret:
          name: argo-postgres-config
          key: username
        passwordSecret:
          name: argo-postgres-config
          key: password

    # Node status offloading
    nodeStatusOffLoad: true

    # Executor (wait sidecar) resources
    executorResources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 512Mi

    # Workflow defaults
    workflowDefaults:
      spec:
        ttlStrategy:
          secondsAfterCompletion: 3600
          secondsAfterSuccess: 1800
          secondsAfterFailure: 7200
        podGC:
          strategy: OnPodSuccess
          deleteDelayDuration: 60s
        # Default container settings applied to all templates
        templateDefaults:
          container:
            imagePullPolicy: IfNotPresent

    # Metrics
    metricsConfig:
      enabled: true
      path: /metrics
      port: 9090
      metricsTTL: 10m

    # Link configuration
    links:
    - name: Workflow Logs
      scope: workflow
      url: https://grafana.example.com/d/workflow-logs?var-workflow=${metadata.name}

10.2.3 Resource Limit Best Practices

# resource-limits.yaml
# Argo CD resource configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: argocd-server
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 2
            memory: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: argocd-repo-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 4
            memory: 4Gi
---
# Workflow Pod resource templates
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: resource-template
  namespace: argo
spec:
  templates:
  - name: small-task
    container:
      image: busybox
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi

  - name: medium-task
    container:
      image: busybox
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 2
          memory: 2Gi

  - name: large-task
    container:
      image: busybox
      resources:
        requests:
          cpu: 2
          memory: 2Gi
        limits:
          cpu: 4
          memory: 8Gi

  - name: gpu-task
    container:
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # pin a concrete CUDA tag; avoid "latest"
      resources:
        requests:
          cpu: 2
          memory: 4Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 4
          memory: 8Gi
          nvidia.com/gpu: 1

10.3 Monitoring and Alerting

10.3.1 Prometheus Monitoring Configuration

# argo-monitoring.yaml
# ServiceMonitor for Argo CD
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-metrics
  namespaceSelector:
    matchNames:
    - argo
  endpoints:
  - port: metrics
    interval: 30s
---
# ServiceMonitor for Argo Workflows
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-workflows-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: workflow-controller
  namespaceSelector:
    matchNames:
    - argo
  endpoints:
  - port: metrics
    interval: 30s
---
# ServiceMonitor for Argo Rollouts
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-rollouts-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argo-rollouts-metrics
  namespaceSelector:
    matchNames:
    - argo-rollouts
  endpoints:
  - port: metrics
    interval: 30s

10.3.2 Grafana Dashboard

{
  "dashboard": {
    "title": "Argo Overview",
    "panels": [
      {
        "title": "Argo CD - Application Sync Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(argocd_app_info{sync_status=\"Synced\"})",
            "legendFormat": "Synced"
          },
          {
            "expr": "sum(argocd_app_info{sync_status=\"OutOfSync\"})",
            "legendFormat": "Out of Sync"
          }
        ]
      },
      {
        "title": "Argo Workflows - Active Workflows",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(argo_workflows_count{status=\"Running\"})",
            "legendFormat": "Running"
          },
          {
            "expr": "sum(argo_workflows_count{status=\"Pending\"})",
            "legendFormat": "Pending"
          }
        ]
      },
      {
        "title": "Argo Rollouts - Rollout Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rollout_info{phase=\"Healthy\"})",
            "legendFormat": "Healthy"
          },
          {
            "expr": "sum(rollout_info{phase=\"Progressing\"})",
            "legendFormat": "Progressing"
          },
          {
            "expr": "sum(rollout_info{phase=\"Degraded\"})",
            "legendFormat": "Degraded"
          }
        ]
      }
    ]
  }
}

10.3.3 Alert Rules

# argo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argo-alerts
  namespace: monitoring
spec:
  groups:
  - name: argo-cd
    rules:
    - alert: ArgoCDAppOutOfSync
      expr: |
        argocd_app_info{sync_status="OutOfSync"} == 1
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Application {{ $labels.name }} is out of sync"
        description: "Application {{ $labels.name }} in project {{ $labels.project }} has been out of sync for more than 30 minutes."

    - alert: ArgoCDAppHealthDegraded
      expr: |
        argocd_app_info{health_status="Degraded"} == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Application {{ $labels.name }} health is degraded"
        description: "Application {{ $labels.name }} health status is Degraded."

    - alert: ArgoCDSyncFailed
      expr: |
        argocd_app_sync_total{phase="Failed"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Application {{ $labels.name }} sync failed"
        description: "Application {{ $labels.name }} sync operation failed."

  - name: argo-workflows
    rules:
    - alert: WorkflowFailed
      expr: |
        increase(argo_workflows_count{status="Failed"}[5m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Workflow failed in namespace {{ $labels.namespace }}"
        description: "One or more workflows have failed in the last 5 minutes."

    - alert: WorkflowStuck
      expr: |
        argo_workflows_count{status="Running"} > 0
        and
        (time() - argo_workflow_start_time) > 7200
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Workflow running for too long"
        description: "Workflow has been running for more than 2 hours."

    - alert: WorkflowQueueBacklog
      expr: |
        argo_workflows_count{status="Pending"} > 50
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Workflow queue backlog is high"
        description: "More than 50 workflows are pending execution."

  - name: argo-rollouts
    rules:
    - alert: RolloutFailed
      expr: |
        rollout_info{phase="Failed"} == 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Rollout {{ $labels.name }} has failed"
        description: "Rollout {{ $labels.name }} in namespace {{ $labels.namespace }} has failed."

    - alert: RolloutPaused
      expr: |
        rollout_info{phase="Paused"} == 1
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Rollout {{ $labels.name }} is paused"
        description: "Rollout {{ $labels.name }} has been paused for more than 1 hour."

    - alert: AnalysisRunFailed
      expr: |
        analysis_run_info{phase="Failed"} == 1
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Analysis run failed for rollout {{ $labels.rollout }}"
        description: "Analysis run has failed, rollback may be triggered."

10.4 Troubleshooting

10.4.1 Common Issue Diagnostics

#!/bin/bash
# argo-diagnose.sh - Argo diagnostic script

echo "=== Argo CD Diagnostics ==="

# Check Argo CD component status
echo "Checking Argo CD components..."
kubectl get pods -n argo -l app.kubernetes.io/part-of=argocd

# Check application sync status
echo "Checking application sync status..."
kubectl get applications -n argo -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status

# Check applications with sync errors
echo "Applications with sync errors..."
kubectl get applications -n argo -o json | jq -r '.items[] | select(.status.sync.status != "Synced") | "\(.metadata.name): \(.status.conditions[0].message)"'

echo "=== Argo Workflows Diagnostics ==="

# Check Workflow Controller
echo "Checking Workflow Controller..."
kubectl get pods -n argo -l app=workflow-controller
kubectl logs -n argo -l app=workflow-controller --tail=50

# Check failed workflows
echo "Failed workflows..."
kubectl get workflows -n argo --field-selector status.phase=Failed

# Check pending workflows
echo "Pending workflows..."
kubectl get workflows -n argo --field-selector status.phase=Pending

echo "=== Argo Rollouts Diagnostics ==="

# Check Rollout status
echo "Checking Rollouts..."
kubectl get rollouts -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase

# Check failed AnalysisRuns
echo "Failed AnalysisRuns..."
kubectl get analysisruns -A --field-selector status.phase=Failed

echo "=== Argo Events Diagnostics ==="

# Check EventBus
echo "Checking EventBus..."
kubectl get eventbus -n argo-events

# Check EventSource
echo "Checking EventSources..."
kubectl get eventsources -n argo-events

# Check Sensor
echo "Checking Sensors..."
kubectl get sensors -n argo-events
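The jq filter used for sync errors in the diagnostic script can be exercised offline against a fabricated Application list before running it on a live cluster; the sample JSON and the `demo` app name below are illustrative only.

```shell
# Offline test of the sync-error jq filter from the diagnostic script,
# using a hand-made sample instead of live cluster output.
SAMPLE='{"items":[{"metadata":{"name":"demo"},"status":{"sync":{"status":"OutOfSync"},"conditions":[{"message":"manifest error"}]}}]}'
RESULT=$(echo "$SAMPLE" | jq -r '.items[] | select(.status.sync.status != "Synced") | "\(.metadata.name): \(.status.conditions[0].message)"')
echo "$RESULT"
```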

10.4.2 Log Analysis

# Log collection configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Tag               argo.*
        Path              /var/log/containers/argo*.log
        Parser            docker
        DB                /var/log/flb_argo.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               argo.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On

    [OUTPUT]
        Name            es
        Match           argo.*
        Host            elasticsearch
        Port            9200
        Index           argo-logs
        Type            _doc

10.4.3 Common Problem Solutions

# Problem 1: Application stuck in "Progressing"
# Solution: Check resource health status

# Force refresh application
# argocd app get <app-name> --refresh

# Problem 2: Workflow Pod OOMKilled
# Solution: Increase resource limits
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memory-intensive-
spec:
  entrypoint: memory-intensive
  templates:
  - name: memory-intensive
    container:
      image: busybox
      resources:
        requests:
          memory: 2Gi
        limits:
          memory: 4Gi

# Problem 3: Rollout stuck at Paused
# Solution: Manually promote or cancel
# kubectl argo rollouts promote <rollout-name>
# kubectl argo rollouts abort <rollout-name>

# Problem 4: EventSource not receiving events
# Solution: Check network and port configuration
apiVersion: v1
kind: Service
metadata:
  name: webhook-eventsource-svc
  namespace: argo-events
spec:
  type: LoadBalancer  # or use Ingress
  ports:
  - port: 12000
    targetPort: 12000
  selector:
    eventsource-name: webhook

10.5 Backup and Recovery

10.5.1 Argo CD Backup

#!/bin/bash
# argocd-backup.sh

BACKUP_DIR="/backup/argocd/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR

# Backup all Applications
echo "Backing up Applications..."
kubectl get applications -n argo -o yaml > $BACKUP_DIR/applications.yaml

# Backup all AppProjects
echo "Backing up AppProjects..."
kubectl get appprojects -n argo -o yaml > $BACKUP_DIR/appprojects.yaml

# Backup ConfigMaps
echo "Backing up ConfigMaps..."
kubectl get configmaps -n argo -o yaml > $BACKUP_DIR/configmaps.yaml

# Backup Secrets (note security)
echo "Backing up Secrets..."
kubectl get secrets -n argo -o yaml > $BACKUP_DIR/secrets.yaml

# Backup RBAC configuration
echo "Backing up RBAC..."
kubectl get configmap argocd-rbac-cm -n argo -o yaml > $BACKUP_DIR/rbac.yaml

# Backup repository credentials
echo "Backing up repository credentials..."
kubectl get secrets -n argo -l argocd.argoproj.io/secret-type=repository -o yaml > $BACKUP_DIR/repo-creds.yaml

# Compress backup
tar -czf $BACKUP_DIR.tar.gz -C /backup/argocd $(date +%Y%m%d)

echo "Backup completed: $BACKUP_DIR.tar.gz"
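A matching restore path is worth keeping next to the backup script. The sketch below only echoes the commands it would run (a dry run); the date is illustrative, and the order matters: AppProjects must exist before the Applications that reference them.

```shell
# Dry-run restore sketch mirroring the backup script above.
BACKUP_DATE=20240101                    # illustrative; pick a real backup
BACKUP_DIR="/backup/argocd/$BACKUP_DATE"

restore() {
  # Echo instead of executing, so the plan can be reviewed first
  echo "kubectl apply -f $BACKUP_DIR/$1"
}

echo "tar -xzf $BACKUP_DIR.tar.gz -C /backup/argocd"
restore appprojects.yaml    # projects first: Applications reference them
restore applications.yaml
restore repo-creds.yaml
```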

10.5.2 Argo Workflows Backup

#!/bin/bash
# workflows-backup.sh

BACKUP_DIR="/backup/workflows/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR

# Backup WorkflowTemplates
echo "Backing up WorkflowTemplates..."
kubectl get workflowtemplates -A -o yaml > $BACKUP_DIR/workflowtemplates.yaml

# Backup ClusterWorkflowTemplates
echo "Backing up ClusterWorkflowTemplates..."
kubectl get clusterworkflowtemplates -o yaml > $BACKUP_DIR/clusterworkflowtemplates.yaml

# Backup CronWorkflows
echo "Backing up CronWorkflows..."
kubectl get cronworkflows -A -o yaml > $BACKUP_DIR/cronworkflows.yaml

# Backup ConfigMap
echo "Backing up ConfigMap..."
kubectl get configmap workflow-controller-configmap -n argo -o yaml > $BACKUP_DIR/controller-config.yaml

echo "Backup completed: $BACKUP_DIR"

10.5.3 Disaster Recovery Process

# disaster-recovery-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: disaster-recovery
  namespace: argo
spec:
  entrypoint: recovery-pipeline
  arguments:
    parameters:
    - name: backup-location
      value: "s3://backups/argo/latest"
  templates:
  - name: recovery-pipeline
    dag:
      tasks:
      - name: verify-backup
        template: verify-backup
      - name: restore-argocd
        dependencies: [verify-backup]
        template: restore-argocd
      - name: restore-workflows
        dependencies: [verify-backup]
        template: restore-workflows
      - name: restore-events
        dependencies: [verify-backup]
        template: restore-events
      - name: verify-recovery
        dependencies: [restore-argocd, restore-workflows, restore-events]
        template: verify-recovery

  - name: verify-backup
    container:
      image: amazon/aws-cli:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        aws s3 ls {{workflow.parameters.backup-location}} || exit 1
        echo "Backup verified"

  - name: restore-argocd
    container:
      # NOTE: bitnami/kubectl does not include the AWS CLI used below;
      # use an image that ships both kubectl and aws, or download the
      # backup in a separate step (the same applies to the other
      # restore templates)
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        # Download backup
        aws s3 cp {{workflow.parameters.backup-location}}/argocd/ /tmp/argocd/ --recursive

        # Restore AppProjects
        kubectl apply -f /tmp/argocd/appprojects.yaml

        # Restore Applications
        kubectl apply -f /tmp/argocd/applications.yaml

        # Restore repository credentials
        kubectl apply -f /tmp/argocd/repo-creds.yaml

        echo "Argo CD restored"

  - name: restore-workflows
    container:
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        aws s3 cp {{workflow.parameters.backup-location}}/workflows/ /tmp/workflows/ --recursive

        kubectl apply -f /tmp/workflows/workflowtemplates.yaml
        kubectl apply -f /tmp/workflows/clusterworkflowtemplates.yaml
        kubectl apply -f /tmp/workflows/cronworkflows.yaml

        echo "Argo Workflows restored"

  - name: restore-events
    container:
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        aws s3 cp {{workflow.parameters.backup-location}}/events/ /tmp/events/ --recursive

        kubectl apply -f /tmp/events/

        echo "Argo Events restored"

  - name: verify-recovery
    container:
      image: bitnami/kubectl:latest
      command: ["/bin/sh", "-c"]
      args:
      - |
        # Verify Argo CD
        kubectl get applications -n argo
        # argocd app list  # requires the argocd CLI, which bitnami/kubectl does not ship

        # Verify Workflows
        kubectl get workflowtemplates -A
        kubectl get cronworkflows -A

        # Verify Events
        kubectl get eventsources -n argo-events
        kubectl get sensors -n argo-events

        echo "Recovery verified"

10.6 Operations Checklist

10.6.1 Daily Maintenance Tasks

# daily-maintenance-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: daily-maintenance
  namespace: argo
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  workflowSpec:
    entrypoint: maintenance
    templates:
    - name: maintenance
      dag:
        tasks:
        - name: cleanup-workflows
          template: cleanup-workflows
        - name: check-health
          template: health-check
        - name: backup
          template: backup
          dependencies: [check-health]
        - name: report
          template: generate-report
          dependencies: [cleanup-workflows, backup]

    - name: cleanup-workflows
      container:
        image: bitnami/kubectl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Delete succeeded workflows that the controller has labeled
          # complete (the original "!workflows.argoproj.io/completed"
          # selector would match only workflows that are NOT yet complete)
          kubectl delete workflows -n argo \
            --selector "workflows.argoproj.io/phase=Succeeded,workflows.argoproj.io/completed=true"

          # Clean failed workflows older than 30 days
          kubectl get workflows -n argo -o json | \
            jq -r '.items[] | select(.status.phase=="Failed") | select(.status.finishedAt | fromdateiso8601 < (now - 2592000)) | .metadata.name' | \
            xargs -r kubectl delete workflow -n argo

    - name: health-check
      container:
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Check Argo CD
          curl -f http://argocd-server.argo:80/healthz || exit 1

          # Check Argo Server
          curl -f http://argo-server.argo:2746/api/v1/info || exit 1

          echo "All services healthy"

    - name: backup
      container:
        # NOTE: amazon/aws-cli does not include kubectl, which the
        # script below relies on; use an image providing both tools
        image: amazon/aws-cli:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          DATE=$(date +%Y%m%d)
          kubectl get applications,appprojects -n argo -o yaml | \
            aws s3 cp - s3://backups/argo/$DATE/argocd.yaml

          kubectl get workflowtemplates,cronworkflows -A -o yaml | \
            aws s3 cp - s3://backups/argo/$DATE/workflows.yaml

    - name: generate-report
      container:
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Send daily report to Slack (SLACK_WEBHOOK is not defined in
          # this template; inject it via env from a Secret)
          curl -X POST "$SLACK_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d '{
              "text": "Daily Maintenance Report",
              "attachments": [{
                "color": "good",
                "text": "All maintenance tasks completed successfully"
              }]
            }'
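The 2592000-second threshold in the cleanup task's jq filter is a magic number; a quick check confirms it is exactly the 30-day retention window the comment describes.

```shell
# Confirm that the jq age threshold used in the cleanup task
# (2592000 seconds) is exactly 30 days.
THIRTY_DAYS=$((30 * 24 * 3600))
echo "$THIRTY_DAYS"
```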

10.6.2 Upgrade Checklist

# Argo Upgrade Checklist

## Pre-Upgrade Preparation
- [ ] Backup all configurations and data
- [ ] Check version compatibility and changelog
- [ ] Validate upgrade in test environment
- [ ] Notify relevant teams

## Upgrade Steps
- [ ] Pause auto-sync
- [ ] Upgrade CRDs
- [ ] Upgrade controller
- [ ] Upgrade UI/Server
- [ ] Verify functionality

## Post-Upgrade Verification
- [ ] Check all component status
- [ ] Verify Application sync
- [ ] Verify Workflow execution
- [ ] Verify Rollout functionality
- [ ] Check monitoring and alerting
- [ ] Resume auto-sync
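The "pause auto-sync" and "resume auto-sync" steps can be scripted with the argocd CLI's `--sync-policy` flag. The sketch below is a dry run that only echoes the commands; it assumes a logged-in argocd CLI, and "my-app" is an illustrative name.

```shell
# Dry-run sketch for pausing/resuming auto-sync around an upgrade.
pause_autosync()  { echo "argocd app set $1 --sync-policy none"; }
resume_autosync() { echo "argocd app set $1 --sync-policy automated"; }

pause_autosync my-app
# ... perform the upgrade steps from the checklist above ...
resume_autosync my-app
```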

10.7 Chapter Summary

This chapter summarized the best practices for the Argo ecosystem in production environments:

(Diagram: production best-practices overview — Mermaid chart not rendered in this export.)

Key Points:

  1. High Availability: Use multi-replicas, anti-affinity, and leader election to ensure availability
  2. Performance Optimization: Properly configure parallelism limits, resource limits, and caching
  3. Monitoring & Alerting: Establish comprehensive monitoring and alerting system
  4. Troubleshooting: Prepare diagnostic tools and common problem solutions
  5. Backup & Recovery: Regular backups and test recovery processes
  6. Daily Operations: Automate daily maintenance tasks

Course Summary

Through this course, you have mastered the core components and best practices of the Argo ecosystem:

  1. Argo CD: GitOps continuous deployment
  2. Argo Workflows: Cloud-native workflow engine
  3. Argo Rollouts: Progressive delivery
  4. Argo Events: Event-driven automation

By combining these components, you can build a complete cloud-native CI/CD platform, enabling fully automated processes from code commit to production deployment.

Best wishes on your cloud-native journey!