Chapter 10: Production Environment Best Practices
Summary of Argo ecosystem best practices in production environments, including high availability deployment, performance optimization, troubleshooting, and day-to-day operations.
This chapter summarizes best practices for running the Argo ecosystem in production, helping you build a stable, efficient, and maintainable cloud-native CI/CD platform.
10.1 High Availability Deployment
10.1.1 Argo CD High Availability Architecture
In an HA topology, argocd-server and argocd-repo-server run as multi-replica Deployments spread across nodes with pod anti-affinity, the application controller runs as a sharded StatefulSet, and the single Redis instance is replaced by an HA Redis setup. The manifests in the next section illustrate this layout.
10.1.2 High Availability Deployment Configuration
# argocd-ha-install.yaml
# High Availability Argo CD Deployment
# kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
# Custom high availability configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: argocd-server
namespace: argo
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: argocd-server
template:
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/name: argocd-server
topologyKey: kubernetes.io/hostname
containers:
- name: argocd-server
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2
memory: 2Gi
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 30
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 30
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: argocd-repo-server
namespace: argo
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: argocd-repo-server
template:
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/name: argocd-repo-server
topologyKey: kubernetes.io/hostname
containers:
- name: argocd-repo-server
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2
memory: 2Gi
---
# Application Controller uses StatefulSet for sharding
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: argocd-application-controller
namespace: argo
spec:
replicas: 2
selector:
matchLabels:
app.kubernetes.io/name: argocd-application-controller
template:
spec:
containers:
- name: argocd-application-controller
env:
# Enable controller sharding
- name: ARGOCD_CONTROLLER_REPLICAS
value: "2"
resources:
requests:
cpu: 1
memory: 1Gi
limits:
cpu: 4
memory: 4Gi
---
# Redis high availability
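# Note: the official ha/install.yaml ships a Redis HA setup (argocd-redis-ha with
# Sentinel and an HAProxy front end); the simplified StatefulSet below only
# illustrates replica placement and does not by itself provide automatic failover.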
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: argocd-redis-ha
namespace: argo
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: argocd-redis-ha
template:
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/name: argocd-redis-ha
topologyKey: kubernetes.io/hostname
containers:
- name: redis
image: redis:7-alpine
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
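After applying the HA manifests, a quick sanity check is to confirm that each component has the expected replica count and that pods landed on different nodes (a minimal sketch; it assumes the argo namespace used throughout this chapter):
# Replica counts for the HA components
kubectl get deploy,statefulset -n argo
# Node spread of the Argo CD pods (anti-affinity should place them on different nodes)
kubectl get pods -n argo -o wide -l app.kubernetes.io/part-of=argocd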
10.1.3 Argo Workflows High Availability
# workflows-ha.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: workflow-controller
namespace: argo
spec:
replicas: 2
selector:
matchLabels:
app: workflow-controller
template:
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: workflow-controller
topologyKey: kubernetes.io/hostname
containers:
- name: workflow-controller
args:
- --configmap
- workflow-controller-configmap
- --executor-image
- quay.io/argoproj/argoexec:latest # pin to the argoexec version matching the controller in production; avoid latest
# Leader election is enabled automatically when multiple replicas run;
# the identity is taken from the pod name via LEADER_ELECTION_IDENTITY below
env:
- name: LEADER_ELECTION_IDENTITY
valueFrom:
fieldRef:
fieldPath: metadata.name
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2
memory: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: argo-server
namespace: argo
spec:
replicas: 3
selector:
matchLabels:
app: argo-server
template:
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: argo-server
topologyKey: kubernetes.io/hostname
containers:
- name: argo-server
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1
memory: 1Gi
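With two workflow-controller replicas, only the elected leader actively reconciles workflows. A hedged way to see which pod currently holds the lock (assuming the controller uses a coordination Lease named workflow-controller, as recent releases do):
# Show which workflow-controller pod holds the leader lease
kubectl get lease workflow-controller -n argo -o jsonpath='{.spec.holderIdentity}{"\n"}'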
10.1.4 EventBus High Availability
# eventbus-ha.yaml
apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata:
name: default
namespace: argo-events
spec:
jetstream:
version: "2.9.15"
replicas: 3
persistence:
storageClassName: fast-ssd
accessMode: ReadWriteOnce
volumeSize: 20Gi
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
controller: eventbus-controller
topologyKey: kubernetes.io/hostname
containerTemplate:
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2
memory: 2Gi
streamConfig: |
maxMsgs: 1000000
maxBytes: 10737418240
maxAge: 168h
replicas: 3
# retention: limits and storage: file are the JetStream defaults
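Once the EventBus is applied, verify that the three JetStream pods are running and that the bus reports itself deployed (a sketch; eventbus-name is the label the Argo Events controller puts on the pods it creates):
# JetStream pods backing the default EventBus
kubectl get pods -n argo-events -l eventbus-name=default
# Conditions reported on the EventBus resource
kubectl get eventbus default -n argo-events -o jsonpath='{.status.conditions}{"\n"}'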
10.2 Performance Optimization
10.2.1 Argo CD Performance Tuning
# argocd-performance-tuning.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-cm
namespace: argo
data:
# Reconciliation timeouts (controller/repo-server processor counts and
# parallelism limits are not argocd-cm keys; they are set in argocd-cmd-params-cm below)
timeout.reconciliation: "180s"
timeout.hard.reconciliation: "0"
# Resource comparison options
resource.compareoptions: |
ignoreAggregatedRoles: true
ignoreResourceStatusField: all
# Exclude resources that don't need tracking
resource.exclusions: |
- apiGroups:
- "cilium.io"
kinds:
- CiliumIdentity
clusters:
- "*"
- apiGroups:
- ""
kinds:
- "Event"
clusters:
- "*"
# Repository credentials template
repository.credentials: |
- url: https://github.com/myorg
usernameSecret:
name: github-creds
key: username
passwordSecret:
name: github-creds
key: password
---
# Application Controller tuning
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-cmd-params-cm
namespace: argo
data:
# Controller settings
controller.status.processors: "50"
controller.operation.processors: "25"
controller.self.heal.timeout.seconds: "5"
controller.repo.server.timeout.seconds: "180"
# Repo Server settings
reposerver.parallelism.limit: "100"
reposerver.enable.git.submodule: "false"
# Server settings
server.insecure: "false"
server.enable.gzip: "true"
server.x.frame.options: "sameorigin"
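Edits to argocd-cm and argocd-cmd-params-cm are not always picked up live; restarting the affected workloads is the reliable way to apply them. A minimal sketch using the names from the HA manifests above:
# Restart Argo CD components so they reload the updated ConfigMaps
kubectl -n argo rollout restart deployment argocd-server
kubectl -n argo rollout restart deployment argocd-repo-server
kubectl -n argo rollout restart statefulset argocd-application-controller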
10.2.2 Argo Workflows Performance Tuning
# workflows-performance.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: workflow-controller-configmap
namespace: argo
data:
config: |
# Parallelism limit
parallelism: 100
# Namespace parallelism limit
namespaceParallelism: 50
# Resource rate limit
resourceRateLimit:
limit: 500
burst: 1000
# Pod GC
podGCGracePeriodSeconds: 30
podGCDeleteDelayDuration: 5s
# Workflow archive
persistence:
archive: true
archiveTTL: 7d
postgresql:
host: postgres
port: 5432
database: argo
tableName: argo_workflows
userNameSecret:
name: argo-postgres-config
key: username
passwordSecret:
name: argo-postgres-config
key: password
# Node status offloading
nodeStatusOffLoad: true
# Default resources for the executor (wait) container injected into workflow pods
executorResources:
requests:
cpu: 100m
memory: 64Mi
limits:
cpu: 500m
memory: 512Mi
# Workflow defaults
workflowDefaults:
spec:
ttlStrategy:
secondsAfterCompletion: 3600
secondsAfterSuccess: 1800
secondsAfterFailure: 7200
podGC:
strategy: OnPodSuccess
deleteDelayDuration: 60s
# Defaults applied to every template
templateDefaults:
container:
imagePullPolicy: IfNotPresent
# Metrics
metricsConfig:
enabled: true
path: /metrics
port: 9090
metricsTTL: 10m
# Link configuration
links:
- name: Workflow Logs
scope: workflow
url: https://grafana.example.com/d/workflow-logs?var-workflow=${metadata.name}
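The persistence block above references a Secret named argo-postgres-config for the PostgreSQL credentials; a hedged example of creating it (the username and password shown are placeholders you must replace):
# Secret consumed by userNameSecret/passwordSecret in the controller config
kubectl create secret generic argo-postgres-config -n argo \
  --from-literal=username=argo \
  --from-literal=password='change-me'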
10.2.3 Resource Limit Best Practices
# resource-limits.yaml
# Argo CD resource configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: argocd-server
namespace: argo
spec:
template:
spec:
containers:
- name: argocd-server
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 2
memory: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: argocd-repo-server
namespace: argo
spec:
template:
spec:
containers:
- name: argocd-repo-server
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 4
memory: 4Gi
---
# Workflow Pod resource templates
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: resource-template
namespace: argo
spec:
templates:
- name: small-task
container:
image: busybox
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
- name: medium-task
container:
image: busybox
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2
memory: 2Gi
- name: large-task
container:
image: busybox
resources:
requests:
cpu: 2
memory: 2Gi
limits:
cpu: 4
memory: 8Gi
- name: gpu-task
container:
image: nvidia/cuda:latest # pin a specific CUDA tag in production; latest is not a maintained tag
resources:
requests:
cpu: 2
memory: 4Gi
nvidia.com/gpu: 1
limits:
cpu: 4
memory: 8Gi
nvidia.com/gpu: 1
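Individual Workflows can reuse these sizing tiers through a templateRef instead of repeating resource blocks; a short illustrative sketch (the workflow name is arbitrary):
# Submit a workflow that runs the small-task tier from the template above
cat <<'EOF' | kubectl create -n argo -f -
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sized-task-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: run-small
        templateRef:
          name: resource-template
          template: small-task
EOF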
10.3 Monitoring and Alerting
10.3.1 Prometheus Monitoring Configuration
# argo-monitoring.yaml
# ServiceMonitor for Argo CD
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: argocd-metrics
namespace: monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/name: argocd-metrics
namespaceSelector:
matchNames:
- argo
endpoints:
- port: metrics
interval: 30s
---
# ServiceMonitor for Argo Workflows
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: argo-workflows-metrics
namespace: monitoring
spec:
selector:
matchLabels:
app: workflow-controller
namespaceSelector:
matchNames:
- argo
endpoints:
- port: metrics
interval: 30s
---
# ServiceMonitor for Argo Rollouts
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: argo-rollouts-metrics
namespace: monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/name: argo-rollouts-metrics
namespaceSelector:
matchNames:
- argo-rollouts
endpoints:
- port: metrics
interval: 30s
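Before building dashboards, it is worth spot-checking that the metrics endpoints actually respond (a sketch; 8082 is the default port of the argocd-metrics Service exposing application controller metrics):
# Port-forward the Argo CD metrics service and look for application metrics
kubectl -n argo port-forward svc/argocd-metrics 8082:8082 >/dev/null &
PF_PID=$!
sleep 2
curl -s localhost:8082/metrics | grep -m 5 '^argocd_app_info'
kill $PF_PID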
10.3.2 Grafana Dashboard
{
"dashboard": {
"title": "Argo Overview",
"panels": [
{
"title": "Argo CD - Application Sync Status",
"type": "stat",
"targets": [
{
"expr": "sum(argocd_app_info{sync_status=\"Synced\"})",
"legendFormat": "Synced"
},
{
"expr": "sum(argocd_app_info{sync_status=\"OutOfSync\"})",
"legendFormat": "Out of Sync"
}
]
},
{
"title": "Argo Workflows - Active Workflows",
"type": "graph",
"targets": [
{
"expr": "sum(argo_workflows_count{status=\"Running\"})",
"legendFormat": "Running"
},
{
"expr": "sum(argo_workflows_count{status=\"Pending\"})",
"legendFormat": "Pending"
}
]
},
{
"title": "Argo Rollouts - Rollout Status",
"type": "stat",
"targets": [
{
"expr": "sum(rollout_info{phase=\"Healthy\"})",
"legendFormat": "Healthy"
},
{
"expr": "sum(rollout_info{phase=\"Progressing\"})",
"legendFormat": "Progressing"
},
{
"expr": "sum(rollout_info{phase=\"Degraded\"})",
"legendFormat": "Degraded"
}
]
}
]
}
}
10.3.3 Alert Rules
# argo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: argo-alerts
namespace: monitoring
spec:
groups:
- name: argo-cd
rules:
- alert: ArgoCDAppOutOfSync
expr: |
argocd_app_info{sync_status="OutOfSync"} == 1
for: 30m
labels:
severity: warning
annotations:
summary: "Application {{ $labels.name }} is out of sync"
description: "Application {{ $labels.name }} in project {{ $labels.project }} has been out of sync for more than 30 minutes."
- alert: ArgoCDAppHealthDegraded
expr: |
argocd_app_info{health_status="Degraded"} == 1
for: 10m
labels:
severity: critical
annotations:
summary: "Application {{ $labels.name }} health is degraded"
description: "Application {{ $labels.name }} health status is Degraded."
- alert: ArgoCDSyncFailed
expr: |
increase(argocd_app_sync_total{phase="Failed"}[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Application {{ $labels.name }} sync failed"
description: "Application {{ $labels.name }} sync operation failed."
- name: argo-workflows
rules:
- alert: WorkflowFailed
expr: |
increase(argo_workflows_count{status="Failed"}[5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Workflow failed in namespace {{ $labels.namespace }}"
description: "One or more workflows have failed in the last 5 minutes."
- alert: WorkflowStuck
expr: |
argo_workflows_count{status="Running"} > 0
and
(time() - argo_workflow_start_time) > 7200
for: 10m
labels:
severity: warning
annotations:
summary: "Workflow running for too long"
description: "Workflow has been running for more than 2 hours."
- alert: WorkflowQueueBacklog
expr: |
argo_workflows_count{status="Pending"} > 50
for: 15m
labels:
severity: warning
annotations:
summary: "Workflow queue backlog is high"
description: "More than 50 workflows are pending execution."
- name: argo-rollouts
rules:
- alert: RolloutFailed
expr: |
rollout_info{phase="Failed"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Rollout {{ $labels.name }} has failed"
description: "Rollout {{ $labels.name }} in namespace {{ $labels.namespace }} has failed."
- alert: RolloutPaused
expr: |
rollout_info{phase="Paused"} == 1
for: 1h
labels:
severity: warning
annotations:
summary: "Rollout {{ $labels.name }} is paused"
description: "Rollout {{ $labels.name }} has been paused for more than 1 hour."
- alert: AnalysisRunFailed
expr: |
analysis_run_info{phase="Failed"} == 1
for: 1m
labels:
severity: warning
annotations:
summary: "Analysis run failed for rollout {{ $labels.rollout }}"
description: "Analysis run has failed, rollback may be triggered."
10.4 Troubleshooting
10.4.1 Common Issue Diagnostics
#!/bin/bash
# argo-diagnose.sh - Argo diagnostic script
echo "=== Argo CD Diagnostics ==="
# Check Argo CD component status
echo "Checking Argo CD components..."
kubectl get pods -n argo -l app.kubernetes.io/part-of=argocd
# Check application sync status
echo "Checking application sync status..."
kubectl get applications -n argo -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
# Check applications with sync errors
echo "Applications with sync errors..."
kubectl get applications -n argo -o json | jq -r '.items[] | select(.status.sync.status != "Synced") | "\(.metadata.name): \(.status.conditions[0].message)"'
echo "=== Argo Workflows Diagnostics ==="
# Check Workflow Controller
echo "Checking Workflow Controller..."
kubectl get pods -n argo -l app=workflow-controller
kubectl logs -n argo -l app=workflow-controller --tail=50
# Check failed workflows
echo "Failed workflows..."
kubectl get workflows -n argo -l workflows.argoproj.io/phase=Failed
# Check pending workflows
echo "Pending workflows..."
kubectl get workflows -n argo -l workflows.argoproj.io/phase=Pending
echo "=== Argo Rollouts Diagnostics ==="
# Check Rollout status
echo "Checking Rollouts..."
kubectl get rollouts -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase
# Check failed AnalysisRuns
echo "Failed AnalysisRuns..."
kubectl get analysisruns -A -o json | jq -r '.items[] | select(.status.phase=="Failed") | "\(.metadata.namespace)/\(.metadata.name)"'
echo "=== Argo Events Diagnostics ==="
# Check EventBus
echo "Checking EventBus..."
kubectl get eventbus -n argo-events
# Check EventSource
echo "Checking EventSources..."
kubectl get eventsources -n argo-events
# Check Sensor
echo "Checking Sensors..."
kubectl get sensors -n argo-events
10.4.2 Log Analysis
# Log collection configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level info
Daemon off
Parsers_File parsers.conf
[INPUT]
Name tail
Tag argo.*
Path /var/log/containers/argo*.log
Parser docker
DB /var/log/flb_argo.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
[FILTER]
Name kubernetes
Match argo.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
[OUTPUT]
Name es
Match argo.*
Host elasticsearch
Port 9200
Index argo-logs
Type _doc
10.4.3 Common Problem Solutions
# Problem 1: Application stuck in "Progressing"
# Solution: Check resource health status
# Force refresh application
# argocd app get <app-name> --refresh
# Problem 2: Workflow Pod OOMKilled
# Solution: Increase resource limits
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
templates:
- name: memory-intensive
container:
resources:
requests:
memory: 2Gi
limits:
memory: 4Gi
# Problem 3: Rollout stuck at Paused
# Solution: Manually promote or cancel
# kubectl argo rollouts promote <rollout-name>
# kubectl argo rollouts abort <rollout-name>
# Problem 4: EventSource not receiving events
# Solution: Check network and port configuration
apiVersion: v1
kind: Service
metadata:
name: webhook-eventsource-svc
namespace: argo-events
spec:
type: LoadBalancer # or use Ingress
ports:
- port: 12000
targetPort: 12000
selector:
eventsource-name: webhook
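With the Service exposed, you can test the path end to end by posting to the webhook and watching the sensor react (a sketch; the port, the /example path, and the sensor name webhook are assumptions that must match your EventSource and Sensor definitions):
# Send a test event to the exposed webhook EventSource
EXTERNAL_IP=$(kubectl get svc webhook-eventsource-svc -n argo-events \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST "http://$EXTERNAL_IP:12000/example" \
  -H 'Content-Type: application/json' -d '{"test": "event"}'
# Check whether the sensor picked it up
kubectl logs -n argo-events -l sensor-name=webhook --tail=20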
10.5 Backup and Recovery
10.5.1 Argo CD Backup
#!/bin/bash
# argocd-backup.sh
BACKUP_DIR="/backup/argocd/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR
# Backup all Applications
echo "Backing up Applications..."
kubectl get applications -n argo -o yaml > $BACKUP_DIR/applications.yaml
# Backup all AppProjects
echo "Backing up AppProjects..."
kubectl get appprojects -n argo -o yaml > $BACKUP_DIR/appprojects.yaml
# Backup ConfigMaps
echo "Backing up ConfigMaps..."
kubectl get configmaps -n argo -o yaml > $BACKUP_DIR/configmaps.yaml
# Backup Secrets (note security)
echo "Backing up Secrets..."
kubectl get secrets -n argo -o yaml > $BACKUP_DIR/secrets.yaml
# Backup RBAC configuration
echo "Backing up RBAC..."
kubectl get configmap argocd-rbac-cm -n argo -o yaml > $BACKUP_DIR/rbac.yaml
# Backup repository credentials
echo "Backing up repository credentials..."
kubectl get secrets -n argo -l argocd.argoproj.io/secret-type=repository -o yaml > $BACKUP_DIR/repo-creds.yaml
# Compress backup
tar -czf $BACKUP_DIR.tar.gz -C /backup/argocd $(date +%Y%m%d)
echo "Backup completed: $BACKUP_DIR.tar.gz"
10.5.2 Argo Workflows Backup
#!/bin/bash
# workflows-backup.sh
BACKUP_DIR="/backup/workflows/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR
# Backup WorkflowTemplates
echo "Backing up WorkflowTemplates..."
kubectl get workflowtemplates -A -o yaml > $BACKUP_DIR/workflowtemplates.yaml
# Backup ClusterWorkflowTemplates
echo "Backing up ClusterWorkflowTemplates..."
kubectl get clusterworkflowtemplates -o yaml > $BACKUP_DIR/clusterworkflowtemplates.yaml
# Backup CronWorkflows
echo "Backing up CronWorkflows..."
kubectl get cronworkflows -A -o yaml > $BACKUP_DIR/cronworkflows.yaml
# Backup ConfigMap
echo "Backing up ConfigMap..."
kubectl get configmap workflow-controller-configmap -n argo -o yaml > $BACKUP_DIR/controller-config.yaml
echo "Backup completed: $BACKUP_DIR"
10.5.3 Disaster Recovery Process
# disaster-recovery-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
name: disaster-recovery
namespace: argo
spec:
entrypoint: recovery-pipeline
arguments:
parameters:
- name: backup-location
value: "s3://backups/argo/latest"
templates:
- name: recovery-pipeline
dag:
tasks:
- name: verify-backup
template: verify-backup
- name: restore-argocd
dependencies: [verify-backup]
template: restore-argocd
- name: restore-workflows
dependencies: [verify-backup]
template: restore-workflows
- name: restore-events
dependencies: [verify-backup]
template: restore-events
- name: verify-recovery
dependencies: [restore-argocd, restore-workflows, restore-events]
template: verify-recovery
- name: verify-backup
container:
image: amazon/aws-cli:latest
command: ["/bin/sh", "-c"]
args:
- |
aws s3 ls {{workflow.parameters.backup-location}} || exit 1
echo "Backup verified"
- name: restore-argocd
container:
# bitnami/kubectl does not include the AWS CLI; use an image that bundles both
# kubectl and aws (the name below is a placeholder)
image: your-registry/kubectl-aws:latest
command: ["/bin/sh", "-c"]
args:
- |
# Download backup
aws s3 cp {{workflow.parameters.backup-location}}/argocd/ /tmp/argocd/ --recursive
# Restore AppProjects
kubectl apply -f /tmp/argocd/appprojects.yaml
# Restore Applications
kubectl apply -f /tmp/argocd/applications.yaml
# Restore repository credentials
kubectl apply -f /tmp/argocd/repo-creds.yaml
echo "Argo CD restored"
- name: restore-workflows
container:
image: your-registry/kubectl-aws:latest # placeholder image bundling kubectl and the AWS CLI
command: ["/bin/sh", "-c"]
args:
- |
aws s3 cp {{workflow.parameters.backup-location}}/workflows/ /tmp/workflows/ --recursive
kubectl apply -f /tmp/workflows/workflowtemplates.yaml
kubectl apply -f /tmp/workflows/clusterworkflowtemplates.yaml
kubectl apply -f /tmp/workflows/cronworkflows.yaml
echo "Argo Workflows restored"
- name: restore-events
container:
image: your-registry/kubectl-aws:latest # placeholder image bundling kubectl and the AWS CLI
command: ["/bin/sh", "-c"]
args:
- |
aws s3 cp {{workflow.parameters.backup-location}}/events/ /tmp/events/ --recursive
kubectl apply -f /tmp/events/
echo "Argo Events restored"
- name: verify-recovery
container:
image: bitnami/kubectl:latest
command: ["/bin/sh", "-c"]
args:
- |
# Verify Argo CD (use kubectl; the argocd CLI is not available in this image)
kubectl get applications -n argo
# Verify Workflows
kubectl get workflowtemplates -A
kubectl get cronworkflows -A
# Verify Events
kubectl get eventsources -n argo-events
kubectl get sensors -n argo-events
echo "Recovery verified"
10.6 Operations Checklist
10.6.1 Daily Maintenance Tasks
# daily-maintenance-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
name: daily-maintenance
namespace: argo
spec:
schedule: "0 2 * * *" # Daily at 2 AM
workflowSpec:
entrypoint: maintenance
templates:
- name: maintenance
dag:
tasks:
- name: cleanup-workflows
template: cleanup-workflows
- name: check-health
template: health-check
- name: backup
template: backup
dependencies: [check-health]
- name: report
template: generate-report
dependencies: [cleanup-workflows, backup]
- name: cleanup-workflows
container:
image: bitnami/kubectl:latest
command: ["/bin/sh", "-c"]
args:
- |
# Clean succeeded workflows that completed more than 7 days ago
kubectl get workflows -n argo -l workflows.argoproj.io/completed=true -o json | \
jq -r '.items[] | select(.status.phase=="Succeeded") | select(.status.finishedAt | fromdateiso8601 < (now - 604800)) | .metadata.name' | \
xargs -r kubectl delete workflow -n argo
# Clean failed workflows older than 30 days
kubectl get workflows -n argo -o json | \
jq -r '.items[] | select(.status.phase=="Failed") | select(.status.finishedAt | fromdateiso8601 < (now - 2592000)) | .metadata.name' | \
xargs -r kubectl delete workflow -n argo
- name: health-check
container:
image: curlimages/curl:latest
command: ["/bin/sh", "-c"]
args:
- |
# Check Argo CD
curl -f http://argocd-server.argo:80/healthz || exit 1
# Check Argo Server (assumes it runs in HTTP mode; use https:// and curl -k if TLS is enabled)
curl -f http://argo-server.argo:2746/api/v1/info || exit 1
echo "All services healthy"
- name: backup
container:
# This step needs both kubectl and the AWS CLI (the image name is a placeholder)
image: your-registry/kubectl-aws:latest
command: ["/bin/sh", "-c"]
args:
- |
DATE=$(date +%Y%m%d)
kubectl get applications,appprojects -n argo -o yaml | \
aws s3 cp - s3://backups/argo/$DATE/argocd.yaml
kubectl get workflowtemplates,cronworkflows -A -o yaml | \
aws s3 cp - s3://backups/argo/$DATE/workflows.yaml
- name: generate-report
container:
image: curlimages/curl:latest
command: ["/bin/sh", "-c"]
args:
- |
# Send daily report to Slack (SLACK_WEBHOOK is assumed to be injected as an env var from a Secret)
curl -X POST $SLACK_WEBHOOK \
-H "Content-Type: application/json" \
-d '{
"text": "Daily Maintenance Report",
"attachments": [{
"color": "good",
"text": "All maintenance tasks completed successfully"
}]
}'
10.6.2 Upgrade Checklist
# Argo Upgrade Checklist
## Pre-Upgrade Preparation
- [ ] Backup all configurations and data
- [ ] Check version compatibility and changelog
- [ ] Validate upgrade in test environment
- [ ] Notify relevant teams
## Upgrade Steps (see the command sketch after this checklist)
- [ ] Pause auto-sync
- [ ] Upgrade CRDs
- [ ] Upgrade controller
- [ ] Upgrade UI/Server
- [ ] Verify functionality
## Post-Upgrade Verification
- [ ] Check all component status
- [ ] Verify Application sync
- [ ] Verify Workflow execution
- [ ] Verify Rollout functionality
- [ ] Check monitoring and alerting
- [ ] Resume auto-sync
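For Argo CD specifically, the upgrade steps above usually reduce to a handful of commands; a minimal sketch (the version in the URL is a placeholder, take it from the release you validated in test):
# 1. Back up state before touching anything
argocd admin export -n argo > pre-upgrade-backup.yaml
# 2. Apply the HA manifests (CRDs included) for the target version
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-cd/<target-version>/manifests/ha/install.yaml
# 3. Wait for the rollout and verify
kubectl -n argo rollout status deployment argocd-server
kubectl -n argo rollout status deployment argocd-repo-server
argocd version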
10.7 Chapter Summary
This chapter summarized the best practices for the Argo ecosystem in production environments:
Key Points:
- High Availability: Use multi-replicas, anti-affinity, and leader election to ensure availability
- Performance Optimization: Properly configure parallelism limits, resource limits, and caching
- Monitoring & Alerting: Establish comprehensive monitoring and alerting system
- Troubleshooting: Prepare diagnostic tools and common problem solutions
- Backup & Recovery: Regular backups and test recovery processes
- Daily Operations: Automate daily maintenance tasks
Course Summary
Through this course, you have mastered the core components and best practices of the Argo ecosystem:
- Argo CD: GitOps continuous deployment
- Argo Workflows: Cloud-native workflow engine
- Argo Rollouts: Progressive delivery
- Argo Events: Event-driven automation
By combining these components, you can build a complete cloud-native CI/CD platform, enabling fully automated processes from code commit to production deployment.
Best wishes on your cloud-native journey!