โญ Featured Article

Advanced CI/CD Pipelines: GitOps and Infrastructure Automation at Scale

Build sophisticated CI/CD pipelines using GitOps principles, automated testing, and infrastructure as code for enterprise-scale deployments.

📅 June 8, 2024 · ⏱️ 22 min read
#DevOps #CI/CD #GitOps #Automation #Infrastructure


The Evolution of DevOps

Modern software delivery has evolved far beyond simple build-and-deploy scripts. Today's enterprises demand zero-downtime deployments, automated rollbacks, comprehensive testing, and infrastructure that scales automatically. This is where advanced CI/CD pipelines combined with GitOps principles become game-changers.

In this comprehensive guide, we'll build a complete enterprise-grade CI/CD system that handles everything from code commits to production deployments, with safety nets, monitoring, and automated recovery mechanisms.

Why Advanced CI/CD Matters

🚀 Business Impact

  • Faster Time to Market: Deploy features in minutes, not hours
  • Reduced Risk: Automated testing and rollback capabilities
  • Higher Reliability: Consistent, repeatable deployments
  • Developer Productivity: Focus on code, not deployment complexity

🛡️ Technical Benefits

  • Zero-Downtime Deployments: Blue-green and canary deployment strategies
  • Automated Testing: Comprehensive testing at every stage
  • Infrastructure as Code: Version-controlled, auditable infrastructure
  • Observability: Complete visibility into deployment pipeline
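The canary strategy mentioned above boils down to a simple control loop: shift a growing slice of traffic to the new version and abort the moment error rates breach a threshold. A minimal sketch (the 5%-doubling ramp and 5% error threshold are illustrative choices, not fixed conventions):

```python
def plan_canary_steps(start=5, factor=2, ceiling=100):
    """Traffic percentages a canary walks through, e.g. 5, 10, 20, 40, 80, 100."""
    steps, pct = [], start
    while pct < ceiling:
        steps.append(pct)
        pct *= factor
    steps.append(ceiling)
    return steps

def advance_canary(error_rates, threshold=0.05, **ramp_kwargs):
    """Walk the ramp against observed error rates per step.

    Returns ("promoted", 100) on success, or ("rolled_back", last_safe_pct)
    the moment a step's error rate exceeds the threshold.
    """
    last_safe = 0
    for pct, err in zip(plan_canary_steps(**ramp_kwargs), error_rates):
        if err > threshold:
            return ("rolled_back", last_safe)
        last_safe = pct
    return ("promoted", last_safe)
```

Real canary controllers (Argo Rollouts, Flagger) implement exactly this loop against live metrics, with the added step of rewriting service weights between iterations.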

GitOps Architecture Overview

Our advanced CI/CD system follows the core GitOps principles: Git is the single source of truth for application and infrastructure state, every change is a declarative, version-controlled commit, and an in-cluster agent (ArgoCD, in our setup) continuously reconciles live state against the desired state in the repository.
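At its core, the reconcile loop a GitOps agent runs can be sketched in a few lines (names here are illustrative; real controllers diff structured Kubernetes objects, not plain dicts):

```python
def reconcile(desired, live, apply, prune=True):
    """Diff desired state (from Git) against live state (from the cluster).

    Applies anything missing or drifted, optionally prunes resources that
    exist in the cluster but not in Git, and returns the actions taken.
    """
    actions = []
    for name, manifest in desired.items():
        if live.get(name) != manifest:
            actions.append(("apply", name))
            apply(name, manifest)
    if prune:
        for name in live:
            if name not in desired:
                actions.append(("prune", name))
    return actions
```

This is why GitOps rollbacks are just `git revert`: restoring the old manifests changes `desired`, and the next reconcile pass converges the cluster back.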

Complete CI/CD Pipeline Implementation

GitHub Actions Workflow

# .github/workflows/ci-cd.yml
name: Advanced CI/CD Pipeline

on:
  push:
    branches: [main, develop]
    paths-ignore:
      - 'docs/**'
      - '*.md'
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  NODE_VERSION: '18'
  PYTHON_VERSION: '3.11'

jobs:
  # ================================
  # CONTINUOUS INTEGRATION
  # ================================

  code-quality:
    name: Code Quality & Security
    runs-on: ubuntu-latest
    outputs:
      should-deploy: ${{ steps.changes.outputs.should-deploy }}

    steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        fetch-depth: 0  # Needed for proper diff analysis

    - name: Detect changes
      id: changes
      uses: dorny/paths-filter@v2
      with:
        filters: |
          should-deploy:
            - 'src/**'
            - 'package*.json'
            - 'Dockerfile'
            - '.github/workflows/**'

    - name: Set up Node.js
      uses: actions/setup-node@v4
      with:
        node-version: ${{ env.NODE_VERSION }}
        cache: 'npm'

    - name: Install dependencies
      run: npm ci

    - name: Lint code
      run: |
        npm run lint
        npm run format:check

    - name: Type checking
      run: npm run type-check

    - name: Security scanning (Semgrep)
      uses: returntocorp/semgrep-action@v1
      with:
        publishToken: ${{ secrets.SEMGREP_APP_TOKEN }}

    - name: Dependency vulnerability scan
      run: npm audit --audit-level high

    - name: SonarCloud Scan
      uses: SonarSource/sonarcloud-github-action@master
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

  unit-tests:
    name: Unit & Integration Tests
    runs-on: ubuntu-latest
    needs: code-quality

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: testdb
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432

      redis:
        image: redis:7
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379

    strategy:
      matrix:
        node-version: [16, 18, 20]

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Set up Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v4
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'

    - name: Install dependencies
      run: npm ci

    - name: Run unit tests
      run: npm run test:unit -- --coverage --watchAll=false
      env:
        DATABASE_URL: postgresql://postgres:postgres@localhost:5432/testdb
        REDIS_URL: redis://localhost:6379

    - name: Run integration tests
      run: npm run test:integration
      env:
        DATABASE_URL: postgresql://postgres:postgres@localhost:5432/testdb
        REDIS_URL: redis://localhost:6379
        NODE_ENV: test

    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage/lcov.info
        flags: unittests
        name: codecov-umbrella

  performance-tests:
    name: Performance & Load Tests
    runs-on: ubuntu-latest
    needs: [code-quality, unit-tests]

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Build application
      run: |
        npm ci
        npm run build

    - name: Start application
      run: |
        npm start &
        # Poll until the server answers rather than sleeping a fixed 30 seconds
        timeout 60 bash -c 'until curl -sf http://localhost:3000; do sleep 2; done'
      env:
        NODE_ENV: production
        PORT: 3000

    - name: Run Lighthouse CI
      uses: treosh/lighthouse-ci-action@v9
      with:
        urls: |
          http://localhost:3000
          http://localhost:3000/api/health
        configPath: './lighthouse.config.js'
        uploadArtifacts: true

    - name: Load testing with Artillery
      run: |
        npm install -g artillery
        artillery run load-test.yml

    - name: Performance regression check
      run: |
        # Compare metrics with baseline
        node scripts/performance-check.js

  security-tests:
    name: Security Testing
    runs-on: ubuntu-latest
    needs: code-quality

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Build Docker image for security scanning
      run: |
        docker build -t security-test:latest .

    - name: Container vulnerability scan
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: 'security-test:latest'
        format: 'sarif'
        output: 'trivy-results.sarif'

    - name: Upload Trivy scan results
      uses: github/codeql-action/upload-sarif@v2
      with:
        sarif_file: 'trivy-results.sarif'

    - name: Run container for dynamic scanning
      run: docker run -d -p 3000:3000 --name zap-target security-test:latest

    - name: OWASP ZAP baseline scan
      uses: zaproxy/action-baseline@v0.10.0
      with:
        target: 'http://localhost:3000'
        docker_name: 'owasp/zap2docker-stable'

  # ================================
  # BUILD & PACKAGE
  # ================================

  build-and-push:
    name: Build & Push Container
    runs-on: ubuntu-latest
    needs: [code-quality, unit-tests, performance-tests, security-tests]
    if: needs.code-quality.outputs.should-deploy == 'true'

    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
      image-digest: ${{ steps.build.outputs.digest }}

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Log in to Container Registry
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}

    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=sha,prefix={{branch}}-
          type=raw,value=latest,enable={{is_default_branch}}

    - name: Build and push Docker image
      id: build
      uses: docker/build-push-action@v5
      with:
        context: .
        platforms: linux/amd64,linux/arm64
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}
        cache-from: type=gha
        cache-to: type=gha,mode=max
        build-args: |
          BUILD_DATE=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.created'] }}
          VCS_REF=${{ github.sha }}

    - name: Install cosign
      uses: sigstore/cosign-installer@v3

    - name: Sign the published Docker image
      env:
        COSIGN_EXPERIMENTAL: 1
      run: |
        # Keyless signing; the job also needs `permissions: id-token: write`
        echo "${{ steps.meta.outputs.tags }}" | xargs -I {} cosign sign --yes {}@${{ steps.build.outputs.digest }}

  # ================================
  # CONTINUOUS DEPLOYMENT
  # ================================

  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: [build-and-push]
    if: github.ref == 'refs/heads/main'
    environment:
      name: staging
      url: https://staging.myapp.com

    steps:
    - name: Checkout GitOps repo
      uses: actions/checkout@v4
      with:
        repository: myorg/gitops-config
        token: ${{ secrets.GITOPS_TOKEN }}
        path: gitops

    - name: Update staging manifest
      run: |
        cd gitops

        # Update the staging deployment with new image
        yq eval '.spec.template.spec.containers[0].image = "${{ needs.build-and-push.outputs.image-tag }}"' \
          -i staging/deployment.yaml

        # Update configmap with build info
        yq eval '.data.BUILD_SHA = "${{ github.sha }}"' \
          -i staging/configmap.yaml
        yq eval '.data.BUILD_DATE = "${{ github.event.head_commit.timestamp }}"' \
          -i staging/configmap.yaml

    - name: Commit and push changes
      run: |
        cd gitops
        git config user.name "GitHub Actions"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git add .
        git commit -m "Deploy ${{ github.sha }} to staging"
        git push

    - name: Wait for deployment
      run: |
        # Poll until ArgoCD has synced and the app responds; give up after 5 minutes
        timeout 300 bash -c 'until curl -sf https://staging.myapp.com/health; do sleep 10; done'

    - name: Health check
      run: |
        # Verify deployment is healthy
        curl -f https://staging.myapp.com/health || exit 1

    - name: Run smoke tests
      run: |
        npm ci
        npm run test:smoke -- --baseUrl=https://staging.myapp.com

  e2e-tests:
    name: End-to-End Tests
    runs-on: ubuntu-latest
    needs: deploy-staging

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Install dependencies
      run: npm ci

    - name: Install Playwright browsers
      run: npx playwright install --with-deps

    - name: Run E2E tests
      run: npm run test:e2e
      env:
        BASE_URL: https://staging.myapp.com
        E2E_USERNAME: ${{ secrets.E2E_USERNAME }}
        E2E_PASSWORD: ${{ secrets.E2E_PASSWORD }}

    - name: Upload test results
      uses: actions/upload-artifact@v3
      if: always()
      with:
        name: playwright-results
        path: test-results/

  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: [build-and-push, e2e-tests]
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://myapp.com

    steps:
    - name: Checkout GitOps repo
      uses: actions/checkout@v4
      with:
        repository: myorg/gitops-config
        token: ${{ secrets.GITOPS_TOKEN }}
        path: gitops

    - name: Create production deployment PR
      run: |
        cd gitops

        # Create a new branch for production deployment
        git checkout -b deploy-prod-${{ github.sha }}

        # Update production manifest
        yq eval '.spec.template.spec.containers[0].image = "${{ needs.build-and-push.outputs.image-tag }}"' \
          -i production/deployment.yaml

        # Update configmap
        yq eval '.data.BUILD_SHA = "${{ github.sha }}"' \
          -i production/configmap.yaml
        yq eval '.data.BUILD_DATE = "${{ github.event.head_commit.timestamp }}"' \
          -i production/configmap.yaml

        # Commit changes
        git config user.name "GitHub Actions"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git add .
        git commit -m "Deploy ${{ github.sha }} to production"
        git push -u origin deploy-prod-${{ github.sha }}

        # Create PR for review
        gh pr create \
          --title "🚀 Deploy ${{ github.sha }} to Production" \
          --body "Automated deployment PR for commit ${{ github.sha }}" \
          --reviewer team:platform-team \
          --base main \
          --head deploy-prod-${{ github.sha }}
      env:
        GITHUB_TOKEN: ${{ secrets.GITOPS_TOKEN }}

  post-deployment:
    name: Post-Deployment Verification
    runs-on: ubuntu-latest
    needs: deploy-production

    steps:
    - name: Production health check
      run: |
        # Poll until the new deployment is serving instead of sleeping blindly
        timeout 300 bash -c 'until curl -sf https://myapp.com/health; do sleep 10; done'

        # Comprehensive health checks
        curl -f https://myapp.com/api/health
        curl -f https://myapp.com/metrics

    - name: Performance validation
      run: |
        # Run production performance tests
        npm ci
        npm run test:performance:prod

    - name: Update deployment status
      uses: chrnorm/deployment-status@v2
      with:
        token: ${{ secrets.GITHUB_TOKEN }}
        # NOTE: deploy-production must expose this as a job output for it to resolve
        deployment-id: ${{ needs.deploy-production.outputs.deployment-id }}
        state: success
        environment-url: https://myapp.com

    - name: Notify team
      uses: 8398a7/action-slack@v3
      with:
        status: success
        text: "🚀 Successfully deployed ${{ github.sha }} to production!"
      env:
        SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
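The "Performance regression check" step above delegates to `scripts/performance-check.js`; the comparison it performs looks roughly like this sketch (shown in Python for brevity, with hypothetical metric names and a 10% tolerance as assumptions, not the article's actual script):

```python
def check_regression(current, baseline, tolerance=0.10):
    """Compare current latency metrics against a stored baseline.

    Returns a list of (metric, baseline_value, current_value) tuples for every
    metric that regressed by more than `tolerance`; empty means the gate passes.
    """
    failures = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            continue  # metric not measured in this run; skip rather than fail
        if value > base_value * (1 + tolerance):
            failures.append((metric, base_value, value))
    return failures
```

In CI, the script would exit non-zero when `failures` is non-empty, failing the pipeline before a slow build reaches staging.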

Advanced Docker Configuration

# Multi-stage Dockerfile for optimized builds
FROM node:18-alpine AS base
RUN apk add --no-cache dumb-init
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci --omit=dev && npm cache clean --force

# Development stage
FROM base AS dev
ENV NODE_ENV=development
RUN npm ci
COPY . .
CMD ["dumb-init", "npm", "run", "dev"]

# Build stage
FROM base AS build
ENV NODE_ENV=production
COPY . .
RUN npm ci && npm run build && npm prune --omit=dev

# Production stage
FROM node:18-alpine AS production
RUN apk add --no-cache dumb-init curl

# Create app user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001

WORKDIR /usr/src/app
COPY --from=build --chown=nodejs:nodejs /usr/src/app/dist ./dist
COPY --from=build --chown=nodejs:nodejs /usr/src/app/node_modules ./node_modules
COPY --from=build --chown=nodejs:nodejs /usr/src/app/package*.json ./

# Security labels
LABEL org.opencontainers.image.source="https://github.com/myorg/myapp"
LABEL org.opencontainers.image.vendor="MyOrg"
LABEL org.opencontainers.image.licenses="MIT"

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1

USER nodejs
EXPOSE 3000

CMD ["dumb-init", "node", "dist/index.js"]
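The Dockerfile's HEALTHCHECK, and the Kubernetes probes later in this article, assume the app exposes `/health` and `/ready` endpoints with distinct meanings: liveness says the process is up, readiness says its dependencies are reachable. A minimal sketch of that contract (Python for brevity; the article's app is Node, so treat this purely as the shape of the endpoints):

```python
from http.server import BaseHTTPRequestHandler
import json

def probe_response(path, deps_ok):
    """Map a probe path to (status_code, body) — the contract probes rely on."""
    if path == "/health":
        return 200, {"status": "ok"}  # liveness: the process is up
    if path == "/ready":
        # readiness: only accept traffic when dependencies answer
        return (200, {"ready": True}) if deps_ok else (503, {"ready": False})
    return 404, {"error": "not found"}

def app_dependencies_ok():
    # Hypothetical check — a real app would ping Postgres/Redis here
    return True

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = probe_response(self.path, app_dependencies_ok())
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

Keeping the two endpoints separate matters: if `/health` also checked the database, a brief database blip would make Kubernetes restart perfectly healthy pods instead of just pausing traffic to them.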

Kubernetes Deployment Configuration

# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
    version: v1
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      containers:
      - name: myapp
        image: ghcr.io/myorg/myapp:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: http
          protocol: TCP
        env:
        - name: NODE_ENV
          value: "production"
        - name: PORT
          value: "3000"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: database-url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: redis-url
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: http
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30
        volumeMounts:
        - name: app-config
          mountPath: /usr/src/app/config
          readOnly: true
        - name: temp-storage
          mountPath: /tmp
      volumes:
      - name: app-config
        configMap:
          name: app-config
      - name: temp-storage
        emptyDir: {}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myapp
              topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
  namespace: production
  labels:
    app: myapp
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app: myapp
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: "nginx"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
spec:
  tls:
  - hosts:
    - myapp.com
    - www.myapp.com
    secretName: myapp-tls
  rules:
  - host: myapp.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80
  - host: www.myapp.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80
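The RollingUpdate parameters above (replicas: 5, maxSurge: 2, maxUnavailable: 1) bound pod counts during a rollout: never fewer than 4 pods serving, never more than 7 scheduled. A one-liner makes the arithmetic explicit:

```python
def rollout_bounds(replicas, max_surge, max_unavailable):
    """(min_pods_available, max_pods_total) during a rolling update."""
    return replicas - max_unavailable, replicas + max_surge
```

Tune these against cluster headroom: a larger `maxSurge` speeds up rollouts but needs spare capacity for the extra pods, while `maxUnavailable: 0` guarantees full serving capacity at the cost of slower updates.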

ArgoCD Application Configuration

# argocd/applications/myapp-production.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-production
  namespace: argocd
  labels:
    environment: production
    team: platform
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/gitops-config
    targetRevision: HEAD
    path: production
    helm:
      valueFiles:
      - values.yaml
      - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
    - Validate=true
    - CreateNamespace=false
    - PrunePropagationPolicy=foreground
    - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m0s
  revisionHistoryLimit: 10
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas
  info:
  - name: "Environment"
    value: "Production"
  - name: "Team"
    value: "Platform Team"
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: "Production applications"
  sourceRepos:
  - 'https://github.com/myorg/gitops-config'
  destinations:
  - namespace: production
    server: https://kubernetes.default.svc
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  - group: 'rbac.authorization.k8s.io'
    kind: ClusterRole
  - group: 'rbac.authorization.k8s.io'
    kind: ClusterRoleBinding
  namespaceResourceWhitelist:
  - group: '*'
    kind: '*'
  roles:
  - name: production-admin
    description: "Production admin access"
    policies:
    - p, proj:production:production-admin, applications, *, production/*, allow
    - p, proj:production:production-admin, repositories, *, *, allow
    groups:
    - myorg:platform-team
  - name: production-readonly
    description: "Production read-only access"
    policies:
    - p, proj:production:production-readonly, applications, get, production/*, allow
    groups:
    - myorg:developers
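Rather than the fixed sleeps used in the workflow, a deploy script can poll the ArgoCD API (`GET /api/v1/applications/{name}`) and wait until the Application reports both `Synced` and `Healthy`. A small helper over that status shape (field names follow the ArgoCD Application status schema):

```python
def app_converged(app):
    """True once an ArgoCD Application is both Synced and Healthy.

    `app` is the JSON object returned by GET /api/v1/applications/{name};
    missing fields are treated as not-yet-converged.
    """
    status = app.get("status", {})
    synced = status.get("sync", {}).get("status") == "Synced"
    healthy = status.get("health", {}).get("status") == "Healthy"
    return synced and healthy
```

A deploy job would call this in a loop with a timeout, failing the pipeline if the Application never converges instead of assuming sixty seconds is enough.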

Advanced Monitoring & Alerting

# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: production
spec:
  groups:
  - name: myapp.deployment
    interval: 30s
    rules:
    - alert: DeploymentReplicasMismatch
      expr: |
        (
          kube_deployment_spec_replicas{namespace="production",deployment="myapp"}
          !=
          kube_deployment_status_replicas_available{namespace="production",deployment="myapp"}
        ) > 0
      for: 5m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "Deployment replica mismatch for {{ $labels.deployment }}"
        description: "Deployment {{ $labels.deployment }} has {{ $value }} replica(s) unavailable"

    - alert: HighPodRestartRate
      expr: |
        rate(kube_pod_container_status_restarts_total{namespace="production"}[5m]) > 0.1
      for: 5m
      labels:
        severity: critical
        team: platform
      annotations:
        summary: "High pod restart rate for {{ $labels.pod }}"
        description: "Pod {{ $labels.pod }} is restarting frequently"

  - name: myapp.performance
    interval: 30s
    rules:
    - alert: HighResponseTime
      expr: |
        histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="myapp"}[5m])) > 0.5
      for: 2m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "High response time for myapp"
        description: "95th percentile response time is {{ $value }}s"

    - alert: HighErrorRate
      expr: |
        (
          rate(http_requests_total{job="myapp",status=~"5.."}[5m])
          /
          rate(http_requests_total{job="myapp"}[5m])
        ) > 0.05
      for: 2m
      labels:
        severity: critical
        team: platform
      annotations:
        summary: "High error rate for myapp"
        description: "Error rate is {{ $value | humanizePercentage }}"

  - name: myapp.resources
    interval: 30s
    rules:
    - alert: HighMemoryUsage
      expr: |
        (
          container_memory_working_set_bytes{namespace="production",container="myapp"}
          /
          container_spec_memory_limit_bytes{namespace="production",container="myapp"}
        ) > 0.8
      for: 5m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "High memory usage for {{ $labels.pod }}"
        description: "Memory usage is {{ $value | humanizePercentage }}"

    - alert: HighCPUUsage
      expr: |
        (
          rate(container_cpu_usage_seconds_total{namespace="production",container="myapp"}[5m])
          /
          (container_spec_cpu_quota{namespace="production",container="myapp"}
           / container_spec_cpu_period{namespace="production",container="myapp"})
        ) * 100 > 80
      for: 5m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "High CPU usage for {{ $labels.pod }}"
        description: "CPU usage is {{ $value }}%"
---
# Grafana Dashboard Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "MyApp Production Dashboard",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(http_requests_total{job=\"myapp\"}[5m])",
                "legendFormat": "{{ method }} {{ status }}"
              }
            ]
          },
          {
            "title": "Response Time",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m]))",
                "legendFormat": "50th percentile"
              },
              {
                "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m]))",
                "legendFormat": "95th percentile"
              }
            ]
          }
        ]
      }
    }

Automated Rollback System

# scripts/auto-rollback.py
import os
import requests
import subprocess
import time
from datetime import datetime, timedelta
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class AutoRollback:
    def __init__(self):
        self.prometheus_url = os.getenv('PROMETHEUS_URL', 'http://prometheus:9090')
        self.argocd_url = os.getenv('ARGOCD_URL', 'http://argocd:8080')
        self.argocd_token = os.getenv('ARGOCD_TOKEN')
        self.app_name = os.getenv('APP_NAME', 'myapp-production')
        self.namespace = os.getenv('NAMESPACE', 'production')

        # Rollback thresholds
        self.error_rate_threshold = 0.05  # 5%
        self.response_time_threshold = 1.0  # 1 second
        self.check_interval = 60  # seconds
        self.rollback_window = 300  # 5 minutes after deployment

    def check_deployment_health(self):
        """Check if the current deployment is healthy"""
        try:
            # Get deployment time
            deployment_time = self.get_last_deployment_time()
            if not deployment_time:
                return True

            # Only check if deployment is within rollback window
            time_diff = datetime.now() - deployment_time
            if time_diff > timedelta(seconds=self.rollback_window):
                return True

            # Check error rate
            error_rate = self.get_error_rate()
            if error_rate > self.error_rate_threshold:
                logger.error(f"High error rate detected: {error_rate:.2%}")
                return False

            # Check response time
            response_time = self.get_response_time()
            if response_time > self.response_time_threshold:
                logger.error(f"High response time detected: {response_time:.2f}s")
                return False

            # Check pod health
            unhealthy_pods = self.get_unhealthy_pods()
            if unhealthy_pods > 0:
                logger.error(f"Unhealthy pods detected: {unhealthy_pods}")
                return False

            logger.info("Deployment health check passed")
            return True

        except Exception as e:
            logger.error(f"Health check failed: {str(e)}")
            return False

    def get_error_rate(self):
        """Get current error rate from Prometheus"""
        query = f'''
        (
          rate(http_requests_total{{namespace="{self.namespace}",status=~"5.."}}[5m])
          /
          rate(http_requests_total{{namespace="{self.namespace}"}}[5m])
        )
        '''

        response = requests.get(
            f'{self.prometheus_url}/api/v1/query',
            params={'query': query}
        )

        data = response.json()
        if data['data']['result']:
            return float(data['data']['result'][0]['value'][1])
        return 0.0

    def get_response_time(self):
        """Get 95th percentile response time"""
        query = f'''
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket{{namespace="{self.namespace}"}}[5m])
        )
        '''

        response = requests.get(
            f'{self.prometheus_url}/api/v1/query',
            params={'query': query}
        )

        data = response.json()
        if data['data']['result']:
            return float(data['data']['result'][0]['value'][1])
        return 0.0

    def get_unhealthy_pods(self):
        """Get number of unhealthy pods"""
        query = f'''
        kube_pod_status_ready{{namespace="{self.namespace}",condition="false"}}
        '''

        response = requests.get(
            f'{self.prometheus_url}/api/v1/query',
            params={'query': query}
        )

        data = response.json()
        return len(data['data']['result'])

    def get_last_deployment_time(self):
        """Get timestamp of last deployment"""
        try:
            subprocess.run([
                'kubectl', 'get', 'deployment', 'myapp',
                '-n', self.namespace,
                '-o', r'jsonpath={.metadata.annotations.deployment\.kubernetes\.io/revision}'
            ], capture_output=True, text=True, check=True)

            # Simplified: in practice, parse the deployment's actual
            # rollout timestamp from its status conditions
            return datetime.now() - timedelta(minutes=2)
        except (subprocess.CalledProcessError, FileNotFoundError):
            return None

    def rollback_deployment(self):
        """Rollback to previous version using ArgoCD"""
        try:
            logger.info(f"Initiating rollback for {self.app_name}")

            # Get application history
            headers = {'Authorization': f'Bearer {self.argocd_token}'}
            response = requests.get(
                f'{self.argocd_url}/api/v1/applications/{self.app_name}/revisions',
                headers=headers
            )

            revisions = response.json()
            if len(revisions) < 2:
                logger.error("No previous revision available for rollback")
                return False

            # Get previous revision
            previous_revision = revisions[1]['revision']

            # Rollback application
            rollback_request = {
                'revision': previous_revision,
                'prune': True,
                'dryRun': False
            }

            response = requests.post(
                f'{self.argocd_url}/api/v1/applications/{self.app_name}/rollback',
                headers=headers,
                json=rollback_request
            )

            if response.status_code == 200:
                logger.info(f"Rollback initiated successfully to revision {previous_revision}")
                self.send_rollback_notification(previous_revision)
                return True
            else:
                logger.error(f"Rollback failed: {response.text}")
                return False

        except Exception as e:
            logger.error(f"Rollback operation failed: {str(e)}")
            return False

    def send_rollback_notification(self, revision):
        """Send notification about rollback"""
        webhook_url = os.getenv('SLACK_WEBHOOK_URL')
        if not webhook_url:
            return

        message = {
            "text": f"๐Ÿšจ Automatic rollback triggered for {self.app_name}",
            "blocks": [
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": f"*Automatic Rollback Executed*\n\n"
                                f"โ€ข Application: `{self.app_name}`\n"
                                f"โ€ข Rolled back to: `{revision}`\n"
                                f"โ€ข Reason: Health check failure\n"
                                f"โ€ข Time: {datetime.now().isoformat()}"
                    }
                }
            ]
        }

        requests.post(webhook_url, json=message, timeout=10)

    def monitor_and_rollback(self):
        """Main monitoring loop"""
        logger.info(f"Starting health monitoring for {self.app_name}")

        while True:
            try:
                if not self.check_deployment_health():
                    logger.warning("Health check failed, initiating rollback")
                    if self.rollback_deployment():
                        logger.info("Rollback completed successfully")
                        # Exit after rollback
                        break
                    else:
                        logger.error("Rollback failed, manual intervention required")

                time.sleep(self.check_interval)

            except KeyboardInterrupt:
                logger.info("Monitoring stopped by user")
                break
            except Exception as e:
                logger.error(f"Monitoring error: {str(e)}")
                time.sleep(self.check_interval)


if __name__ == "__main__":
    monitor = AutoRollback()
    monitor.monitor_and_rollback()
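The rollback path above depends on two assumptions: the revisions endpoint returns history newest-first, and at least two entries exist. That selection logic can be isolated and sanity-checked without touching a live ArgoCD instance. A minimal sketch (the helper name `pick_previous_revision` is ours, not part of the script above):

```python
def pick_previous_revision(revisions):
    """Return the revision to roll back to, or None if unavailable.

    Mirrors the monitor's logic: revisions are assumed newest-first,
    so index 1 is the previous (known-good) deployment.
    """
    if len(revisions) < 2:
        return None
    return revisions[1]["revision"]


# Edge cases the monitor logs as errors: empty history, single revision.
assert pick_previous_revision([]) is None
assert pick_previous_revision([{"revision": "abc123"}]) is None

# Normal case: roll back to the second-newest entry.
assert pick_previous_revision(
    [{"revision": "new111"}, {"revision": "old222"}]
) == "old222"
print("previous-revision selection checks passed")
```

Pulling this out into a pure function also makes the monitor easier to unit-test with mocked HTTP responses.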

Infrastructure as Code with Terraform

# infrastructure/terraform/main.tf
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.16"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.8"
    }
  }

  backend "s3" {
    bucket  = "myorg-terraform-state"
    key     = "production/terraform.tfstate"
    region  = "us-west-2"
    encrypt = true
  }
}

# EKS Cluster
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = var.cluster_name
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true
  cluster_endpoint_public_access_cidrs = ["0.0.0.0/0"]

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent = true
    }
  }

  eks_managed_node_groups = {
    main = {
      min_size       = 3
      max_size       = 10
      desired_size   = 5
      instance_types = ["m5.large", "m5.xlarge"]

      k8s_labels = {
        Environment = "production"
        Application = "myapp"
      }

      tags = {
        ExtraTag = "production-nodes"
      }
    }
  }

  tags = var.common_tags
}

# ArgoCD Installation
resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = "argocd"
  version    = "5.51.6"

  create_namespace = true

  values = [
    templatefile("${path.module}/argocd-values.yaml", {
      hostname = "argocd.${var.domain_name}"
    })
  ]

  depends_on = [module.eks]
}

# Prometheus and Grafana
resource "helm_release" "kube_prometheus_stack" {
  name       = "kube-prometheus-stack"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = "monitoring"
  version    = "55.5.0"

  create_namespace = true

  values = [
    templatefile("${path.module}/prometheus-values.yaml", {
      grafana_hostname = "grafana.${var.domain_name}"
      prometheus_hostname = "prometheus.${var.domain_name}"
    })
  ]

  depends_on = [module.eks]
}

# External Secrets Operator
resource "helm_release" "external_secrets" {
  name       = "external-secrets"
  repository = "https://charts.external-secrets.io"
  chart      = "external-secrets"
  namespace  = "external-secrets-system"
  version    = "0.9.11"

  create_namespace = true

  depends_on = [module.eks]
}

# AWS Load Balancer Controller
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  version    = "1.6.2"

  set {
    name  = "clusterName"
    value = module.eks.cluster_name
  }

  set {
    name  = "serviceAccount.create"
    value = "true"
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = aws_iam_role.aws_load_balancer_controller.arn
  }

  depends_on = [module.eks]
}

# RDS Database
resource "aws_db_instance" "main" {
  identifier            = "${var.project_name}-production"
  engine                = "postgres"
  engine_version        = "15.4"
  instance_class        = "db.t3.medium"
  allocated_storage     = 100
  max_allocated_storage = 1000

  db_name  = var.database_name
  username = var.database_username
  password = var.database_password

  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  skip_final_snapshot = false
  final_snapshot_identifier = "${var.project_name}-final-snapshot-${random_id.snapshot_suffix.hex}"

  performance_insights_enabled = true
  monitoring_interval          = 60
  monitoring_role_arn          = aws_iam_role.rds_enhanced_monitoring.arn

  tags = var.common_tags
}

# ElastiCache Redis
resource "aws_elasticache_replication_group" "main" {
  replication_group_id       = "${var.project_name}-redis"
  description                = "Redis cluster for ${var.project_name}"

  num_cache_clusters   = 2
  node_type            = "cache.t3.micro"
  port                 = 6379
  parameter_group_name = "default.redis7"

  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true

  tags = var.common_tags
}

# Outputs
output "cluster_endpoint" {
  description = "Endpoint for EKS control plane"
  value       = module.eks.cluster_endpoint
}

output "cluster_security_group_id" {
  description = "Security group ids attached to the cluster control plane"
  value       = module.eks.cluster_security_group_id
}

output "database_endpoint" {
  description = "RDS instance endpoint"
  value       = aws_db_instance.main.endpoint
  sensitive   = true
}

output "redis_endpoint" {
  description = "Redis cluster endpoint"
  value       = aws_elasticache_replication_group.main.primary_endpoint_address
  sensitive   = true
}
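Downstream pipeline steps often need these outputs, for example to wire the database endpoint into a Kubernetes secret. One way to consume them from a pipeline script, assuming the standard wrapper format that `terraform output -json` emits (each output nested under a `"value"` key), is sketched below; the helper name `read_tf_outputs` is ours:

```python
import json
import subprocess


def read_tf_outputs(raw_json=None):
    """Parse `terraform output -json` into a flat {name: value} dict.

    If raw_json is None, shell out to terraform; otherwise parse the
    given string (handy for testing without a real state file).
    """
    if raw_json is None:
        raw_json = subprocess.check_output(
            ["terraform", "output", "-json"], text=True
        )
    data = json.loads(raw_json)
    # Terraform wraps each output as {"value": ..., "sensitive": bool, ...}
    return {name: spec["value"] for name, spec in data.items()}


# Simulated output matching the resources defined above.
sample = json.dumps({
    "cluster_endpoint": {"value": "https://eks.example.com", "sensitive": False},
    "database_endpoint": {"value": "db.internal:5432", "sensitive": True},
})
outputs = read_tf_outputs(sample)
assert outputs["cluster_endpoint"].startswith("https://")
print(outputs["database_endpoint"])
```

Note that outputs marked `sensitive = true` are masked in plain `terraform output` but still appear in the `-json` form, so treat the parsed dict as secret material in CI logs.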

Conclusion

This advanced CI/CD pipeline implementation provides enterprise-grade capabilities:

โœ… Key Achievements

  1. Comprehensive Testing: Multi-stage testing from unit tests to E2E validation
  2. Security Integration: Built-in security scanning and vulnerability management
  3. GitOps Workflow: Git-based deployment with automated synchronization
  4. Zero-Downtime Deployments: Rolling updates with health checks and rollback
  5. Infrastructure as Code: Fully automated infrastructure provisioning
  6. Monitoring & Alerting: Real-time observability with automated responses

๐Ÿš€ Enterprise Benefits

  • Reliability: Automated testing and validation at every stage
  • Security: Built-in security scanning and compliance checks
  • Scalability: Container-based deployments with auto-scaling
  • Observability: Comprehensive monitoring and alerting
  • Governance: Git-based approvals and audit trails
  • Recovery: Automated rollback and disaster recovery

๐Ÿ”„ Next Steps

  1. Customize for Your Stack: Adapt the pipeline for your technology choices
  2. Implement Gradually: Start with staging environments and expand
  3. Train Your Team: Ensure everyone understands GitOps principles
  4. Optimize Performance: Fine-tune based on your specific requirements
  5. Enhance Security: Add additional security scanning and compliance checks

This advanced CI/CD system transforms software delivery from a manual, error-prone process into a reliable, automated, and secure pipeline that scales with your organization's needs.


Ready to implement advanced CI/CD? Contact our DevOps consulting team for a comprehensive assessment and implementation roadmap tailored to your organization.

Published on 6/8/2024


Yogesh Bhandari

Technology Visionary & Co-Founder
