Advanced CI/CD Pipelines: GitOps and Infrastructure Automation at Scale
Build sophisticated CI/CD pipelines using GitOps principles, automated testing, and infrastructure as code for enterprise-scale deployments.
The Evolution of DevOps
Modern software delivery has evolved far beyond simple build-and-deploy scripts. Today's enterprises demand zero-downtime deployments, automated rollbacks, comprehensive testing, and infrastructure that scales automatically. This is where advanced CI/CD pipelines combined with GitOps principles become game-changers.
In this comprehensive guide, we'll build a complete enterprise-grade CI/CD system that handles everything from code commits to production deployments, with safety nets, monitoring, and automated recovery mechanisms.
Why Advanced CI/CD Matters
Business Impact
- Faster Time to Market: Deploy features in minutes, not hours
- Reduced Risk: Automated testing and rollback capabilities
- Higher Reliability: Consistent, repeatable deployments
- Developer Productivity: Focus on code, not deployment complexity
Technical Benefits
- Zero-Downtime Deployments: Blue-green and canary deployment strategies
- Automated Testing: Comprehensive testing at every stage
- Infrastructure as Code: Version-controlled, auditable infrastructure
- Observability: Complete visibility into deployment pipeline
GitOps Architecture Overview
Our advanced CI/CD system follows GitOps principles: Git is the single source of truth for both application and infrastructure state, every change lands through a pull request, and an in-cluster agent (ArgoCD, in our setup) continuously reconciles the live cluster against the declared state.
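As a toy illustration of the reconciliation model behind these principles (resource names below are made up, not from a real cluster), an agent like ArgoCD continually diffs the desired state in Git against the live cluster state and applies only the difference:

```python
# Toy model of a GitOps reconciliation loop: desired state comes from Git,
# live state from the cluster; the agent applies only the difference.
def reconcile(desired: dict, live: dict) -> dict:
    """Return the changes needed to make `live` match `desired`."""
    changes = {}
    for resource, spec in desired.items():
        if live.get(resource) != spec:
            changes[resource] = spec      # create or update drifted resources
    for resource in live:
        if resource not in desired:
            changes[resource] = None      # prune resources removed from Git
    return changes

desired = {"deployment/myapp": {"image": "myapp:v2", "replicas": 5}}
live = {
    "deployment/myapp": {"image": "myapp:v1", "replicas": 5},
    "service/old-svc": {"port": 80},
}
plan = reconcile(desired, live)
# deployment/myapp is updated to v2; service/old-svc is marked for pruning
```

Real controllers run this loop continuously, which is what gives GitOps its self-healing property: any manual drift in the cluster is detected and reverted on the next pass.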
Complete CI/CD Pipeline Implementation
GitHub Actions Workflow
# .github/workflows/ci-cd.yml
name: Advanced CI/CD Pipeline
on:
push:
branches: [main, develop]
paths-ignore:
- 'docs/**'
- '*.md'
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
NODE_VERSION: '18'
PYTHON_VERSION: '3.11'
jobs:
# ================================
# CONTINUOUS INTEGRATION
# ================================
code-quality:
name: Code Quality & Security
runs-on: ubuntu-latest
outputs:
should-deploy: ${{ steps.changes.outputs.should-deploy }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0 # Needed for proper diff analysis
- name: Detect changes
id: changes
uses: dorny/paths-filter@v2
with:
filters: |
should-deploy:
- 'src/**'
- 'package*.json'
- 'Dockerfile'
- '.github/workflows/**'
- name: Set up Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
      - name: Install dependencies
        run: npm ci
- name: Lint code
run: |
npm run lint
npm run format:check
- name: Type checking
run: npm run type-check
- name: Security scanning (Semgrep)
uses: returntocorp/semgrep-action@v1
with:
publishToken: ${{ secrets.SEMGREP_APP_TOKEN }}
- name: Dependency vulnerability scan
run: npm audit --audit-level high
- name: SonarCloud Scan
uses: SonarSource/sonarcloud-github-action@master
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
unit-tests:
name: Unit & Integration Tests
runs-on: ubuntu-latest
needs: code-quality
services:
postgres:
image: postgres:15
env:
POSTGRES_PASSWORD: postgres
POSTGRES_DB: testdb
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 5432:5432
redis:
image: redis:7
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 6379:6379
strategy:
matrix:
node-version: [16, 18, 20]
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run unit tests
run: npm run test:unit -- --coverage --watchAll=false
env:
DATABASE_URL: postgresql://postgres:postgres@localhost:5432/testdb
REDIS_URL: redis://localhost:6379
- name: Run integration tests
run: npm run test:integration
env:
DATABASE_URL: postgresql://postgres:postgres@localhost:5432/testdb
REDIS_URL: redis://localhost:6379
NODE_ENV: test
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
file: ./coverage/lcov.info
flags: unittests
name: codecov-umbrella
performance-tests:
name: Performance & Load Tests
runs-on: ubuntu-latest
needs: [code-quality, unit-tests]
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Build application
run: |
npm ci
npm run build
- name: Start application
run: |
          npm start &
          # Poll until the app responds instead of sleeping a fixed 30s,
          # so the job fails fast if the app never comes up
          for i in $(seq 1 30); do
            curl -fsS http://localhost:3000/health && break
            sleep 2
          done
env:
NODE_ENV: production
PORT: 3000
- name: Run Lighthouse CI
uses: treosh/lighthouse-ci-action@v9
with:
urls: |
http://localhost:3000
http://localhost:3000/api/health
configPath: './lighthouse.config.js'
uploadArtifacts: true
- name: Load testing with Artillery
run: |
npm install -g artillery
artillery run load-test.yml
- name: Performance regression check
run: |
# Compare metrics with baseline
node scripts/performance-check.js
security-tests:
name: Security Testing
runs-on: ubuntu-latest
needs: code-quality
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Build Docker image for security scanning
run: |
docker build -t security-test:latest .
- name: Container vulnerability scan
uses: aquasecurity/trivy-action@master
with:
image-ref: 'security-test:latest'
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload Trivy scan results
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
      - name: Start application for DAST scan
        run: docker run -d -p 3000:3000 --name zap-target security-test:latest
      - name: OWASP ZAP baseline scan
        uses: zaproxy/action-baseline@v0.10.0
        with:
          target: 'http://localhost:3000'
# ================================
# BUILD & PACKAGE
# ================================
build-and-push:
name: Build & Push Container
runs-on: ubuntu-latest
needs: [code-quality, unit-tests, performance-tests, security-tests]
if: needs.code-quality.outputs.should-deploy == 'true'
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
image-digest: ${{ steps.build.outputs.digest }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=ref,event=branch
type=ref,event=pr
type=sha,prefix={{branch}}-
type=raw,value=latest,enable={{is_default_branch}}
- name: Build and push Docker image
id: build
uses: docker/build-push-action@v5
with:
context: .
platforms: linux/amd64,linux/arm64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
            BUILD_DATE=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.created'] }}
VCS_REF=${{ github.sha }}
      - name: Install cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign the published Docker image
        run: |
          # Keyless signing with the workflow's OIDC identity (cosign >= 2.0);
          # requires `permissions: id-token: write` on this job
          echo "${{ steps.meta.outputs.tags }}" | xargs -I {} cosign sign --yes {}@${{ steps.build.outputs.digest }}
# ================================
# CONTINUOUS DEPLOYMENT
# ================================
deploy-staging:
name: Deploy to Staging
runs-on: ubuntu-latest
needs: [build-and-push]
if: github.ref == 'refs/heads/main'
environment:
name: staging
url: https://staging.myapp.com
steps:
- name: Checkout GitOps repo
uses: actions/checkout@v4
with:
repository: myorg/gitops-config
token: ${{ secrets.GITOPS_TOKEN }}
path: gitops
- name: Update staging manifest
run: |
cd gitops
# Update the staging deployment with new image
yq eval '.spec.template.spec.containers[0].image = "${{ needs.build-and-push.outputs.image-tag }}"' \
-i staging/deployment.yaml
# Update configmap with build info
yq eval '.data.BUILD_SHA = "${{ github.sha }}"' \
-i staging/configmap.yaml
yq eval '.data.BUILD_DATE = "${{ github.event.head_commit.timestamp }}"' \
-i staging/configmap.yaml
- name: Commit and push changes
run: |
cd gitops
git config user.name "GitHub Actions"
git config user.email "[email protected]"
git add .
git commit -m "Deploy ${{ github.sha }} to staging"
git push
      - name: Wait for deployment
        run: |
          # Poll for the new version instead of a fixed sleep; `argocd app wait`
          # is the more robust option when CLI credentials are available
          for i in $(seq 1 30); do
            curl -fsS https://staging.myapp.com/health && exit 0
            sleep 10
          done
          echo "Staging did not become healthy in time" && exit 1
- name: Health check
run: |
# Verify deployment is healthy
curl -f https://staging.myapp.com/health || exit 1
      - name: Checkout application repo
        uses: actions/checkout@v4
        with:
          path: app
      - name: Run smoke tests
        run: |
          cd app
          npm ci
          npm run test:smoke -- --baseUrl=https://staging.myapp.com
e2e-tests:
name: End-to-End Tests
runs-on: ubuntu-latest
needs: deploy-staging
steps:
- name: Checkout code
uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Install Playwright browsers
        # microsoft/playwright-github-action is deprecated; install via the CLI
        run: npx playwright install --with-deps
      - name: Run E2E tests
        run: npm run test:e2e
env:
BASE_URL: https://staging.myapp.com
E2E_USERNAME: ${{ secrets.E2E_USERNAME }}
E2E_PASSWORD: ${{ secrets.E2E_PASSWORD }}
- name: Upload test results
uses: actions/upload-artifact@v3
if: always()
with:
name: playwright-results
path: test-results/
deploy-production:
name: Deploy to Production
runs-on: ubuntu-latest
needs: [build-and-push, e2e-tests]
if: github.ref == 'refs/heads/main'
environment:
name: production
url: https://myapp.com
steps:
- name: Checkout GitOps repo
uses: actions/checkout@v4
with:
repository: myorg/gitops-config
token: ${{ secrets.GITOPS_TOKEN }}
path: gitops
- name: Create production deployment PR
run: |
cd gitops
# Create a new branch for production deployment
git checkout -b deploy-prod-${{ github.sha }}
# Update production manifest
yq eval '.spec.template.spec.containers[0].image = "${{ needs.build-and-push.outputs.image-tag }}"' \
-i production/deployment.yaml
# Update configmap
yq eval '.data.BUILD_SHA = "${{ github.sha }}"' \
-i production/configmap.yaml
yq eval '.data.BUILD_DATE = "${{ github.event.head_commit.timestamp }}"' \
-i production/configmap.yaml
# Commit changes
git config user.name "GitHub Actions"
git config user.email "[email protected]"
git add .
git commit -m "Deploy ${{ github.sha }} to production"
git push -u origin deploy-prod-${{ github.sha }}
# Create PR for review
gh pr create \
            --title "Deploy ${{ github.sha }} to Production" \
--body "Automated deployment PR for commit ${{ github.sha }}" \
--reviewer team:platform-team \
--base main \
--head deploy-prod-${{ github.sha }}
env:
GITHUB_TOKEN: ${{ secrets.GITOPS_TOKEN }}
post-deployment:
name: Post-Deployment Verification
runs-on: ubuntu-latest
needs: deploy-production
steps:
      # NOTE: deploy-production only opens the GitOps PR; trigger this job after
      # that PR is merged and ArgoCD has synced, not merely after PR creation.
      - name: Production health check
        run: |
          # Wait for deployment to complete
          sleep 120
# Comprehensive health checks
curl -f https://myapp.com/health
curl -f https://myapp.com/api/health
curl -f https://myapp.com/metrics
      - name: Checkout application repo
        uses: actions/checkout@v4
      - name: Performance validation
        run: |
          # Run production performance tests
          npm ci
          npm run test:performance:prod
      # GitHub's `environment:` key on the deploy-production job already records
      # a deployment and its status, so no custom status-update step is needed.
- name: Notify team
uses: 8398a7/action-slack@v3
with:
status: success
          text: "Successfully deployed ${{ github.sha }} to production!"
env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
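The performance job above delegates regression detection to `scripts/performance-check.js`. The comparison that script performs might look like the following sketch (written in Python for illustration; the real script is Node.js, and the metric names and 10% tolerance here are assumptions, not the pipeline's actual values):

```python
# Hypothetical sketch of a baseline comparison like the one done by
# scripts/performance-check.js: flag any metric that degraded beyond tolerance.
def check_regressions(baseline: dict, current: dict, tolerance: float = 0.10):
    """Return a list of metrics that got worse by more than `tolerance`."""
    failures = []
    for metric, base_value in baseline.items():
        cur = current.get(metric)
        if cur is None:
            continue  # metric not collected in this run; skip rather than fail
        if cur > base_value * (1 + tolerance):
            failures.append(f"{metric}: {base_value} -> {cur}")
    return failures

baseline = {"p95_latency_ms": 180, "ttfb_ms": 90}
current = {"p95_latency_ms": 240, "ttfb_ms": 92}
failures = check_regressions(baseline, current)
# p95 latency regressed ~33% (beyond the 10% tolerance); TTFB is within bounds
```

In CI, a non-empty failure list would exit non-zero and fail the `performance-tests` job, blocking the build from reaching the deploy stage.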
Advanced Docker Configuration
# Multi-stage Dockerfile for optimized builds
FROM node:18-alpine AS base
RUN apk add --no-cache dumb-init
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci --omit=dev && npm cache clean --force
# Development stage
FROM base AS dev
ENV NODE_ENV=development
RUN npm ci
COPY . .
CMD ["dumb-init", "npm", "run", "dev"]
# Build stage
FROM base AS build
ENV NODE_ENV=production
COPY . .
RUN npm ci && npm run build && npm prune --omit=dev
# Production stage
FROM node:18-alpine AS production
RUN apk add --no-cache dumb-init curl
# Create app user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
WORKDIR /usr/src/app
COPY --from=build --chown=nodejs:nodejs /usr/src/app/dist ./dist
COPY --from=build --chown=nodejs:nodejs /usr/src/app/node_modules ./node_modules
COPY --from=build --chown=nodejs:nodejs /usr/src/app/package*.json ./
# Security labels
LABEL org.opencontainers.image.source="https://github.com/myorg/myapp"
LABEL org.opencontainers.image.vendor="MyOrg"
LABEL org.opencontainers.image.licenses="MIT"
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
USER nodejs
EXPOSE 3000
CMD ["dumb-init", "node", "dist/index.js"]
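The `HEALTHCHECK` above (and the Kubernetes probes in the next section) assume the app exposes `/health` and `/ready` endpoints. A minimal sketch of such endpoints, written in Python for illustration (the app itself is a Node.js server, so this is not its actual implementation):

```python
# Minimal sketch of /health (liveness) and /ready (readiness) endpoints.
# Hypothetical handler; the real app serves these from its Node.js server.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class HealthHandler(BaseHTTPRequestHandler):
    ready = True  # flip to False during draining to fail readiness only

    def do_GET(self):
        if self.path == "/health":          # liveness: the process is up
            self._respond(200, {"status": "ok"})
        elif self.path == "/ready":          # readiness: safe to receive traffic
            code = 200 if self.ready else 503
            self._respond(code, {"ready": self.ready})
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs
```

Separating liveness from readiness matters: a failing `/ready` pulls the pod out of the Service without restarting it, while a failing `/health` triggers a restart.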
Kubernetes Deployment Configuration
# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
namespace: production
labels:
app: myapp
version: v1
annotations:
deployment.kubernetes.io/revision: "1"
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 1
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
version: v1
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "3000"
prometheus.io/path: "/metrics"
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1001
fsGroup: 1001
containers:
- name: myapp
image: ghcr.io/myorg/myapp:latest
imagePullPolicy: IfNotPresent
ports:
- containerPort: 3000
name: http
protocol: TCP
env:
- name: NODE_ENV
value: "production"
- name: PORT
value: "3000"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: app-secrets
key: database-url
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: app-secrets
key: redis-url
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30
volumeMounts:
- name: app-config
mountPath: /usr/src/app/config
readOnly: true
- name: temp-storage
mountPath: /tmp
volumes:
- name: app-config
configMap:
name: app-config
- name: temp-storage
emptyDir: {}
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- myapp
topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
name: myapp-service
namespace: production
labels:
app: myapp
spec:
type: ClusterIP
ports:
- port: 80
targetPort: http
protocol: TCP
name: http
selector:
app: myapp
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: myapp-ingress
namespace: production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
- hosts:
- myapp.com
- www.myapp.com
secretName: myapp-tls
rules:
- host: myapp.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: myapp-service
port:
number: 80
- host: www.myapp.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: myapp-service
port:
number: 80
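The rolling-update settings above bound how far capacity can swing during a rollout: with `replicas: 5`, `maxSurge: 2`, and `maxUnavailable: 1`, Kubernetes keeps at least 4 pods ready and never runs more than 7 at once. The arithmetic:

```python
# Capacity bounds during a RollingUpdate, from the Deployment strategy above.
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int):
    max_total = replicas + max_surge            # old + new pods allowed at once
    min_available = replicas - max_unavailable  # floor on ready pods
    return max_total, min_available

max_total, min_available = rollout_bounds(replicas=5, max_surge=2, max_unavailable=1)
# -> 7 pods maximum in flight, 4 pods minimum available
```

Tuning these two knobs trades rollout speed against spare cluster capacity: `maxUnavailable: 0` gives a strict no-capacity-loss rollout at the cost of needing surge headroom on the nodes.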
ArgoCD Application Configuration
# argocd/applications/myapp-production.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp-production
namespace: argocd
labels:
environment: production
team: platform
spec:
project: production
source:
repoURL: https://github.com/myorg/gitops-config
targetRevision: HEAD
path: production
helm:
valueFiles:
- values.yaml
- values-production.yaml
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- Validate=true
- CreateNamespace=false
- PrunePropagationPolicy=foreground
- PruneLast=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m0s
revisionHistoryLimit: 10
ignoreDifferences:
- group: apps
kind: Deployment
jsonPointers:
- /spec/replicas
info:
- name: "Environment"
value: "Production"
- name: "Team"
value: "Platform Team"
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: production
namespace: argocd
spec:
description: "Production applications"
sourceRepos:
- 'https://github.com/myorg/gitops-config'
destinations:
- namespace: production
server: https://kubernetes.default.svc
clusterResourceWhitelist:
- group: ''
kind: Namespace
- group: 'rbac.authorization.k8s.io'
kind: ClusterRole
- group: 'rbac.authorization.k8s.io'
kind: ClusterRoleBinding
namespaceResourceWhitelist:
- group: '*'
kind: '*'
roles:
- name: production-admin
description: "Production admin access"
policies:
- p, proj:production:production-admin, applications, *, production/*, allow
- p, proj:production:production-admin, repositories, *, *, allow
groups:
- myorg:platform-team
- name: production-readonly
description: "Production read-only access"
policies:
- p, proj:production:production-readonly, applications, get, production/*, allow
groups:
- myorg:developers
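The sync retry policy above (base duration 5s, factor 2, limit 5, cap 3m) yields exponentially spaced retries. A quick sketch of the resulting schedule:

```python
# Retry delays produced by the ArgoCD backoff settings above:
# base 5s, factor 2, at most `limit` retries, each delay capped at maxDuration.
def backoff_delays(base: float, factor: float, limit: int, max_duration: float):
    return [min(base * factor**i, max_duration) for i in range(limit)]

delays = backoff_delays(base=5, factor=2, limit=5, max_duration=180)
# -> [5, 10, 20, 40, 80] seconds; the 3-minute cap only bites on later retries
```

With `limit: 5` a transient sync failure gets roughly two and a half minutes of retries before ArgoCD marks the operation failed and surfaces it in the UI.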
Advanced Monitoring & Alerting
# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: myapp-alerts
namespace: production
spec:
groups:
- name: myapp.deployment
interval: 30s
rules:
- alert: DeploymentReplicasMismatch
expr: |
(
kube_deployment_spec_replicas{namespace="production",deployment="myapp"}
!=
kube_deployment_status_replicas_available{namespace="production",deployment="myapp"}
) > 0
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Deployment replica mismatch for {{ $labels.deployment }}"
description: "Deployment {{ $labels.deployment }} has {{ $value }} replica(s) unavailable"
- alert: HighPodRestartRate
expr: |
rate(kube_pod_container_status_restarts_total{namespace="production"}[5m]) > 0.1
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "High pod restart rate for {{ $labels.pod }}"
description: "Pod {{ $labels.pod }} is restarting frequently"
- name: myapp.performance
interval: 30s
rules:
- alert: HighResponseTime
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="myapp"}[5m])) > 0.5
for: 2m
labels:
severity: warning
team: platform
annotations:
summary: "High response time for myapp"
description: "95th percentile response time is {{ $value }}s"
- alert: HighErrorRate
expr: |
(
rate(http_requests_total{job="myapp",status=~"5.."}[5m])
/
rate(http_requests_total{job="myapp"}[5m])
) > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate for myapp"
description: "Error rate is {{ $value | humanizePercentage }}"
- name: myapp.resources
interval: 30s
rules:
- alert: HighMemoryUsage
expr: |
(
container_memory_working_set_bytes{namespace="production",container="myapp"}
/
container_spec_memory_limit_bytes{namespace="production",container="myapp"}
) > 0.8
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High memory usage for {{ $labels.pod }}"
description: "Memory usage is {{ $value | humanizePercentage }}"
- alert: HighCPUUsage
        expr: |
          (
            rate(container_cpu_usage_seconds_total{namespace="production",container="myapp"}[5m])
            /
            (container_spec_cpu_quota{namespace="production",container="myapp"}
             / container_spec_cpu_period{namespace="production",container="myapp"})
          ) * 100 > 80
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High CPU usage for {{ $labels.pod }}"
description: "CPU usage is {{ $value }}%"
---
# Grafana Dashboard Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: myapp-dashboard
namespace: monitoring
data:
dashboard.json: |
{
"dashboard": {
"title": "MyApp Production Dashboard",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total{job=\"myapp\"}[5m])",
"legendFormat": "{{ method }} {{ status }}"
}
]
},
{
"title": "Response Time",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m]))",
"legendFormat": "50th percentile"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m]))",
"legendFormat": "95th percentile"
}
]
}
]
}
}
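Note that the alert rules above fire only after their expression has held for the full `for:` window; a momentary spike resets the pending timer. A small sketch of that semantics for the HighErrorRate rule (a 30s evaluation interval is assumed here for illustration):

```python
# Sketch of Prometheus' `for:` semantics for the HighErrorRate rule above:
# the alert fires only once the expression has been continuously true for the
# whole `for:` duration; any false evaluation resets the pending timer.
def alert_state(samples, threshold=0.05, for_duration=120, step=30):
    pending = 0
    states = []
    for error_rate in samples:
        if error_rate > threshold:
            pending += step
        else:
            pending = 0  # condition broke; back to inactive
        states.append("firing" if pending >= for_duration else
                      "pending" if pending > 0 else "inactive")
    return states

# Error rate every 30s: a 90s spike resets, then a sustained breach fires.
samples = [0.01, 0.08, 0.08, 0.08, 0.02, 0.09, 0.09, 0.09, 0.09]
states = alert_state(samples)
```

This is why the `for: 2m` clause matters: it filters out brief error bursts (a single bad deploy pod, a blip during scaling) so on-call engineers are paged only for sustained problems.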
Automated Rollback System
# scripts/auto-rollback.py
import os
import requests
import subprocess
import time
from datetime import datetime, timedelta
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class AutoRollback:
def __init__(self):
self.prometheus_url = os.getenv('PROMETHEUS_URL', 'http://prometheus:9090')
self.argocd_url = os.getenv('ARGOCD_URL', 'http://argocd:8080')
self.argocd_token = os.getenv('ARGOCD_TOKEN')
self.app_name = os.getenv('APP_NAME', 'myapp-production')
self.namespace = os.getenv('NAMESPACE', 'production')
# Rollback thresholds
self.error_rate_threshold = 0.05 # 5%
self.response_time_threshold = 1.0 # 1 second
self.check_interval = 60 # seconds
self.rollback_window = 300 # 5 minutes after deployment
def check_deployment_health(self):
"""Check if the current deployment is healthy"""
try:
# Get deployment time
deployment_time = self.get_last_deployment_time()
if not deployment_time:
return True
# Only check if deployment is within rollback window
time_diff = datetime.now() - deployment_time
if time_diff > timedelta(seconds=self.rollback_window):
return True
# Check error rate
error_rate = self.get_error_rate()
if error_rate > self.error_rate_threshold:
logger.error(f"High error rate detected: {error_rate:.2%}")
return False
# Check response time
response_time = self.get_response_time()
if response_time > self.response_time_threshold:
logger.error(f"High response time detected: {response_time:.2f}s")
return False
# Check pod health
unhealthy_pods = self.get_unhealthy_pods()
if unhealthy_pods > 0:
logger.error(f"Unhealthy pods detected: {unhealthy_pods}")
return False
logger.info("Deployment health check passed")
return True
except Exception as e:
logger.error(f"Health check failed: {str(e)}")
return False
def get_error_rate(self):
"""Get current error rate from Prometheus"""
query = f'''
(
rate(http_requests_total{{namespace="{self.namespace}",status=~"5.."}}[5m])
/
rate(http_requests_total{{namespace="{self.namespace}"}}[5m])
)
'''
        response = requests.get(
            f'{self.prometheus_url}/api/v1/query',
            params={'query': query},
            timeout=10
        )
data = response.json()
if data['data']['result']:
return float(data['data']['result'][0]['value'][1])
return 0.0
def get_response_time(self):
"""Get 95th percentile response time"""
query = f'''
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{{namespace="{self.namespace}"}}[5m])
)
'''
        response = requests.get(
            f'{self.prometheus_url}/api/v1/query',
            params={'query': query},
            timeout=10
        )
data = response.json()
if data['data']['result']:
return float(data['data']['result'][0]['value'][1])
return 0.0
def get_unhealthy_pods(self):
"""Get number of unhealthy pods"""
query = f'''
kube_pod_status_ready{{namespace="{self.namespace}",condition="false"}}
'''
        response = requests.get(
            f'{self.prometheus_url}/api/v1/query',
            params={'query': query},
            timeout=10
        )
data = response.json()
return len(data['data']['result'])
def get_last_deployment_time(self):
"""Get timestamp of last deployment"""
try:
            result = subprocess.run([
                'kubectl', 'get', 'deployment', 'myapp',
                '-n', self.namespace,
                '-o', r'jsonpath={.metadata.annotations.deployment\.kubernetes\.io/revision}'
            ], capture_output=True, text=True, check=True)
            # Simplified: in practice, parse the rollout timestamp from the
            # deployment's status conditions or its current ReplicaSet
            return datetime.now() - timedelta(minutes=2)
        except (subprocess.CalledProcessError, FileNotFoundError):
            return None
def rollback_deployment(self):
"""Rollback to previous version using ArgoCD"""
try:
logger.info(f"Initiating rollback for {self.app_name}")
# Get application history
            headers = {'Authorization': f'Bearer {self.argocd_token}'}
            # Get application history (endpoint paths shown here are
            # illustrative; check the ArgoCD API reference for your version)
            response = requests.get(
                f'{self.argocd_url}/api/v1/applications/{self.app_name}/revisions',
                headers=headers,
                timeout=10
            )
revisions = response.json()
if len(revisions) < 2:
logger.error("No previous revision available for rollback")
return False
# Get previous revision
previous_revision = revisions[1]['revision']
# Rollback application
rollback_request = {
'revision': previous_revision,
'prune': True,
'dryRun': False
}
            response = requests.post(
                f'{self.argocd_url}/api/v1/applications/{self.app_name}/rollback',
                headers=headers,
                json=rollback_request,
                timeout=30
            )
if response.status_code == 200:
logger.info(f"Rollback initiated successfully to revision {previous_revision}")
self.send_rollback_notification(previous_revision)
return True
else:
logger.error(f"Rollback failed: {response.text}")
return False
except Exception as e:
logger.error(f"Rollback operation failed: {str(e)}")
return False
def send_rollback_notification(self, revision):
"""Send notification about rollback"""
webhook_url = os.getenv('SLACK_WEBHOOK_URL')
if not webhook_url:
return
message = {
            "text": f"Automatic rollback triggered for {self.app_name}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
                        "text": f"*Automatic Rollback Executed*\n\n"
                                f"• Application: `{self.app_name}`\n"
                                f"• Rolled back to: `{revision}`\n"
                                f"• Reason: Health check failure\n"
                                f"• Time: {datetime.now().isoformat()}"
}
}
]
}
        requests.post(webhook_url, json=message, timeout=10)
def monitor_and_rollback(self):
"""Main monitoring loop"""
logger.info(f"Starting health monitoring for {self.app_name}")
while True:
try:
if not self.check_deployment_health():
logger.warning("Health check failed, initiating rollback")
if self.rollback_deployment():
logger.info("Rollback completed successfully")
# Exit after rollback
break
else:
logger.error("Rollback failed, manual intervention required")
time.sleep(self.check_interval)
except KeyboardInterrupt:
logger.info("Monitoring stopped by user")
break
except Exception as e:
logger.error(f"Monitoring error: {str(e)}")
time.sleep(self.check_interval)
if __name__ == "__main__":
monitor = AutoRollback()
monitor.monitor_and_rollback()
Infrastructure as Code with Terraform
# infrastructure/terraform/main.tf
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.16"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.8"
}
}
backend "s3" {
bucket = "myorg-terraform-state"
key = "production/terraform.tfstate"
region = "us-west-2"
encrypt = true
}
}
# EKS Cluster
module "eks" {
source = "terraform-aws-modules/eks/aws"
cluster_name = var.cluster_name
cluster_version = "1.28"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
cluster_endpoint_private_access = true
cluster_endpoint_public_access = true
  # NOTE: open to the world here for simplicity; restrict to known CIDRs in production
  cluster_endpoint_public_access_cidrs = ["0.0.0.0/0"]
cluster_addons = {
coredns = {
most_recent = true
}
kube-proxy = {
most_recent = true
}
vpc-cni = {
most_recent = true
}
aws-ebs-csi-driver = {
most_recent = true
}
}
eks_managed_node_groups = {
main = {
min_size = 3
max_size = 10
desired_size = 5
instance_types = ["m5.large", "m5.xlarge"]
k8s_labels = {
Environment = "production"
Application = "myapp"
}
tags = {
ExtraTag = "production-nodes"
}
}
}
tags = var.common_tags
}
# ArgoCD Installation
resource "helm_release" "argocd" {
name = "argocd"
repository = "https://argoproj.github.io/argo-helm"
chart = "argo-cd"
namespace = "argocd"
version = "5.51.6"
create_namespace = true
values = [
templatefile("${path.module}/argocd-values.yaml", {
hostname = "argocd.${var.domain_name}"
})
]
depends_on = [module.eks]
}
# Prometheus and Grafana
resource "helm_release" "kube_prometheus_stack" {
name = "kube-prometheus-stack"
repository = "https://prometheus-community.github.io/helm-charts"
chart = "kube-prometheus-stack"
namespace = "monitoring"
version = "55.5.0"
create_namespace = true
values = [
templatefile("${path.module}/prometheus-values.yaml", {
grafana_hostname = "grafana.${var.domain_name}"
prometheus_hostname = "prometheus.${var.domain_name}"
})
]
depends_on = [module.eks]
}
# External Secrets Operator
resource "helm_release" "external_secrets" {
name = "external-secrets"
repository = "https://charts.external-secrets.io"
chart = "external-secrets"
namespace = "external-secrets-system"
version = "0.9.11"
create_namespace = true
depends_on = [module.eks]
}
# AWS Load Balancer Controller
resource "helm_release" "aws_load_balancer_controller" {
name = "aws-load-balancer-controller"
repository = "https://aws.github.io/eks-charts"
chart = "aws-load-balancer-controller"
namespace = "kube-system"
version = "1.6.2"
set {
name = "clusterName"
value = module.eks.cluster_name
}
set {
name = "serviceAccount.create"
value = "true"
}
set {
name = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
value = aws_iam_role.aws_load_balancer_controller.arn
}
depends_on = [module.eks]
}
# RDS Database
resource "aws_db_instance" "main" {
identifier = "${var.project_name}-production"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.t3.medium"
allocated_storage = 100
max_allocated_storage = 1000
db_name = var.database_name
username = var.database_username
password = var.database_password
vpc_security_group_ids = [aws_security_group.rds.id]
db_subnet_group_name = aws_db_subnet_group.main.name
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
skip_final_snapshot = false
final_snapshot_identifier = "${var.project_name}-final-snapshot-${random_id.snapshot_suffix.hex}"
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_enhanced_monitoring.arn
tags = var.common_tags
}
# ElastiCache Redis
resource "aws_elasticache_replication_group" "main" {
replication_group_id = "${var.project_name}-redis"
description = "Redis cluster for ${var.project_name}"
num_cache_clusters = 2
node_type = "cache.t3.micro"
port = 6379
parameter_group_name = "default.redis7"
subnet_group_name = aws_elasticache_subnet_group.main.name
security_group_ids = [aws_security_group.redis.id]
at_rest_encryption_enabled = true
transit_encryption_enabled = true
tags = var.common_tags
}
# Outputs
output "cluster_endpoint" {
description = "Endpoint for EKS control plane"
value = module.eks.cluster_endpoint
}
output "cluster_security_group_id" {
description = "Security group ids attached to the cluster control plane"
value = module.eks.cluster_security_group_id
}
output "database_endpoint" {
description = "RDS instance endpoint"
value = aws_db_instance.main.endpoint
sensitive = true
}
output "redis_endpoint" {
description = "Redis cluster endpoint"
value = aws_elasticache_replication_group.main.primary_endpoint_address
sensitive = true
}
Conclusion
This advanced CI/CD pipeline implementation provides enterprise-grade capabilities:
Key Achievements
- Comprehensive Testing: Multi-stage testing from unit tests to E2E validation
- Security Integration: Built-in security scanning and vulnerability management
- GitOps Workflow: Git-based deployment with automated synchronization
- Zero-Downtime Deployments: Rolling updates with health checks and rollback
- Infrastructure as Code: Fully automated infrastructure provisioning
- Monitoring & Alerting: Real-time observability with automated responses
Enterprise Benefits
- Reliability: Automated testing and validation at every stage
- Security: Built-in security scanning and compliance checks
- Scalability: Container-based deployments with auto-scaling
- Observability: Comprehensive monitoring and alerting
- Governance: Git-based approvals and audit trails
- Recovery: Automated rollback and disaster recovery
Next Steps
- Customize for Your Stack: Adapt the pipeline for your technology choices
- Implement Gradually: Start with staging environments and expand
- Train Your Team: Ensure everyone understands GitOps principles
- Optimize Performance: Fine-tune based on your specific requirements
- Enhance Security: Add additional security scanning and compliance checks
This advanced CI/CD system transforms software delivery from a manual, error-prone process into a reliable, automated, and secure pipeline that scales with your organization's needs.
Ready to implement advanced CI/CD? Contact our DevOps consulting team for a comprehensive assessment and implementation roadmap tailored to your organization.