Deploying a Resilient AI Application on Kubernetes
A deep dive into the architecture for running scalable and fault-tolerant AI workloads on a Kubernetes cluster.
Introduction
Running stateful, resource-intensive AI applications in production requires more than just a powerful model: it demands resilient, scalable, and observable infrastructure. Kubernetes, with its mature orchestration capabilities, provides a strong foundation. In this post, we'll walk through best practices for deploying a production-grade AI application on Kubernetes.
Modern AI workloads present unique challenges: they're computationally intensive, require specialized hardware (GPUs/TPUs), handle large models and datasets, and need to serve real-time predictions at scale. Traditional deployment approaches often fall short when dealing with these requirements.
Architecture Overview
Our resilient AI application architecture follows cloud-native principles while addressing the specific needs of machine learning workloads.
Key Architectural Components
Our architecture consists of several key components working in tandem:
1. Containerization
The first step is to containerize our AI application using Docker with optimized base images and efficient model loading.
2. StatefulSets
For components that require stable network identifiers and persistent storage, such as model servers with cached models (a minimal StatefulSet sketch follows this list).
3. Horizontal Pod Autoscalers (HPA)
To automatically scale our inference services based on CPU, memory, or custom metrics like inference latency.
4. Custom Resource Definitions (CRDs)
For managing AI-specific resources like model versions, training jobs, and inference endpoints.
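As a concrete illustration of component 2, here is a minimal, hedged StatefulSet sketch for a model server with a per-replica cache volume. It is not part of the reference deployment used later in this post; the image name, port, and storage size are placeholders.

# statefulset-sketch.yaml (illustrative only; names and sizes are placeholders)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: model-cache-server
  namespace: ml-production
spec:
  serviceName: model-cache-server        # headless Service providing stable per-pod DNS names
  replicas: 2
  selector:
    matchLabels:
      app: model-cache-server
  template:
    metadata:
      labels:
        app: model-cache-server
    spec:
      containers:
        - name: model-server
          image: your-repo/model-cache-server:latest   # placeholder image
          ports:
            - containerPort: 8080
              name: http
          volumeMounts:
            - name: model-cache
              mountPath: /var/cache/models              # per-replica persistent model cache
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi

The volumeClaimTemplates give each replica its own persistent cache, and the headless Service provides the stable network identity mentioned above.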
Complete Kubernetes Deployment
Let's start with a comprehensive deployment configuration:
# ai-inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-app
  namespace: ml-production
  labels:
    app: ai-inference
    version: v1.2.0
    component: inference-server
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
        version: v1.2.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
        # Force pod restart when model configmap changes
        checksum/model-config: "{{ include (print $.Template.BasePath \"/model-configmap.yaml\") . | sha256sum }}"
    spec:
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      # Node selection for GPU nodes
      nodeSelector:
        accelerator: nvidia-tesla-t4
      # Tolerations for GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Pod anti-affinity for high availability
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - ai-inference
                topologyKey: kubernetes.io/hostname
      # Init container for model loading
      initContainers:
        - name: model-downloader
          image: alpine/curl:latest
          command:
            - sh
            - -c
            - |
              echo "Downloading model files..."
              curl -L $MODEL_URL -o /shared/model.pkl
              echo "Model download completed"
          env:
            - name: MODEL_URL
              valueFrom:
                configMapKeyRef:
                  name: model-config
                  key: model-url
          volumeMounts:
            - name: model-storage
              mountPath: /shared
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
      containers:
        - name: inference-container
          image: your-repo/ai-inference:v1.2.0
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
            - containerPort: 8090
              name: grpc
              protocol: TCP
          env:
            - name: MODEL_PATH
              value: "/app/models/model.pkl"
            - name: BATCH_SIZE
              value: "32"
            - name: MAX_WORKERS
              value: "4"
            - name: LOG_LEVEL
              value: "INFO"
            - name: PROMETHEUS_PORT
              value: "8080"
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: url
            - name: DB_CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: connection-string
          # Resource requirements with GPU
          resources:
            requests:
              memory: "4Gi"
              cpu: "1000m"
              nvidia.com/gpu: 1
            limits:
              memory: "8Gi"
              cpu: "2000m"
              nvidia.com/gpu: 1
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
            successThreshold: 1
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1
          # Startup probe for slow model loading
          startupProbe:
            httpGet:
              path: /startup
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30
            successThreshold: 1
          # Volume mounts
          volumeMounts:
            - name: model-storage
              mountPath: /app/models
              readOnly: true
            - name: cache-volume
              mountPath: /tmp/cache
            - name: config-volume
              mountPath: /app/config
              readOnly: true
            - name: logging-config
              mountPath: /app/logging.conf
              subPath: logging.conf
              readOnly: true
      # Volumes
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
        - name: cache-volume
          emptyDir:
            sizeLimit: 2Gi
        - name: config-volume
          configMap:
            name: inference-config
        - name: logging-config
          configMap:
            name: logging-config
      # Image pull secrets
      imagePullSecrets:
        - name: registry-secret
---
# Service for the inference application
apiVersion: v1
kind: Service
metadata:
  name: ai-inference-service
  namespace: ml-production
  labels:
    app: ai-inference
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
    - port: 9090
      targetPort: grpc
      protocol: TCP
      name: grpc
  selector:
    app: ai-inference
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
  namespace: ml-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric: inference latency
    - type: Pods
      pods:
        metric:
          name: inference_latency_p95
        target:
          type: AverageValue
          averageValue: "500m" # 500ms
    # Custom metric: queue depth
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
Model Storage and Management
For efficient model management, we use a combination of persistent volumes and model versioning:
# model-storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: ml-production
spec:
  accessModes:
    - ReadWriteOnce # use ReadWriteMany (with a compatible storage class) if replicas are scheduled across multiple nodes
  storageClassName: ssd-retain
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
  namespace: ml-production
data:
  model-url: "https://your-model-store.com/models/v1.2.0/model.pkl"
  model-version: "v1.2.0"
  model-checksum: "sha256:abc123..."
  batch-size: "32"
  max-sequence-length: "512"
---
# Model management CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: modelversions.ml.example.com
spec:
  group: ml.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                modelName:
                  type: string
                version:
                  type: string
                modelUri:
                  type: string
                framework:
                  type: string
                replicas:
                  type: integer
                resources:
                  type: object
            status:
              type: object
              properties:
                phase:
                  type: string
                message:
                  type: string
  scope: Namespaced
  names:
    plural: modelversions
    singular: modelversion
    kind: ModelVersion
---
# Model version instance
apiVersion: ml.example.com/v1
kind: ModelVersion
metadata:
  name: sentiment-model-v1-2-0
  namespace: ml-production
spec:
  modelName: sentiment-analyzer
  version: "v1.2.0"
  modelUri: "s3://ml-models/sentiment/v1.2.0/"
  framework: "pytorch"
  replicas: 3
  resources:
    requests:
      memory: "4Gi"
      cpu: "1000m"
      nvidia.com/gpu: 1
    limits:
      memory: "8Gi"
      cpu: "2000m"
      nvidia.com/gpu: 1
Advanced Configuration
ConfigMaps for Application Configuration
# inference-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-config
  namespace: ml-production
data:
  app.yaml: |
    server:
      host: 0.0.0.0
      port: 8080
      workers: 4
      timeout: 30
    model:
      batch_size: 32
      max_sequence_length: 512
      warmup_requests: 10
    cache:
      enabled: true
      ttl: 3600
      max_size: 1000
    monitoring:
      metrics_port: 8080
      health_check_interval: 30
    features:
      preprocessing: true
      postprocessing: true
      batch_inference: true
      streaming: false
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-config
  namespace: ml-production
data:
  logging.conf: |
    [loggers]
    keys=root,inference,model

    [handlers]
    keys=consoleHandler,fileHandler

    [formatters]
    keys=detailedFormatter

    [logger_root]
    level=INFO
    handlers=consoleHandler

    [logger_inference]
    level=DEBUG
    handlers=consoleHandler,fileHandler
    qualname=inference
    propagate=0

    [logger_model]
    level=INFO
    handlers=consoleHandler,fileHandler
    qualname=model
    propagate=0

    [handler_consoleHandler]
    class=StreamHandler
    level=INFO
    formatter=detailedFormatter
    args=(sys.stdout,)

    [handler_fileHandler]
    class=FileHandler
    level=DEBUG
    formatter=detailedFormatter
    args=('/var/log/app.log',)

    [formatter_detailedFormatter]
    format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
    datefmt=%Y-%m-%d %H:%M:%S
Secrets Management
# secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: ml-production
type: Opaque
stringData:
  connection-string: "postgresql://ml_user:secure_password@postgres-service:5432/ml_db?sslmode=require"
  username: "ml_user"
  password: "secure_password"
---
apiVersion: v1
kind: Secret
metadata:
  name: redis-secret
  namespace: ml-production
type: Opaque
stringData:
  url: "redis://redis-service:6379/0"
  password: "redis_secure_password"
---
apiVersion: v1
kind: Secret
metadata:
  name: registry-secret
  namespace: ml-production
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: eyJhdXRocyI6eyJ5b3VyLXJlZ2lzdHJ5LmNvbSI6eyJ1c2VybmFtZSI6InVzZXIiLCJwYXNzd29yZCI6InBhc3MiLCJhdXRoIjoiZFhObGNqcHdZWE56In19fQ==
Monitoring and Observability
Prometheus Monitoring
# monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-inference-monitor
  namespace: ml-production
  labels:
    app: ai-inference
spec:
  selector:
    matchLabels:
      app: ai-inference
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
---
# Grafana Dashboard ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-inference-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  ai-inference-dashboard.json: |
    {
      "dashboard": {
        "id": null,
        "title": "AI Inference Service",
        "tags": ["ai", "ml", "inference"],
        "timezone": "browser",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(http_requests_total{job=\"ai-inference\"}[5m])",
                "legendFormat": "{{ method }} {{ status }}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "title": "Inference Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m]))",
                "legendFormat": "95th percentile"
              },
              {
                "expr": "histogram_quantile(0.50, rate(inference_duration_seconds_bucket[5m]))",
                "legendFormat": "50th percentile"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL{pod=~\"ai-inference.*\"}",
                "legendFormat": "GPU {{ instance }}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
          },
          {
            "title": "Model Accuracy",
            "type": "stat",
            "targets": [
              {
                "expr": "model_accuracy{job=\"ai-inference\"}",
                "legendFormat": "Current Accuracy"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
          }
        ],
        "refresh": "30s"
      }
    }
---
# PrometheusRule for alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-alerts
  namespace: ml-production
spec:
  groups:
    - name: ai-inference.rules
      rules:
        - alert: AIInferenceHighLatency
          expr: histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m])) > 2.0
          for: 5m
          labels:
            severity: warning
            service: ai-inference
          annotations:
            summary: "AI Inference service has high latency"
            description: "95th percentile latency is {{ $value }}s for more than 5 minutes"
        - alert: AIInferenceHighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 2m
          labels:
            severity: critical
            service: ai-inference
          annotations:
            summary: "AI Inference service has high error rate"
            description: "Error rate is {{ $value | humanizePercentage }} for more than 2 minutes"
        - alert: ModelAccuracyDrop
          expr: model_accuracy < 0.85
          for: 10m
          labels:
            severity: critical
            service: ai-inference
          annotations:
            summary: "Model accuracy has dropped significantly"
            description: "Model accuracy is {{ $value | humanizePercentage }}, below 85% threshold"
        - alert: GPUMemoryHigh
          expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.9
          for: 5m
          labels:
            severity: warning
            service: ai-inference
          annotations:
            summary: "GPU memory usage is high"
            description: "GPU memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
Performance Optimization
Resource Management and Node Configuration
# gpu-node-setup.yaml
# Node labels and taints are normally applied by your cloud provider's node pool
# config or with `kubectl label/taint`; the Node manifest below illustrates the desired state.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-t4
    node-type: gpu-optimized
    zone: us-west-2a
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Priority Class for AI workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-workload-priority
value: 1000
globalDefault: false
description: "Priority class for AI inference workloads"
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-inference-pdb
  namespace: ml-production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ai-inference
Network Policies for Security
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-inference-netpol
  namespace: ml-production
spec:
  podSelector:
    matchLabels:
      app: ai-inference
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow traffic from istio gateway
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - protocol: TCP
          port: 8080
    # Allow traffic from monitoring
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow DNS resolution
    - to: []
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow access to Redis
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
    # Allow access to PostgreSQL
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    # Allow access to model storage (S3)
    - to: []
      ports:
        - protocol: TCP
          port: 443
Disaster Recovery and Backup
Backup Strategy
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-backup
  namespace: ml-production
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup-container
              image: amazon/aws-cli:latest
              command:
                - sh
                - -c
                - |
                  echo "Starting model backup..."
                  aws s3 sync /app/models s3://ml-model-backups/$(date +%Y-%m-%d)/
                  echo "Backup completed successfully"
              volumeMounts:
                - name: model-storage
                  mountPath: /app/models
                  readOnly: true
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key-id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-access-key
          volumes:
            - name: model-storage
              persistentVolumeClaim:
                claimName: model-pvc
CI/CD Pipeline Integration
GitOps Deployment
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-production
resources:
  - ai-inference-deployment.yaml
  - model-storage.yaml
  - secrets.yaml
  - monitoring.yaml
  - network-policy.yaml
images:
  - name: your-repo/ai-inference
    newTag: v1.2.0
configMapGenerator:
  - name: inference-config
    files:
      - configs/app.yaml
  - name: logging-config
    files:
      - configs/logging.conf
secretGenerator:
  - name: api-keys
    envs:
      - secrets/.env
patchesStrategicMerge:
  - patches/production-resources.yaml
Testing and Validation
Health Check Implementation
# health_checks.py
from flask import Flask, jsonify
import torch
import psutil
import time
import logging

app = Flask(__name__)
logger = logging.getLogger(__name__)


class HealthChecker:
    def __init__(self):
        self.startup_time = time.time()
        self.model_loaded = False
        self.model = None  # set by the inference service once the model has been loaded

    def check_model_health(self):
        """Check if model is loaded and functioning"""
        if self.model is None:
            return False, "Model not loaded"
        try:
            # Perform a quick inference test
            test_input = torch.randn(1, 10)  # Example input
            with torch.no_grad():
                output = self.model(test_input)
            return True, "Model is healthy"
        except Exception as e:
            return False, f"Model health check failed: {str(e)}"

    def check_gpu_health(self):
        """Check GPU availability and memory"""
        if not torch.cuda.is_available():
            return False, "CUDA not available"
        try:
            gpu_memory = torch.cuda.get_device_properties(0).total_memory
            gpu_used = torch.cuda.memory_allocated(0)
            gpu_utilization = (gpu_used / gpu_memory) * 100
            if gpu_utilization > 95:
                return False, f"GPU memory usage too high: {gpu_utilization:.1f}%"
            return True, f"GPU healthy, memory usage: {gpu_utilization:.1f}%"
        except Exception as e:
            return False, f"GPU check failed: {str(e)}"

    def check_system_resources(self):
        """Check system CPU and memory"""
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        if cpu_percent > 90:
            return False, f"CPU usage too high: {cpu_percent}%"
        if memory.percent > 90:
            return False, f"Memory usage too high: {memory.percent}%"
        return True, f"System healthy - CPU: {cpu_percent}%, Memory: {memory.percent}%"


health_checker = HealthChecker()


@app.route('/health')
def health():
    """Liveness probe endpoint"""
    checks = {
        'status': 'healthy',
        'timestamp': time.time(),
        'uptime': time.time() - health_checker.startup_time
    }
    # Basic health checks
    model_ok, model_msg = health_checker.check_model_health()
    system_ok, system_msg = health_checker.check_system_resources()
    checks['model'] = {'status': 'ok' if model_ok else 'error', 'message': model_msg}
    checks['system'] = {'status': 'ok' if system_ok else 'error', 'message': system_msg}
    if torch.cuda.is_available():
        gpu_ok, gpu_msg = health_checker.check_gpu_health()
        checks['gpu'] = {'status': 'ok' if gpu_ok else 'error', 'message': gpu_msg}
    overall_status = all([model_ok, system_ok])
    status_code = 200 if overall_status else 500
    return jsonify(checks), status_code


@app.route('/ready')
def ready():
    """Readiness probe endpoint"""
    if not health_checker.model_loaded:
        return jsonify({
            'status': 'not ready',
            'message': 'Model not loaded yet'
        }), 503
    return jsonify({
        'status': 'ready',
        'message': 'Service ready to accept requests'
    }), 200


@app.route('/startup')
def startup():
    """Startup probe endpoint: fail until the model has been loaded"""
    startup_time_elapsed = time.time() - health_checker.startup_time
    if not health_checker.model_loaded:
        return jsonify({
            'status': 'starting',
            'elapsed_time': startup_time_elapsed
        }), 503
    return jsonify({
        'status': 'started',
        'elapsed_time': startup_time_elapsed
    }), 200


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Best Practices and Lessons Learned
🏗️ Infrastructure Best Practices
- GPU Resource Management
  - Use node selectors and taints for GPU nodes
  - Implement proper resource requests and limits
  - Monitor GPU utilization and memory usage
- Model Management
  - Version your models consistently
  - Implement blue-green deployments for model updates
  - Use init containers for model downloading
- Scaling Strategy
  - Configure HPA with multiple metrics
  - Use predictive scaling for known traffic patterns
  - Implement queue-based scaling for batch processing (see the sketch after this list)
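To make the queue-based scaling point concrete, here is a minimal, hedged sketch using a KEDA ScaledObject. It assumes KEDA is installed and that a Prometheus metric such as inference_queue_depth (the same custom metric the HPA earlier scales on) is available; the batch-inference-workers Deployment name is a placeholder.

# keda-queue-scaling.yaml (illustrative sketch; assumes KEDA and Prometheus are available)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-inference-scaler
  namespace: ml-production
spec:
  scaleTargetRef:
    name: batch-inference-workers   # placeholder Deployment of queue consumers
  minReplicaCount: 0                # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:80
        query: sum(inference_queue_depth)
        threshold: "10"             # target queue items per replica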
⚡ Performance Optimization
- Model Optimization
  - Use model quantization and pruning
  - Implement batching for better GPU utilization
  - Cache frequently accessed models
- Resource Optimization
  - Profile memory usage and optimize accordingly
  - Use CPU nodes for preprocessing, GPU nodes for inference
  - Implement efficient data pipelines
🔒 Security Considerations
- Network Security
  - Implement network policies
  - Use service mesh for encrypted communication
  - Limit egress traffic to necessary services
- Data Protection
  - Encrypt data at rest and in transit
  - Implement proper RBAC (a minimal sketch follows this list)
  - Use secrets management for sensitive data
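To make the RBAC point concrete, here is a minimal, hedged sketch of a namespace-scoped ServiceAccount, Role, and RoleBinding that restricts the inference workload to read-only access to ConfigMaps and Secrets in its namespace. The ai-inference-sa name is illustrative and is not referenced by the manifests above.

# rbac-sketch.yaml (illustrative only)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-inference-sa
  namespace: ml-production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-inference-reader
  namespace: ml-production
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-inference-reader-binding
  namespace: ml-production
subjects:
  - kind: ServiceAccount
    name: ai-inference-sa
    namespace: ml-production
roleRef:
  kind: Role
  name: ai-inference-reader
  apiGroup: rbac.authorization.k8s.io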
Troubleshooting Common Issues
Model Loading Problems
# Check model file permissions
kubectl exec -it ai-inference-app-xxx -n ml-production -- ls -la /app/models/
# Check init container logs
kubectl logs ai-inference-app-xxx -c model-downloader -n ml-production
# Verify model configmap
kubectl describe configmap model-config -n ml-production
GPU Resource Issues
# Check GPU availability
kubectl describe nodes -l accelerator=nvidia-tesla-t4
# Check GPU plugin daemonset
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Monitor GPU usage
kubectl top nodes -l accelerator=nvidia-tesla-t4
Performance Bottlenecks
# Check HPA status
kubectl describe hpa ai-inference-hpa -n ml-production
# Monitor custom metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/ml-production/pods" | jq
# Check resource usage
kubectl top pods -n ml-production -l app=ai-inference
Key Takeaways
Deploying resilient AI applications on Kubernetes requires careful consideration of:
✅ Resource Management: Proper GPU allocation and scaling strategies
✅ Model Lifecycle: Version control, deployment, and rollback procedures
✅ Monitoring & Observability: Comprehensive metrics, logging, and alerting
✅ Security: Network policies, RBAC, and secrets management
✅ High Availability: Pod anti-affinity, disruption budgets, and multi-zone deployment
Advanced Patterns and Future Considerations
Multi-Model Serving Architecture
For organizations running multiple AI models, consider implementing a multi-model serving pattern:
# multi-model-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-server
  namespace: ml-production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: multi-model-server
  template:
    metadata:
      labels:
        app: multi-model-server
    spec:
      containers:
        - name: model-server
          image: tensorflow/serving:latest-gpu
          # TensorFlow Serving reads these as command-line flags
          args:
            - --model_config_file=/app/config/models.config
            - --monitoring_config_file=/app/config/monitoring.config
          ports:
            - containerPort: 8501
              name: http
            - containerPort: 8500
              name: grpc
          volumeMounts:
            - name: models-volume
              mountPath: /models
            - name: config-volume
              mountPath: /app/config
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: models-volume
          persistentVolumeClaim:
            claimName: models-shared-pvc
        - name: config-volume
          configMap:
            name: model-serving-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
  namespace: ml-production
data:
  models.config: |
    model_config_list {
      config {
        name: 'sentiment_analysis'
        base_path: '/models/sentiment'
        model_platform: 'tensorflow'
        model_version_policy: {
          specific {
            versions: 1
            versions: 2
          }
        }
      }
      config {
        name: 'image_classification'
        base_path: '/models/image_classifier'
        model_platform: 'tensorflow'
        model_version_policy: {
          latest {
            num_versions: 2
          }
        }
      }
      config {
        name: 'recommendation_engine'
        base_path: '/models/recommendations'
        model_platform: 'tensorflow'
        model_version_policy: {
          all {}
        }
      }
    }
  monitoring.config: |
    prometheus_config {
      enable: true,
      path: "/monitoring/prometheus/metrics"
    }
Canary Deployments for AI Models
Implement safe model rollouts using Argo Rollouts with automated analysis (Istio can be layered on top for finer-grained traffic splitting):
# canary-deployment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-inference-rollout
  namespace: ml-production
spec:
  replicas: 10
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      analysis:
        templates:
          - templateName: model-accuracy-analysis
        startingStep: 2
        args:
          - name: service-name
            value: ai-inference-canary
      steps:
        - setWeight: 10
        - pause: {duration: 30s}
        - setWeight: 20
        - pause: {duration: 30s}
        - analysis:
            templates:
              - templateName: model-performance-analysis
            args:
              - name: service-name
                value: ai-inference-canary
        - setWeight: 50
        - pause: {duration: 2m}
        - setWeight: 100
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
    spec:
      containers:
        - name: inference-container
          image: your-repo/ai-inference:v1.3.0
          # ... rest of container spec
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-accuracy-analysis
  namespace: ml-production
spec:
  args:
    - name: service-name
  metrics:
    - name: accuracy
      interval: 30s
      count: 10
      successCondition: result[0] >= 0.85
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local:80
          query: |
            model_accuracy{service="{{args.service-name}}"}
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.05
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local:80
          query: |
            rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]) /
            rate(http_requests_total{service="{{args.service-name}}"}[2m])
Edge AI Deployment
For edge computing scenarios, use lightweight deployments:
# edge-deployment.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edge-ai-inference
  namespace: edge-computing
spec:
  selector:
    matchLabels:
      app: edge-ai-inference
  template:
    metadata:
      labels:
        app: edge-ai-inference
    spec:
      nodeSelector:
        node-type: edge
      tolerations:
        - key: edge-node
          operator: Exists
      containers:
        - name: edge-inference
          image: your-repo/edge-ai-inference:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          env:
            - name: EDGE_MODE
              value: "true"
            - name: MODEL_COMPRESSION
              value: "quantized"
            - name: BATCH_SIZE
              value: "1" # Real-time inference
          volumeMounts:
            - name: edge-models
              mountPath: /app/models
            - name: local-cache
              mountPath: /tmp/cache
      volumes:
        - name: edge-models
          hostPath:
            path: /opt/edge-models
        - name: local-cache
          emptyDir:
            sizeLimit: 1Gi
Batch Processing Jobs
For training and batch inference workloads:
# batch-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-job-v1-3-0
  namespace: ml-training
spec:
  parallelism: 4
  completions: 1
  backoffLimit: 3
  ttlSecondsAfterFinished: 86400 # Clean up after 24 hours
  template:
    metadata:
      labels:
        job: model-training
        version: v1.3.0
    spec:
      restartPolicy: Never
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
        - name: trainer
          image: your-repo/model-trainer:latest
          command: ["python", "train.py"]
          args:
            - "--epochs=100"
            - "--batch-size=64"
            - "--learning-rate=0.001"
            - "--model-version=v1.3.0"
            - "--data-path=/data"
            - "--output-path=/models"
          resources:
            requests:
              nvidia.com/gpu: 2
              memory: "16Gi"
              cpu: "4000m"
            limits:
              nvidia.com/gpu: 2
              memory: "32Gi"
              cpu: "8000m"
          volumeMounts:
            - name: training-data
              mountPath: /data
              readOnly: true
            - name: model-output
              mountPath: /models
            - name: checkpoints
              mountPath: /checkpoints
          env:
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: wandb-secret
                  key: api-key
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: model-output
          persistentVolumeClaim:
            claimName: models-pvc
        - name: checkpoints
          emptyDir:
            sizeLimit: 50Gi
---
# Batch inference job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: batch-inference-job
  namespace: ml-production
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      parallelism: 10
      completions: 10
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: batch-processor
              image: your-repo/batch-inference:latest
              resources:
                requests:
                  nvidia.com/gpu: 1
                  memory: "8Gi"
                  cpu: "2000m"
                limits:
                  # GPUs are extended resources, so limits must be set (and equal to requests)
                  nvidia.com/gpu: 1
                  memory: "8Gi"
                  cpu: "2000m"
              env:
                - name: BATCH_SIZE
                  value: "1000"
                - name: INPUT_PATH
                  value: "s3://ml-data/batch-input/"
                - name: OUTPUT_PATH
                  value: "s3://ml-results/batch-output/"
Cost Optimization Strategies
Spot Instance Integration
# spot-node-pool.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spot-node-config
  namespace: kube-system
data:
  userdata: |
    #!/bin/bash
    /etc/eks/bootstrap.sh ml-cluster-spot \
      --container-runtime containerd \
      --kubelet-extra-args '--node-labels=spot=true,workload=ml-inference'
---
# Spot-tolerant deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-spot
  namespace: ml-production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: ai-inference-spot
  template:
    metadata:
      labels:
        app: ai-inference-spot
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        spot: "true"
      priorityClassName: spot-workload-priority
      containers:
        - name: inference-container
          # ... container spec
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - |
                    # Graceful shutdown
                    echo "Received SIGTERM, starting graceful shutdown..."
                    curl -X POST http://localhost:8080/shutdown
                    sleep 30
Resource Optimization
# vertical-pod-autoscaler.yaml
# Note: avoid running VPA in "Auto" mode alongside an HPA that scales on the same
# CPU/memory metrics; prefer "Off" (recommendation-only) mode in that case.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ai-inference-vpa
  namespace: ml-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: inference-container
        minAllowed:
          cpu: 500m
          memory: 2Gi
        maxAllowed:
          cpu: 4000m
          memory: 16Gi
        controlledResources: ["cpu", "memory"]
Security Hardening
Pod Security Standards
# pod-security-policy.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ml-production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Note: PodSecurityPolicy was removed in Kubernetes 1.25; on newer clusters rely on the
# Pod Security Standards labels above (or a policy engine such as Kyverno/Gatekeeper).
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: ai-inference-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
Service Mesh Security
# istio-security.yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: ai-inference-mtls
  namespace: ml-production
spec:
  selector:
    matchLabels:
      app: ai-inference
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ai-inference-authz
  namespace: ml-production
spec:
  selector:
    matchLabels:
      app: ai-inference
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway"]
        - source:
            namespaces: ["monitoring"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/predict", "/health", "/metrics"]
Disaster Recovery and Business Continuity
Multi-Region Deployment
# multi-region-deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-inference-global
  annotations:
    external-dns.alpha.kubernetes.io/hostname: ai-api.yourcompany.com
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: ai-inference
  ports:
    - port: 80
      targetPort: 8080
---
# Cross-region data replication
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-sync-job
spec:
  schedule: "*/30 * * * *" # Every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sync-models
              image: amazon/aws-cli:latest
              command:
                - sh
                - -c
                - |
                  # Sync models between regions
                  aws s3 sync s3://ml-models-us-west-2/ s3://ml-models-us-east-1/ --delete
                  aws s3 sync s3://ml-models-us-west-2/ s3://ml-models-eu-west-1/ --delete
          restartPolicy: OnFailure
Performance Benchmarking
Load Testing Configuration
# load-test-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ai-inference-load-test
  namespace: ml-testing
spec:
  parallelism: 10
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: load-tester
          image: loadimpact/k6:latest
          command: ["k6", "run", "--vus=100", "--duration=5m", "/scripts/load-test.js"]
          volumeMounts:
            - name: test-scripts
              mountPath: /scripts
          env:
            - name: TARGET_URL
              value: "http://ai-inference-service.ml-production.svc.cluster.local"
      volumes:
        - name: test-scripts
          configMap:
            name: load-test-scripts
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: load-test-scripts
  namespace: ml-testing
data:
  load-test.js: |
    import http from 'k6/http';
    import { check, sleep } from 'k6';

    export let options = {
      vus: 100,
      duration: '5m',
      thresholds: {
        http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
        http_req_failed: ['rate<0.05'],    // Error rate under 5%
      },
    };

    export default function() {
      const payload = JSON.stringify({
        text: "This is a sample text for sentiment analysis",
        model_version: "v1.2.0"
      });
      const params = {
        headers: {
          'Content-Type': 'application/json',
        },
      };
      let response = http.post(`${__ENV.TARGET_URL}/predict`, payload, params);
      check(response, {
        'status is 200': (r) => r.status === 200,
        'response time < 2000ms': (r) => r.timings.duration < 2000,
        'accuracy > 0.8': (r) => JSON.parse(r.body).confidence > 0.8,
      });
      sleep(Math.random() * 2);
    }
Conclusion
Deploying resilient AI applications on Kubernetes is a complex but rewarding endeavor. By following the patterns and practices outlined in this guide, you can build:
🎯 Production-Ready AI Systems
- Scalable Infrastructure: Auto-scaling based on demand and custom metrics
- High Availability: Multi-zone deployments with disaster recovery
- Performance Optimization: GPU utilization and efficient resource management
- Security Hardening: Network policies, mTLS, and Pod Security Standards
📊 Operational Excellence
- Comprehensive Monitoring: Metrics, logging, and alerting for all components
- Automated Operations: CI/CD pipelines and GitOps workflows
- Cost Optimization: Spot instances and right-sizing strategies
- Testing & Validation: Load testing and canary deployments
🔮 Future-Proof Architecture
- Multi-Model Serving: Support for various AI/ML frameworks
- Edge Computing: Lightweight deployments for edge scenarios
- Batch Processing: Training and batch inference capabilities
- Multi-Region: Global deployment and data replication
The key to success lies in starting simple, measuring everything, and iteratively improving your deployment based on real-world usage patterns and requirements.
🚀 Next Steps
- Start Small: Begin with a single model deployment and gradually add complexity
- Monitor Everything: Implement comprehensive observability from day one
- Automate Relentlessly: Build CI/CD pipelines and automated testing
- Plan for Scale: Design for growth from the beginning
- Stay Secure: Implement security best practices throughout the stack
With this foundation, you're well-equipped to deploy and operate resilient AI applications that can handle production workloads at scale while maintaining high availability, security, and performance.
Need help implementing AI on Kubernetes? Our team specializes in cloud-native AI deployments and can help you build production-ready systems. Contact us for a consultation on your AI infrastructure needs.