Deploying a Resilient AI Application on Kubernetes
A deep dive into the architecture for running scalable and fault-tolerant AI workloads on a Kubernetes cluster.
Introduction
Running stateful, resource-intensive AI applications in production requires more than just a powerful model: it demands resilient, scalable, and observable infrastructure. Kubernetes, with its mature orchestration capabilities, provides a strong foundation. In this post, we'll walk through best practices for deploying a production-grade AI application on Kubernetes.
Modern AI workloads present unique challenges: they're computationally intensive, require specialized hardware (GPUs/TPUs), handle large models and datasets, and need to serve real-time predictions at scale. Traditional deployment approaches often fall short when dealing with these requirements.
Architecture Overview
Our resilient AI application architecture follows cloud-native principles while addressing the specific needs of machine learning workloads.
Key Architectural Components
Our architecture consists of several key components working in tandem:
1. Containerization
The first step is to containerize our AI application using Docker with optimized base images and efficient model loading.
2. StatefulSets
For components that require stable network identifiers and persistent storage, such as model servers with cached models (a minimal StatefulSet sketch follows this list).
3. Horizontal Pod Autoscalers (HPA)
To automatically scale our inference services based on CPU, memory, or custom metrics like inference latency.
4. Custom Resource Definitions (CRDs)
For managing AI-specific resources like model versions, training jobs, and inference endpoints.
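As a concrete illustration of component 2, here is a minimal, hedged StatefulSet sketch for a model server with a per-replica cache volume. It is not part of the reference deployment used later in this post; the image name, port, and storage size are placeholders.

# statefulset-sketch.yaml (illustrative only; names and sizes are placeholders)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: model-cache-server
  namespace: ml-production
spec:
  serviceName: model-cache-server        # headless Service providing stable per-pod DNS names
  replicas: 2
  selector:
    matchLabels:
      app: model-cache-server
  template:
    metadata:
      labels:
        app: model-cache-server
    spec:
      containers:
        - name: model-server
          image: your-repo/model-cache-server:latest   # placeholder image
          ports:
            - containerPort: 8080
              name: http
          volumeMounts:
            - name: model-cache
              mountPath: /var/cache/models              # per-replica persistent model cache
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi

The volumeClaimTemplates give each replica its own persistent cache, and the headless Service provides the stable network identity mentioned above.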
Complete Kubernetes Deployment
Let's start with a comprehensive deployment configuration:
# ai-inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-app
  namespace: ml-production
  labels:
    app: ai-inference
    version: v1.2.0
    component: inference-server
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
        version: v1.2.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
        # Force pod restart when model configmap changes
        checksum/model-config: "{{ include (print $.Template.BasePath \"/model-configmap.yaml\") . | sha256sum }}"
    spec:
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      # Node selection for GPU nodes
      nodeSelector:
        accelerator: nvidia-tesla-t4
      # Tolerations for GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Pod anti-affinity for high availability
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - ai-inference
                topologyKey: kubernetes.io/hostname
      # Init container for model loading
      initContainers:
        - name: model-downloader
          image: alpine/curl:latest
          command:
            - sh
            - -c
            - |
              echo "Downloading model files..."
              curl -L $MODEL_URL -o /shared/model.pkl
              echo "Model download completed"
          env:
            - name: MODEL_URL
              valueFrom:
                configMapKeyRef:
                  name: model-config
                  key: model-url
          volumeMounts:
            - name: model-storage
              mountPath: /shared
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
      containers:
        - name: inference-container
          image: your-repo/ai-inference:v1.2.0
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
            - containerPort: 8090
              name: grpc
              protocol: TCP
          env:
            - name: MODEL_PATH
              value: "/app/models/model.pkl"
            - name: BATCH_SIZE
              value: "32"
            - name: MAX_WORKERS
              value: "4"
            - name: LOG_LEVEL
              value: "INFO"
            - name: PROMETHEUS_PORT
              value: "8080"
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: url
            - name: DB_CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: connection-string
          # Resource requirements with GPU
          resources:
            requests:
              memory: "4Gi"
              cpu: "1000m"
              nvidia.com/gpu: 1
            limits:
              memory: "8Gi"
              cpu: "2000m"
              nvidia.com/gpu: 1
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
            successThreshold: 1
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1
          # Startup probe for slow model loading
          startupProbe:
            httpGet:
              path: /startup
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30
            successThreshold: 1
          # Volume mounts
          volumeMounts:
            - name: model-storage
              mountPath: /app/models
              readOnly: true
            - name: cache-volume
              mountPath: /tmp/cache
            - name: config-volume
              mountPath: /app/config
              readOnly: true
            - name: logging-config
              mountPath: /app/logging.conf
              subPath: logging.conf
              readOnly: true
      # Volumes
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
        - name: cache-volume
          emptyDir:
            sizeLimit: 2Gi
        - name: config-volume
          configMap:
            name: inference-config
        - name: logging-config
          configMap:
            name: logging-config
      # Image pull secrets
      imagePullSecrets:
        - name: registry-secret
---
# Service for the inference application
apiVersion: v1
kind: Service
metadata:
  name: ai-inference-service
  namespace: ml-production
  labels:
    app: ai-inference
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
    - port: 9090
      targetPort: grpc
      protocol: TCP
      name: grpc
  selector:
    app: ai-inference
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
  namespace: ml-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric: inference latency
    - type: Pods
      pods:
        metric:
          name: inference_latency_p95
        target:
          type: AverageValue
          averageValue: "500m" # 500ms
    # Custom metric: queue depth
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
Model Storage and Management
For efficient model management, we use a combination of persistent volumes and model versioning:
# model-storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: ml-production
spec:
  accessModes:
    - ReadWriteOnce # use ReadWriteMany (with a compatible storage class) if replicas are scheduled across multiple nodes
  storageClassName: ssd-retain
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
  namespace: ml-production
data:
  model-url: "https://your-model-store.com/models/v1.2.0/model.pkl"
  model-version: "v1.2.0"
  model-checksum: "sha256:abc123..."
  batch-size: "32"
  max-sequence-length: "512"
---
# Model management CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: modelversions.ml.example.com
spec:
  group: ml.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                modelName:
                  type: string
                version:
                  type: string
                modelUri:
                  type: string
                framework:
                  type: string
                replicas:
                  type: integer
                resources:
                  type: object
            status:
              type: object
              properties:
                phase:
                  type: string
                message:
                  type: string
  scope: Namespaced
  names:
    plural: modelversions
    singular: modelversion
    kind: ModelVersion
---
# Model version instance
apiVersion: ml.example.com/v1
kind: ModelVersion
metadata:
  name: sentiment-model-v1-2-0
  namespace: ml-production
spec:
  modelName: sentiment-analyzer
  version: "v1.2.0"
  modelUri: "s3://ml-models/sentiment/v1.2.0/"
  framework: "pytorch"
  replicas: 3
  resources:
    requests:
      memory: "4Gi"
      cpu: "1000m"
      nvidia.com/gpu: 1
    limits:
      memory: "8Gi"
      cpu: "2000m"
      nvidia.com/gpu: 1
Advanced Configuration
ConfigMaps for Application Configuration
# inference-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-config
  namespace: ml-production
data:
  app.yaml: |
    server:
      host: 0.0.0.0
      port: 8080
      workers: 4
      timeout: 30
    model:
      batch_size: 32
      max_sequence_length: 512
      warmup_requests: 10
    cache:
      enabled: true
      ttl: 3600
      max_size: 1000
    monitoring:
      metrics_port: 8080
      health_check_interval: 30
    features:
      preprocessing: true
      postprocessing: true
      batch_inference: true
      streaming: false
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-config
  namespace: ml-production
data:
  logging.conf: |
    [loggers]
    keys=root,inference,model

    [handlers]
    keys=consoleHandler,fileHandler

    [formatters]
    keys=detailedFormatter

    [logger_root]
    level=INFO
    handlers=consoleHandler

    [logger_inference]
    level=DEBUG
    handlers=consoleHandler,fileHandler
    qualname=inference
    propagate=0

    [logger_model]
    level=INFO
    handlers=consoleHandler,fileHandler
    qualname=model
    propagate=0

    [handler_consoleHandler]
    class=StreamHandler
    level=INFO
    formatter=detailedFormatter
    args=(sys.stdout,)

    [handler_fileHandler]
    class=FileHandler
    level=DEBUG
    formatter=detailedFormatter
    args=('/var/log/app.log',)

    [formatter_detailedFormatter]
    format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
    datefmt=%Y-%m-%d %H:%M:%S
Secrets Management
# secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: ml-production
type: Opaque
stringData:
  connection-string: "postgresql://ml_user:secure_password@postgres-service:5432/ml_db?sslmode=require"
  username: "ml_user"
  password: "secure_password"
---
apiVersion: v1
kind: Secret
metadata:
  name: redis-secret
  namespace: ml-production
type: Opaque
stringData:
  url: "redis://redis-service:6379/0"
  password: "redis_secure_password"
---
apiVersion: v1
kind: Secret
metadata:
  name: registry-secret
  namespace: ml-production
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: eyJhdXRocyI6eyJ5b3VyLXJlZ2lzdHJ5LmNvbSI6eyJ1c2VybmFtZSI6InVzZXIiLCJwYXNzd29yZCI6InBhc3MiLCJhdXRoIjoiZFhObGNqcHdZWE56In19fQ==
Monitoring and Observability
Prometheus Monitoring
# monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-inference-monitor
  namespace: ml-production
  labels:
    app: ai-inference
spec:
  selector:
    matchLabels:
      app: ai-inference
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
---
# Grafana Dashboard ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-inference-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  ai-inference-dashboard.json: |
    {
      "dashboard": {
        "id": null,
        "title": "AI Inference Service",
        "tags": ["ai", "ml", "inference"],
        "timezone": "browser",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(http_requests_total{job=\"ai-inference\"}[5m])",
                "legendFormat": "{{ method }} {{ status }}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "title": "Inference Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m]))",
                "legendFormat": "95th percentile"
              },
              {
                "expr": "histogram_quantile(0.50, rate(inference_duration_seconds_bucket[5m]))",
                "legendFormat": "50th percentile"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL{pod=~\"ai-inference.*\"}",
                "legendFormat": "GPU {{ instance }}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
          },
          {
            "title": "Model Accuracy",
            "type": "stat",
            "targets": [
              {
                "expr": "model_accuracy{job=\"ai-inference\"}",
                "legendFormat": "Current Accuracy"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
          }
        ],
        "refresh": "30s"
      }
    }
---
# PrometheusRule for alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-alerts
  namespace: ml-production
spec:
  groups:
    - name: ai-inference.rules
      rules:
        - alert: AIInferenceHighLatency
          expr: histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m])) > 2.0
          for: 5m
          labels:
            severity: warning
            service: ai-inference
          annotations:
            summary: "AI Inference service has high latency"
            description: "95th percentile latency is {{ $value }}s for more than 5 minutes"
        - alert: AIInferenceHighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
          for: 2m
          labels:
            severity: critical
            service: ai-inference
          annotations:
            summary: "AI Inference service has high error rate"
            description: "Error rate is {{ $value | humanizePercentage }} for more than 2 minutes"
        - alert: ModelAccuracyDrop
          expr: model_accuracy < 0.85
          for: 10m
          labels:
            severity: critical
            service: ai-inference
          annotations:
            summary: "Model accuracy has dropped significantly"
            description: "Model accuracy is {{ $value | humanizePercentage }}, below 85% threshold"
        - alert: GPUMemoryHigh
          expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.9
          for: 5m
          labels:
            severity: warning
            service: ai-inference
          annotations:
            summary: "GPU memory usage is high"
            description: "GPU memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
Performance Optimization
Resource Management and Node Configuration
# gpu-node-setup.yaml
# Node labels and taints are normally applied by your cloud provider's node pool
# config or with `kubectl label/taint`; the Node manifest below illustrates the desired state.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-t4
    node-type: gpu-optimized
    zone: us-west-2a
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Priority Class for AI workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-workload-priority
value: 1000
globalDefault: false
description: "Priority class for AI inference workloads"
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-inference-pdb
  namespace: ml-production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ai-inference
Network Policies for Security
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-inference-netpol
  namespace: ml-production
spec:
  podSelector:
    matchLabels:
      app: ai-inference
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow traffic from istio gateway
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - protocol: TCP
          port: 8080
    # Allow traffic from monitoring
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow DNS resolution
    - to: []
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow access to Redis
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
    # Allow access to PostgreSQL
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    # Allow access to model storage (S3)
    - to: []
      ports:
        - protocol: TCP
          port: 443
Disaster Recovery and Backup
Backup Strategy
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-backup
  namespace: ml-production
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup-container
              image: amazon/aws-cli:latest
              command:
                - sh
                - -c
                - |
                  echo "Starting model backup..."
                  aws s3 sync /app/models s3://ml-model-backups/$(date +%Y-%m-%d)/
                  echo "Backup completed successfully"
              volumeMounts:
                - name: model-storage
                  mountPath: /app/models
                  readOnly: true
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key-id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-access-key
          volumes:
            - name: model-storage
              persistentVolumeClaim:
                claimName: model-pvc
CI/CD Pipeline Integration
GitOps Deployment
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-production
resources:
  - ai-inference-deployment.yaml
  - model-storage.yaml
  - secrets.yaml
  - monitoring.yaml
  - network-policy.yaml
images:
  - name: your-repo/ai-inference
    newTag: v1.2.0
configMapGenerator:
  - name: inference-config
    files:
      - configs/app.yaml
  - name: logging-config
    files:
      - configs/logging.conf
secretGenerator:
  - name: api-keys
    envs:
      - secrets/.env
patchesStrategicMerge:
  - patches/production-resources.yaml
Testing and Validation
Health Check Implementation
# health_checks.py
from flask import Flask, jsonify
import torch
import psutil
import time
import logging

app = Flask(__name__)
logger = logging.getLogger(__name__)


class HealthChecker:
    def __init__(self):
        self.startup_time = time.time()
        self.model_loaded = False
        self.model = None  # set by the inference service once the model has been loaded

    def check_model_health(self):
        """Check if model is loaded and functioning"""
        if self.model is None:
            return False, "Model not loaded"
        try:
            # Perform a quick inference test
            test_input = torch.randn(1, 10)  # Example input
            with torch.no_grad():
                output = self.model(test_input)
            return True, "Model is healthy"
        except Exception as e:
            return False, f"Model health check failed: {str(e)}"

    def check_gpu_health(self):
        """Check GPU availability and memory"""
        if not torch.cuda.is_available():
            return False, "CUDA not available"
        try:
            gpu_memory = torch.cuda.get_device_properties(0).total_memory
            gpu_used = torch.cuda.memory_allocated(0)
            gpu_utilization = (gpu_used / gpu_memory) * 100
            if gpu_utilization > 95:
                return False, f"GPU memory usage too high: {gpu_utilization:.1f}%"
            return True, f"GPU healthy, memory usage: {gpu_utilization:.1f}%"
        except Exception as e:
            return False, f"GPU check failed: {str(e)}"

    def check_system_resources(self):
        """Check system CPU and memory"""
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        if cpu_percent > 90:
            return False, f"CPU usage too high: {cpu_percent}%"
        if memory.percent > 90:
            return False, f"Memory usage too high: {memory.percent}%"
        return True, f"System healthy - CPU: {cpu_percent}%, Memory: {memory.percent}%"


health_checker = HealthChecker()


@app.route('/health')
def health():
    """Liveness probe endpoint"""
    checks = {
        'status': 'healthy',
        'timestamp': time.time(),
        'uptime': time.time() - health_checker.startup_time
    }
    # Basic health checks
    model_ok, model_msg = health_checker.check_model_health()
    system_ok, system_msg = health_checker.check_system_resources()
    checks['model'] = {'status': 'ok' if model_ok else 'error', 'message': model_msg}
    checks['system'] = {'status': 'ok' if system_ok else 'error', 'message': system_msg}
    if torch.cuda.is_available():
        gpu_ok, gpu_msg = health_checker.check_gpu_health()
        checks['gpu'] = {'status': 'ok' if gpu_ok else 'error', 'message': gpu_msg}
    overall_status = all([model_ok, system_ok])
    status_code = 200 if overall_status else 500
    return jsonify(checks), status_code


@app.route('/ready')
def ready():
    """Readiness probe endpoint"""
    if not health_checker.model_loaded:
        return jsonify({
            'status': 'not ready',
            'message': 'Model not loaded yet'
        }), 503
    return jsonify({
        'status': 'ready',
        'message': 'Service ready to accept requests'
    }), 200


@app.route('/startup')
def startup():
    """Startup probe endpoint: fail until the model has been loaded"""
    startup_time_elapsed = time.time() - health_checker.startup_time
    if not health_checker.model_loaded:
        return jsonify({
            'status': 'starting',
            'elapsed_time': startup_time_elapsed
        }), 503
    return jsonify({
        'status': 'started',
        'elapsed_time': startup_time_elapsed
    }), 200


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Best Practices and Lessons Learned
🏗️ Infrastructure Best Practices
- GPU Resource Management
  - Use node selectors and taints for GPU nodes
  - Implement proper resource requests and limits
  - Monitor GPU utilization and memory usage
- Model Management
  - Version your models consistently
  - Implement blue-green deployments for model updates
  - Use init containers for model downloading
- Scaling Strategy
  - Configure HPA with multiple metrics
  - Use predictive scaling for known traffic patterns
  - Implement queue-based scaling for batch processing (see the sketch after this list)
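To make the queue-based scaling point concrete, here is a minimal, hedged sketch using a KEDA ScaledObject. It assumes KEDA is installed and that a Prometheus metric such as inference_queue_depth (the same custom metric the HPA earlier scales on) is available; the batch-inference-workers Deployment name is a placeholder.

# keda-queue-scaling.yaml (illustrative sketch; assumes KEDA and Prometheus are available)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-inference-scaler
  namespace: ml-production
spec:
  scaleTargetRef:
    name: batch-inference-workers   # placeholder Deployment of queue consumers
  minReplicaCount: 0                # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:80
        query: sum(inference_queue_depth)
        threshold: "10"             # target queue items per replica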
⚡ Performance Optimization
- Model Optimization
  - Use model quantization and pruning
  - Implement batching for better GPU utilization
  - Cache frequently accessed models
- Resource Optimization
  - Profile memory usage and optimize accordingly
  - Use CPU nodes for preprocessing, GPU nodes for inference
  - Implement efficient data pipelines
🔒 Security Considerations
- Network Security
  - Implement network policies
  - Use service mesh for encrypted communication
  - Limit egress traffic to necessary services
- Data Protection
  - Encrypt data at rest and in transit
  - Implement proper RBAC (a minimal sketch follows this list)
  - Use secrets management for sensitive data
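To make the RBAC point concrete, here is a minimal, hedged sketch of a namespace-scoped ServiceAccount, Role, and RoleBinding that restricts the inference workload to read-only access to ConfigMaps and Secrets in its namespace. The ai-inference-sa name is illustrative and is not referenced by the manifests above.

# rbac-sketch.yaml (illustrative only)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-inference-sa
  namespace: ml-production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-inference-reader
  namespace: ml-production
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-inference-reader-binding
  namespace: ml-production
subjects:
  - kind: ServiceAccount
    name: ai-inference-sa
    namespace: ml-production
roleRef:
  kind: Role
  name: ai-inference-reader
  apiGroup: rbac.authorization.k8s.io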
Troubleshooting Common Issues
Model Loading Problems
# Check model file permissions
kubectl exec -it ai-inference-app-xxx -n ml-production -- ls -la /app/models/
# Check init container logs
kubectl logs ai-inference-app-xxx -c model-downloader -n ml-production
# Verify model configmap
kubectl describe configmap model-config -n ml-production
GPU Resource Issues
# Check GPU availability
kubectl describe nodes -l accelerator=nvidia-tesla-t4
# Check GPU plugin daemonset
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Monitor GPU usage
kubectl top nodes -l accelerator=nvidia-tesla-t4
Performance Bottlenecks
# Check HPA status
kubectl describe hpa ai-inference-hpa -n ml-production
# Monitor custom metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/ml-production/pods" | jq
# Check resource usage
kubectl top pods -n ml-production -l app=ai-inference
Key Takeaways
Deploying resilient AI applications on Kubernetes requires careful consideration of:
✅ Resource Management: Proper GPU allocation and scaling strategies
✅ Model Lifecycle: Version control, deployment, and rollback procedures
✅ Monitoring & Observability: Comprehensive metrics, logging, and alerting
✅ Security: Network policies, RBAC, and secrets management
✅ High Availability: Pod anti-affinity, disruption budgets, and multi-zone deployment
Advanced Patterns and Future Considerations
Multi-Model Serving Architecture
For organizations running multiple AI models, consider implementing a multi-model serving pattern:
# multi-model-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-server
  namespace: ml-production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: multi-model-server
  template:
    metadata:
      labels:
        app: multi-model-server
    spec:
      containers:
        - name: model-server
          image: tensorflow/serving:latest-gpu
          # TensorFlow Serving reads these as command-line flags
          args:
            - --model_config_file=/app/config/models.config
            - --monitoring_config_file=/app/config/monitoring.config
          ports:
            - containerPort: 8501
              name: http
            - containerPort: 8500
              name: grpc
          volumeMounts:
            - name: models-volume
              mountPath: /models
            - name: config-volume
              mountPath: /app/config
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: models-volume
          persistentVolumeClaim:
            claimName: models-shared-pvc
        - name: config-volume
          configMap:
            name: model-serving-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
  namespace: ml-production
data:
  models.config: |
    model_config_list {
      config {
        name: 'sentiment_analysis'
        base_path: '/models/sentiment'
        model_platform: 'tensorflow'
        model_version_policy: {
          specific {
            versions: 1
            versions: 2
          }
        }
      }
      config {
        name: 'image_classification'
        base_path: '/models/image_classifier'
        model_platform: 'tensorflow'
        model_version_policy: {
          latest {
            num_versions: 2
          }
        }
      }
      config {
        name: 'recommendation_engine'
        base_path: '/models/recommendations'
        model_platform: 'tensorflow'
        model_version_policy: {
          all {}
        }
      }
    }
  monitoring.config: |
    prometheus_config {
      enable: true,
      path: "/monitoring/prometheus/metrics"
    }
Canary Deployments for AI Models
Implement safe model rollouts using Argo Rollouts with automated analysis (Istio can be layered on top for finer-grained traffic splitting):
# canary-deployment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-inference-rollout
  namespace: ml-production
spec:
  replicas: 10
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      analysis:
        templates:
          - templateName: model-accuracy-analysis
        startingStep: 2
        args:
          - name: service-name
            value: ai-inference-canary
      steps:
        - setWeight: 10
        - pause: {duration: 30s}
        - setWeight: 20
        - pause: {duration: 30s}
        - analysis:
            templates:
              - templateName: model-performance-analysis
            args:
              - name: service-name
                value: ai-inference-canary
        - setWeight: 50
        - pause: {duration: 2m}
        - setWeight: 100
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
    spec:
      containers:
        - name: inference-container
          image: your-repo/ai-inference:v1.3.0
          # ... rest of container spec
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-accuracy-analysis
  namespace: ml-production
spec:
  args:
    - name: service-name
  metrics:
    - name: accuracy
      interval: 30s
      count: 10
      successCondition: result[0] >= 0.85
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local:80
          query: |
            model_accuracy{service="{{args.service-name}}"}
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.05
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local:80
          query: |
            rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]) /
            rate(http_requests_total{service="{{args.service-name}}"}[2m])
Edge AI Deployment
For edge computing scenarios, use lightweight deployments:
# edge-deployment.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edge-ai-inference
  namespace: edge-computing
spec:
  selector:
    matchLabels:
      app: edge-ai-inference
  template:
    metadata:
      labels:
        app: edge-ai-inference
    spec:
      nodeSelector:
        node-type: edge
      tolerations:
        - key: edge-node
          operator: Exists
      containers:
        - name: edge-inference
          image: your-repo/edge-ai-inference:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          env:
            - name: EDGE_MODE
              value: "true"
            - name: MODEL_COMPRESSION
              value: "quantized"
            - name: BATCH_SIZE
              value: "1" # Real-time inference
          volumeMounts:
            - name: edge-models
              mountPath: /app/models
            - name: local-cache
              mountPath: /tmp/cache
      volumes:
        - name: edge-models
          hostPath:
            path: /opt/edge-models
        - name: local-cache
          emptyDir:
            sizeLimit: 1Gi
Batch Processing Jobs
For training and batch inference workloads:
# batch-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-job-v1-3-0
  namespace: ml-training
spec:
  parallelism: 4
  completions: 1
  backoffLimit: 3
  ttlSecondsAfterFinished: 86400 # Clean up after 24 hours
  template:
    metadata:
      labels:
        job: model-training
        version: v1.3.0
    spec:
      restartPolicy: Never
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
        - name: trainer
          image: your-repo/model-trainer:latest
          command: ["python", "train.py"]
          args:
            - "--epochs=100"
            - "--batch-size=64"
            - "--learning-rate=0.001"
            - "--model-version=v1.3.0"
            - "--data-path=/data"
            - "--output-path=/models"
          resources:
            requests:
              nvidia.com/gpu: 2
              memory: "16Gi"
              cpu: "4000m"
            limits:
              nvidia.com/gpu: 2
              memory: "32Gi"
              cpu: "8000m"
          volumeMounts:
            - name: training-data
              mountPath: /data
              readOnly: true
            - name: model-output
              mountPath: /models
            - name: checkpoints
              mountPath: /checkpoints
          env:
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: wandb-secret
                  key: api-key
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: model-output
          persistentVolumeClaim:
            claimName: models-pvc
        - name: checkpoints
          emptyDir:
            sizeLimit: 50Gi
---
# Batch inference job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: batch-inference-job
  namespace: ml-production
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      parallelism: 10
      completions: 10
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: batch-processor
              image: your-repo/batch-inference:latest
              resources:
                requests:
                  nvidia.com/gpu: 1
                  memory: "8Gi"
                  cpu: "2000m"
                limits:
                  # GPUs are extended resources, so limits must be set (and equal to requests)
                  nvidia.com/gpu: 1
                  memory: "8Gi"
                  cpu: "2000m"
              env:
                - name: BATCH_SIZE
                  value: "1000"
                - name: INPUT_PATH
                  value: "s3://ml-data/batch-input/"
                - name: OUTPUT_PATH
                  value: "s3://ml-results/batch-output/"
Cost Optimization Strategies
Spot Instance Integration
# spot-node-pool.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spot-node-config
  namespace: kube-system
data:
  userdata: |
    #!/bin/bash
    /etc/eks/bootstrap.sh ml-cluster-spot \
      --container-runtime containerd \
      --kubelet-extra-args '--node-labels=spot=true,workload=ml-inference'
---
# Spot-tolerant deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-spot
  namespace: ml-production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: ai-inference-spot
  template:
    metadata:
      labels:
        app: ai-inference-spot
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        spot: "true"
      priorityClassName: spot-workload-priority
      containers:
        - name: inference-container
          # ... container spec
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - |
                    # Graceful shutdown
                    echo "Received SIGTERM, starting graceful shutdown..."
                    curl -X POST http://localhost:8080/shutdown
                    sleep 30
Resource Optimization
# vertical-pod-autoscaler.yaml
# Note: avoid running VPA in "Auto" mode alongside an HPA that scales on the same
# CPU/memory metrics; prefer "Off" (recommendation-only) mode in that case.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ai-inference-vpa
  namespace: ml-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: inference-container
        minAllowed:
          cpu: 500m
          memory: 2Gi
        maxAllowed:
          cpu: 4000m
          memory: 16Gi
        controlledResources: ["cpu", "memory"]
Security Hardening
Pod Security Standards
# pod-security-policy.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ml-production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Note: PodSecurityPolicy was removed in Kubernetes 1.25; on newer clusters rely on the
# Pod Security Standards labels above (or a policy engine such as Kyverno/Gatekeeper).
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: ai-inference-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
Service Mesh Security
# istio-security.yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: ai-inference-mtls
  namespace: ml-production
spec:
  selector:
    matchLabels:
      app: ai-inference
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ai-inference-authz
  namespace: ml-production
spec:
  selector:
    matchLabels:
      app: ai-inference
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway"]
        - source:
            namespaces: ["monitoring"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/predict", "/health", "/metrics"]
Disaster Recovery and Business Continuity
Multi-Region Deployment
# multi-region-deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-inference-global
  annotations:
    external-dns.alpha.kubernetes.io/hostname: ai-api.yourcompany.com
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: ai-inference
  ports:
    - port: 80
      targetPort: 8080
---
# Cross-region data replication
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-sync-job
spec:
  schedule: "*/30 * * * *" # Every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sync-models
              image: amazon/aws-cli:latest
              command:
                - sh
                - -c
                - |
                  # Sync models between regions
                  aws s3 sync s3://ml-models-us-west-2/ s3://ml-models-us-east-1/ --delete
                  aws s3 sync s3://ml-models-us-west-2/ s3://ml-models-eu-west-1/ --delete
          restartPolicy: OnFailure
Performance Benchmarking
Load Testing Configuration
# load-test-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ai-inference-load-test
  namespace: ml-testing
spec:
  parallelism: 10
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: load-tester
          image: loadimpact/k6:latest
          command: ["k6", "run", "--vus=100", "--duration=5m", "/scripts/load-test.js"]
          volumeMounts:
            - name: test-scripts
              mountPath: /scripts
          env:
            - name: TARGET_URL
              value: "http://ai-inference-service.ml-production.svc.cluster.local"
      volumes:
        - name: test-scripts
          configMap:
            name: load-test-scripts
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: load-test-scripts
  namespace: ml-testing
data:
  load-test.js: |
    import http from 'k6/http';
    import { check, sleep } from 'k6';

    export let options = {
      vus: 100,
      duration: '5m',
      thresholds: {
        http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
        http_req_failed: ['rate<0.05'],    // Error rate under 5%
      },
    };

    export default function() {
      const payload = JSON.stringify({
        text: "This is a sample text for sentiment analysis",
        model_version: "v1.2.0"
      });
      const params = {
        headers: {
          'Content-Type': 'application/json',
        },
      };
      let response = http.post(`${__ENV.TARGET_URL}/predict`, payload, params);
      check(response, {
        'status is 200': (r) => r.status === 200,
        'response time < 2000ms': (r) => r.timings.duration < 2000,
        'accuracy > 0.8': (r) => JSON.parse(r.body).confidence > 0.8,
      });
      sleep(Math.random() * 2);
    }
Conclusion
Deploying resilient AI applications on Kubernetes is a complex but rewarding endeavor. By following the patterns and practices outlined in this guide, you can build:
🎯 Production-Ready AI Systems
- Scalable Infrastructure: Auto-scaling based on demand and custom metrics
- High Availability: Multi-zone deployments with disaster recovery
- Performance Optimization: GPU utilization and efficient resource management
- Security Hardening: Network policies, mTLS, and Pod Security Standards
📊 Operational Excellence
- Comprehensive Monitoring: Metrics, logging, and alerting for all components
- Automated Operations: CI/CD pipelines and GitOps workflows
- Cost Optimization: Spot instances and right-sizing strategies
- Testing & Validation: Load testing and canary deployments
🔮 Future-Proof Architecture
- Multi-Model Serving: Support for various AI/ML frameworks
- Edge Computing: Lightweight deployments for edge scenarios
- Batch Processing: Training and batch inference capabilities
- Multi-Region: Global deployment and data replication
The key to success lies in starting simple, measuring everything, and iteratively improving your deployment based on real-world usage patterns and requirements.
🚀 Next Steps
- Start Small: Begin with a single model deployment and gradually add complexity
- Monitor Everything: Implement comprehensive observability from day one
- Automate Relentlessly: Build CI/CD pipelines and automated testing
- Plan for Scale: Design for growth from the beginning
- Stay Secure: Implement security best practices throughout the stack
With this foundation, you're well-equipped to deploy and operate resilient AI applications that can handle production workloads at scale while maintaining high availability, security, and performance.
Need help implementing AI on Kubernetes? Our team specializes in cloud-native AI deployments and can help you build production-ready systems. Contact us for a consultation on your AI infrastructure needs.