This guide demonstrates how to use Sentinel’s traffic mirroring (shadow traffic) feature for safe canary deployments and testing.
Overview
Traffic mirroring duplicates live requests to a shadow upstream for testing purposes, while clients receive responses from the primary upstream. This enables:
- Safe canary deployments - Test new versions with real traffic without user impact
- Performance testing - Validate new infrastructure under production load
- Debug/replay - Capture and test specific request patterns
- Data collection - Gather metrics from shadow deployments
Architecture
┌─────────┐
│ Client  │
└────┬────┘
     │ Request
     ▼
┌────────────────┐
│    Sentinel    │
└────┬───────┬───┘
     │       │
     │       └─────► Shadow Request (async, fire-and-forget)
     │               │
     │               ▼
     │       ┌──────────────┐
     │       │    Canary    │
     │       │   Upstream   │
     │       └──────────────┘
     │
     │ Primary Request
     ▼
┌──────────────┐
│  Production  │
│   Upstream   │
└──────────────┘
       │
       │ Response
       ▼
  ┌─────────┐
  │ Client  │
  └─────────┘
Key points:
- Shadow requests are fire-and-forget (non-blocking)
- Client receives response only from primary upstream
- Shadow failures don’t affect client response
- Near-zero latency impact on the primary request (body buffering, when enabled, is the exception)
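The key points above can be sketched conceptually in a few lines of Python. This is an illustrative model of the mirroring decision and fire-and-forget dispatch, not Sentinel's actual implementation; the function names are invented:

```python
import random
import threading
from typing import Optional, Tuple


def should_mirror(percentage: float, headers: dict,
                  sample_header: Optional[Tuple[str, str]] = None) -> bool:
    """Decide whether a request is mirrored to the shadow upstream."""
    # An optional header condition (like sample-header) must match first.
    if sample_header is not None:
        name, value = sample_header
        if headers.get(name) != value:
            return False
    # Then sample: random() is in [0, 1), so percentage 100 always mirrors
    # and percentage 0 never does.
    return random.random() * 100.0 < percentage


def mirror(send_fn, request) -> None:
    """Fire-and-forget: shadow errors are swallowed, never sent to the client."""
    def run():
        try:
            send_fn(request)
        except Exception:
            pass  # a real proxy would surface this via metrics, not the response
    threading.Thread(target=run, daemon=True).start()
```

The primary request proceeds without waiting on the shadow thread, which is why shadow failures and shadow latency stay invisible to the client.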
Quick Start
1. Create Configuration
Create shadow-test.kdl:
schema-version "1.0"

system {
    worker-threads 2
    max-connections 1000
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
    }
}

upstreams {
    upstream "production" {
        target "127.0.0.1:9001" weight=100
        health-check {
            path "/health"
            interval-secs 10
        }
    }

    upstream "canary" {
        target "127.0.0.1:9002" weight=100
        health-check {
            path "/health"
            interval-secs 10
        }
    }
}

routes {
    route "api" {
        matches {
            path-prefix "/api/"
        }
        upstream "production"

        // Mirror 100% of traffic to canary
        shadow {
            upstream "canary"
            percentage 100.0
            timeout-ms 5000
        }
    }
}
2. Start Upstreams
For testing, you can use simple HTTP servers or Docker containers:
docker-compose.yml:
version: '3.8'
services:
  production:
    image: nginx:alpine
    ports:
      - "9001:80"
    volumes:
      - ./nginx-production.conf:/etc/nginx/conf.d/default.conf
  canary:
    image: nginx:alpine
    ports:
      - "9002:80"
    volumes:
      - ./nginx-canary.conf:/etc/nginx/conf.d/default.conf
nginx-production.conf:
server {
    listen 80;

    location /health {
        return 200 '{"status":"healthy","upstream":"production"}\n';
        add_header Content-Type application/json;
    }

    location /api/ {
        return 200 '{"message":"Production upstream","path":"$request_uri"}\n';
        add_header Content-Type application/json;
        add_header X-Upstream-Name production;
    }
}
nginx-canary.conf:
server {
    listen 80;

    location /health {
        return 200 '{"status":"healthy","upstream":"canary"}\n';
        add_header Content-Type application/json;
    }

    location /api/ {
        return 200 '{"message":"Canary upstream (v2.0)","path":"$request_uri"}\n';
        add_header Content-Type application/json;
        add_header X-Upstream-Name canary;
        add_header X-Version v2.0;
    }
}
3. Start Services
# Start upstreams
docker-compose up -d

# Start Sentinel with the shadow config (binary name and flags may differ in your installation)
sentinel --config shadow-test.kdl
4. Test Traffic Mirroring
# Make a request
curl http://localhost:8080/api/users
# Response from production:
# {"message":"Production upstream","path":"/api/users"}
# The same request was also sent to canary (asynchronously)
# but the canary response was not returned to the client
5. Monitor Metrics
# Check shadow metrics
curl -s http://localhost:9090/metrics | grep shadow
# Example output:
# shadow_requests_total{route="api",upstream="canary",result="success"} 1
# shadow_latency_seconds_bucket{route="api",upstream="canary",le="0.1"} 1
Configuration Patterns
Pattern 1: Full Shadow (100% Mirrored)
Use for initial canary testing with comprehensive coverage:
route "api-full" {
    matches {
        path-prefix "/api/v1"
    }
    upstream "production"

    shadow {
        upstream "canary"
        percentage 100.0
        timeout-ms 5000
    }
}
When to use:
- Initial canary deployment
- Validating stability before production rollout
- Short-term testing with small traffic volumes
Pattern 2: Partial Shadow (Sampled)
Use for gradual rollout with lower shadow load:
route "api-sampled" {
    matches {
        path-prefix "/api/v2"
    }
    upstream "production"

    shadow {
        upstream "canary"
        percentage 10.0 // Mirror 10% of requests
        timeout-ms 5000
    }
}
When to use:
- High-traffic APIs where 100% would overload shadow
- Long-running canary deployments
- Representative sampling for metrics collection
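Since a sampled shadow forwards only the configured fraction of traffic, sizing the shadow upstream is simple arithmetic. A quick capacity-planning helper (illustrative only, not part of Sentinel):

```python
def shadow_rps(primary_rps: float, percentage: float) -> float:
    """Expected request rate at the shadow upstream for a sampled mirror."""
    return primary_rps * (percentage / 100.0)

# With the route above: a primary route serving 5000 rps at percentage 10.0
# sends an expected ~500 rps to the canary.
```

This is an expectation over random sampling, so short-term shadow load will fluctuate around this value.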
Pattern 3: Header-Based Shadow
Use for targeted testing with specific requests:
route "api-debug" {
    matches {
        path-prefix "/api/v3"
    }
    upstream "production"

    shadow {
        upstream "canary"
        percentage 100.0
        sample-header "X-Debug-Shadow" "true"
        timeout-ms 5000
    }
}
When to use:
- Developer/QA testing
- Beta user testing
- Debugging specific user flows
- Testing with internal traffic only
Example usage:
# Without header - NOT mirrored
curl http://localhost:8080/api/v3/example

# With header - mirrored to canary
curl -H "X-Debug-Shadow: true" http://localhost:8080/api/v3/example
Pattern 4: Multi-Environment Shadow
Shadow to staging for internal testing:
system {
    worker-threads 0
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
    }
}

upstreams {
    upstream "production" { /* ... */ }
    upstream "canary" { /* ... */ }
    upstream "staging" { /* ... */ }
}

routes {
    // External traffic: production → canary (10% sample)
    route "public-api" {
        matches {
            path-prefix "/api/"
            header name="X-Internal-Test" invert=#true
        }
        upstream "production"

        shadow {
            upstream "canary"
            percentage 10.0
        }
    }

    // Internal traffic: production → staging (100%)
    route "internal-api" {
        matches {
            path-prefix "/api/"
            header name="X-Internal-Test" value="enabled"
        }
        upstream "production"

        shadow {
            upstream "staging"
            percentage 100.0
            timeout-ms 3000
        }
    }
}
Pattern 5: POST/PUT with Body Buffering
Mirror requests with body inspection:
route "api-with-body" {
    matches {
        path "/api/users"
        method "POST" "PUT"
    }
    upstream "production"

    shadow {
        upstream "canary"
        percentage 100.0
        buffer-body #true
        max-body-bytes 1048576 // 1MB limit
        timeout-ms 5000
    }
}
⚠️ Warning: Body buffering increases memory usage and adds latency. Use only when necessary and enforce strict size limits.
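The warning follows from how mirroring a body works: the whole body must be held in memory before a second copy can be sent to the shadow upstream. A sketch of the size-limited buffering this implies (illustrative, assuming a chunked body stream; not Sentinel's implementation):

```python
from typing import Iterable, Optional


def buffer_body(chunks: Iterable[bytes], max_bytes: int) -> Optional[bytes]:
    """Accumulate a request body up to max_bytes.

    Returns None once the limit would be exceeded: the request still goes to
    the primary upstream, but the shadow copy is skipped rather than letting
    memory grow without bound.
    """
    buf = bytearray()
    for chunk in chunks:
        if len(buf) + len(chunk) > max_bytes:
            return None  # over limit: skip the shadow, protect memory
        buf.extend(chunk)
    return bytes(buf)
```

This is why `max-body-bytes` should be set strictly: every mirrored POST/PUT holds up to that many bytes in memory for the duration of the request.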
Monitoring and Observability
Prometheus Metrics
Sentinel exposes the following metrics for shadow traffic:
# Total shadow requests (labels: route, upstream, result)
shadow_requests_total{route="api",upstream="canary",result="success"} 1234
shadow_requests_total{route="api",upstream="canary",result="error"} 5
# Shadow errors by type (labels: route, upstream, error_type)
shadow_errors_total{route="api",upstream="canary",error_type="timeout"} 3
shadow_errors_total{route="api",upstream="canary",error_type="connect_failed"} 2
# Shadow latency histogram (labels: route, upstream)
shadow_latency_seconds_bucket{route="api",upstream="canary",le="0.05"} 800
shadow_latency_seconds_bucket{route="api",upstream="canary",le="0.1"} 980
shadow_latency_seconds_bucket{route="api",upstream="canary",le="0.5"} 1200
shadow_latency_seconds_bucket{route="api",upstream="canary",le="1.0"} 1230
shadow_latency_seconds_sum{route="api",upstream="canary"} 98.5
shadow_latency_seconds_count{route="api",upstream="canary"} 1234
Example Queries
Shadow error rate:
rate(shadow_errors_total[5m]) / rate(shadow_requests_total[5m])
Shadow success rate:
rate(shadow_requests_total{result="success"}[5m]) / rate(shadow_requests_total[5m])
Shadow p99 latency:
histogram_quantile(0.99, rate(shadow_latency_seconds_bucket[5m]))
Alerting
Set up alerts for shadow failures:
# shadow-alerts.yml
groups:
  - name: shadow_traffic
    rules:
      - alert: HighShadowErrorRate
        expr: |
          rate(shadow_errors_total[5m]) / rate(shadow_requests_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High shadow error rate ({{ $value | humanizePercentage }})"
          description: "Shadow upstream {{ $labels.upstream }} has >10% error rate"

      - alert: ShadowTimeoutRate
        expr: |
          rate(shadow_errors_total{error_type="timeout"}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Shadow timeouts detected"
          description: "Shadow upstream {{ $labels.upstream }} experiencing timeouts"
Testing Scenarios
Scenario 1: Canary Deployment Validation
Goal: Validate a new service version before promoting to production.
Setup:
route "api" {
    upstream "production" // v1.0

    shadow {
        upstream "canary" // v2.0
        percentage 10.0 // start at 10%, then ramp up per the test plan
    }
}
Test plan:
- Deploy canary v2.0
- Enable shadow at 10%
- Monitor canary metrics (errors, latency, logs)
- Gradually increase to 50%, then 100%
- Compare canary vs production metrics
- If stable, promote canary to production
Scenario 2: Performance Testing
Goal: Validate infrastructure can handle production load.
Setup:
route "api" {
    upstream "current-infra"

    shadow {
        upstream "new-infra"
        percentage 100.0
    }
}
Metrics to compare:
- Request latency (p50, p95, p99)
- Error rates
- Resource usage (CPU, memory, connections)
- Database query performance
Scenario 3: API Refactoring Validation
Goal: Ensure refactored API produces same responses.
Setup:
route "api-v1" {
    upstream "legacy-api"

    shadow {
        upstream "refactored-api"
        percentage 100.0
        buffer-body #true
        max-body-bytes 1048576
    }
}
Validation approach:
- Enable shadow to refactored API
- Log responses from both upstreams
- Compare response bodies for discrepancies
- Identify and fix differences
- Switch traffic to refactored API when validated
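For the compare step, a recursive JSON diff is usually enough to flag discrepancies between the two upstreams' logged bodies. A minimal sketch (a hypothetical offline helper, not a Sentinel feature):

```python
import json
from typing import List


def diff_responses(body_a: str, body_b: str) -> List[str]:
    """Return JSON paths where two response bodies disagree (empty = identical)."""
    def walk(a, b, path):
        diffs = []
        if type(a) is not type(b):
            diffs.append(f"{path}: type {type(a).__name__} != {type(b).__name__}")
        elif isinstance(a, dict):
            for key in sorted(set(a) | set(b)):
                if key not in a or key not in b:
                    diffs.append(f"{path}.{key}: present in only one response")
                else:
                    diffs.extend(walk(a[key], b[key], f"{path}.{key}"))
        elif isinstance(a, list):
            if len(a) != len(b):
                diffs.append(f"{path}: list length {len(a)} != {len(b)}")
            else:
                for i, (x, y) in enumerate(zip(a, b)):
                    diffs.extend(walk(x, y, f"{path}[{i}]"))
        elif a != b:
            diffs.append(f"{path}: {a!r} != {b!r}")
        return diffs

    return walk(json.loads(body_a), json.loads(body_b), "$")
```

In practice you would also normalize fields that legitimately differ between upstreams (timestamps, request IDs) before diffing, so only real behavioral discrepancies surface.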
Best Practices
1. Start Small
Begin with low sampling percentages:
shadow {
    upstream "canary"
    percentage 1.0 // Start with 1%
}
Gradually increase after validating stability.
2. Configure Appropriate Timeouts
Shadow timeouts should be shorter than primary timeouts:
route "api" {
    policies {
        timeout-secs 30 // Primary timeout
    }

    shadow {
        upstream "canary"
        timeout-ms 20000 // 20s shadow timeout (shorter)
    }
}
3. Monitor Shadow Health
Don’t deploy blindly - monitor shadow metrics:
# Check shadow success rate
curl -s http://localhost:9090/metrics | grep 'shadow_requests_total.*result="success"'

# Check shadow error rate
curl -s http://localhost:9090/metrics | grep shadow_errors_total
4. Use Header-Based Filtering
For controlled testing:
shadow {
    upstream "canary"
    sample-header "X-User-Tier" "beta" // Only beta users
}
5. Body Buffering Hygiene
Only buffer when necessary:
shadow {
    upstream "canary"
    buffer-body #true
    max-body-bytes 524288 // 512KB limit (strict)
}
Avoid buffering for:
- File uploads
- Streaming APIs
- High-throughput endpoints
6. Security and Compliance
For sensitive data:
// Exclude PII-heavy endpoints
route "user-data" {
    matches {
        path-prefix "/api/users/"
    }
    upstream "production"
    // NO shadow block - don't mirror PII
}

// Mirror only non-sensitive endpoints
route "public-data" {
    matches {
        path-prefix "/api/public/"
    }
    upstream "production"

    shadow {
        upstream "canary"
        percentage 10.0
    }
}
Troubleshooting
Shadow Requests Not Sent
Check:
- Shadow upstream health: curl http://localhost:9090/metrics | grep upstream_health
- Sampling percentage: ensure it is greater than 0
- Header conditions: verify sample-header matches requests
- Metrics: shadow_requests_total should be incrementing
High Shadow Error Rate
Check:
- Shadow upstream logs for errors
- Network connectivity: Can Sentinel reach shadow upstream?
- Timeout settings: Are shadow timeouts too short?
- Resource limits: Is shadow upstream under-provisioned?
Memory Issues
Check:
- Body buffering: is buffer-body enabled unnecessarily?
- max-body-bytes: reduce the limit
- Sampling: reduce percentage to lower shadow load
Complete Example
Full configuration with production best practices:
schema-version "1.0"

system {
    worker-threads 4
    max-connections 10000
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
        request-timeout-secs 60
    }
}

upstreams {
    upstream "production" {
        target "prod-api-1.internal:8000" weight=100
        target "prod-api-2.internal:8000" weight=100
        load-balancing "round-robin"
        health-check {
            path "/health"
            interval-secs 10
            timeout-secs 2
            healthy-threshold 2
            unhealthy-threshold 3
        }
    }

    upstream "canary" {
        target "canary-api-1.internal:8000" weight=100
        health-check {
            path "/health"
            interval-secs 10
            timeout-secs 2
        }
    }
}

routes {
    // Health check (no shadow)
    route "health" {
        priority 1000
        matches {
            path "/health"
        }
        service-type "builtin"
        builtin-handler "health"
    }

    // API v2 with gradual canary rollout
    route "api-v2" {
        priority 200
        matches {
            path-prefix "/api/v2/"
            method "GET" "POST" "PUT" "DELETE"
        }
        upstream "production"

        // Shadow 10% to canary
        shadow {
            upstream "canary"
            percentage 10.0
            timeout-ms 25000
            buffer-body #false
        }

        filters "auth" "rate-limit"

        retry-policy {
            max-attempts 3
            retryable-status-codes 502 503 504
        }

        policies {
            timeout-secs 30
            max-body-size "10MB"
        }
    }

    // Beta users - 100% shadow
    route "api-beta" {
        priority 250
        matches {
            path-prefix "/api/v2/"
            header name="X-User-Tier" value="beta"
        }
        upstream "production"

        shadow {
            upstream "canary"
            percentage 100.0
            sample-header "X-Enable-Shadow" "true"
            timeout-ms 25000
        }
    }
}

observability {
    metrics {
        enabled #true
        port 9090
    }
    logging {
        level "info"
        format "json"
    }
}
Next Steps
- Routes Configuration - Detailed route configuration reference
- Upstreams Configuration - Upstream pools and health checks
- Observability - Metrics and logging setup
- Prometheus Example - Metrics collection and visualization