Procedures for responding to production incidents affecting Sentinel.
Incident Classification
Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 | Complete outage, all traffic affected | Immediate (< 5 min) | Proxy down, all upstreams unreachable |
| SEV2 | Partial outage, significant traffic affected | < 15 min | Multiple routes failing, > 10% error rate |
| SEV3 | Degraded performance, limited impact | < 1 hour | Elevated latency, single upstream unhealthy |
| SEV4 | Minor issue, minimal user impact | < 4 hours | Non-critical feature degraded |
Escalation Matrix
| Severity | Primary | Secondary | Management |
|---|---|---|---|
| SEV1 | On-call engineer | Team lead | Director (if > 30 min) |
| SEV2 | On-call engineer | Team lead | - |
| SEV3 | On-call engineer | - | - |
| SEV4 | Next business day | - | - |
Initial Response
First 5 Minutes Checklist
- Acknowledge incident in alerting system
- Assess severity using classification above
- Open incident channel
- Declare incident commander if SEV1/SEV2
- Begin gathering initial diagnostics
Quick Diagnostic Commands
Exact ports, endpoints, and service names vary by deployment; the commands below assume an admin/metrics endpoint on `localhost:9090` and a systemd unit named `sentinel`.

```bash
# Check proxy health
curl -fsS http://localhost:9090/health && echo "healthy" || echo "UNHEALTHY"

# Check ready status
curl -fsS http://localhost:9090/ready && echo "ready" || echo "NOT READY"

# Get current error rate (last 5 min)
# (counter snapshot - use a rate() query in your metrics backend for a true 5-minute rate)
curl -s http://localhost:9090/metrics | grep sentinel_requests_total | grep '5xx'

# Check process status
systemctl status sentinel

# Check recent logs for errors
journalctl -u sentinel --since "15 minutes ago" | grep -iE 'error|panic'

# Check upstream health
curl -s http://localhost:9090/metrics | grep sentinel_upstream_health

# Check circuit breakers
curl -s http://localhost:9090/metrics | grep sentinel_circuit_breaker_state
```
Initial Triage Decision Tree
```
Is the proxy process running?
├─ NO → Go to: Process Crash Procedure
└─ YES
   └─ Is the health endpoint responding?
      ├─ NO → Go to: Health Check Failure Procedure
      └─ YES
         └─ Are all upstreams healthy?
            ├─ NO → Go to: Upstream Failure Procedure
            └─ YES
               └─ Is error rate elevated?
                  ├─ YES → Go to: High Error Rate Procedure
                  └─ NO → Go to: Performance Degradation Procedure
```
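This tree can be rough-scripted. A minimal sketch, assuming a process named `sentinel`, an admin endpoint on `localhost:9090`, and the metric names from the Quick Reference below; a real "elevated error rate" check needs a rate query against your metrics backend:

```bash
#!/usr/bin/env bash
# triage.sh - rough automation of the decision tree above (assumptions noted in comments).
ADMIN="http://localhost:9090"

# Is the proxy process running?
if ! pgrep -x sentinel >/dev/null; then
  echo "Process not running -> Process Crash Procedure"; exit 1
fi

# Is the health endpoint responding?
if ! curl -fsS --max-time 2 "$ADMIN/health" >/dev/null; then
  echo "Health endpoint not responding -> Health Check Failure Procedure"; exit 1
fi

# Are all upstreams healthy? (any sentinel_upstream_health series reporting 0)
if curl -s "$ADMIN/metrics" | awk '/^sentinel_upstream_health/ && $NF == 0 {found=1} END {exit !found}'; then
  echo "Unhealthy upstream(s) -> Upstream Failure Procedure"; exit 1
fi

# Is error rate elevated? (presence check only - use a proper rate query to confirm)
if curl -s "$ADMIN/metrics" | grep -q 'sentinel_requests_total{status="5xx"}'; then
  echo "5xx errors recorded -> High Error Rate Procedure"; exit 1
fi

echo "No obvious fault -> Performance Degradation Procedure"
```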
Incident Procedures
Process Crash
Symptoms: Sentinel process not running, connections refused
Immediate Actions:
```bash
# 1. Attempt restart (assumes a systemd unit named sentinel)
sudo systemctl restart sentinel

# 2. Check if it stays up
sleep 10 && systemctl is-active sentinel

# 3. If still failing, check logs for crash reason
journalctl -u sentinel --since "10 minutes ago" | grep -iE 'panic|fatal|abort'

# 4. Check for resource exhaustion
df -h
dmesg | grep -i oom | tail -20
```
Common Causes & Fixes:
| Cause | Diagnostic | Fix |
|---|---|---|
| OOM killed | `dmesg \| grep oom` shows sentinel | Increase memory limits |
| Config error | Logs show parse/validation error | Restore previous config |
| Disk full | df -h shows 100% | Clear logs, increase disk |
| Port conflict | Logs show “address in use” | Kill conflicting process |
| Certificate expired | TLS handshake errors | Renew certificates |
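If the cause was an OOM kill and Sentinel runs under systemd (an assumption; adapt to your process manager), a drop-in override is one way to raise the memory ceiling. The `4G` value is illustrative:

```bash
# Raise the memory ceiling with a systemd drop-in (value is illustrative)
sudo mkdir -p /etc/systemd/system/sentinel.service.d
printf '[Service]\nMemoryMax=4G\n' | sudo tee /etc/systemd/system/sentinel.service.d/memory.conf
sudo systemctl daemon-reload
sudo systemctl restart sentinel
```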
Rollback:
```bash
# Restore last known good config (path is an example - use your backup location)
sudo cp /etc/sentinel/sentinel.conf.bak /etc/sentinel/sentinel.conf
sudo systemctl restart sentinel
```
Upstream Failure
Symptoms: Specific routes returning 502/503, upstream health metrics showing 0
Immediate Actions:
```bash
# 1. Identify unhealthy upstreams (assumes metrics on :9090)
curl -s http://localhost:9090/metrics | grep sentinel_upstream_health | grep ' 0$'

# 2. Check upstream connectivity from proxy host (substitute the failing target)
curl -sv --max-time 5 http://<upstream-host>:<port>/ -o /dev/null

# 3. Check DNS resolution
dig +short <upstream-host>

# 4. Check network path
traceroute -n <upstream-host>
```
Mitigation Options:
- Remove unhealthy targets temporarily: edit the config to comment out the target and reload
- Adjust health check thresholds: increase `unhealthy-threshold` or `timeout-secs`
- Enable a failover upstream: add `fallback-upstream` to the route (see the config sketch below)
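A sketch of the last two options in the same block style as the `limits` example later in this runbook. Only `unhealthy-threshold`, `timeout-secs`, and `fallback-upstream` are the settings named above; the surrounding `upstream`/`route` block shapes and values are illustrative assumptions:

```
// Illustrative sketch - block shapes and values are assumptions; only the
// unhealthy-threshold / timeout-secs / fallback-upstream keys come from this runbook
upstream api-backup {
    target 10.0.2.15:8080
}

upstream api {
    target 10.0.1.10:8080
    health-check {
        unhealthy-threshold 5    // tolerate more failed probes before marking the target down
        timeout-secs 10          // allow slower health-check responses during the incident
    }
}

route /api {
    upstream api
    fallback-upstream api-backup // serve from the backup while api is unhealthy
}
```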
High Error Rate
Symptoms: > 1% 5xx error rate, elevated latency
Immediate Actions:
```bash
# 1. Identify error distribution by route (per-route labels on the 5xx counter)
curl -s http://localhost:9090/metrics | grep sentinel_requests_total | grep '5xx'

# 2. Check for specific error types (assumes space-delimited status codes in the access log)
journalctl -u sentinel --since "5 minutes ago" \
  | grep -oE ' 5[0-9]{2} ' | sort | uniq -c | sort -rn

# 3. Check upstream latency
curl -s http://localhost:9090/metrics | grep sentinel_request_duration_seconds
```
Error Type Actions:
| Error | Cause | Action |
|---|---|---|
| 502 Bad Gateway | Upstream returning invalid response | Check upstream application logs |
| 503 Service Unavailable | All targets unhealthy or circuit breaker open | Follow Upstream Failure procedure |
| 504 Gateway Timeout | Upstream not responding in time | Increase timeout temporarily |
| 500 Internal Server Error | Proxy internal error | Check proxy logs, restart if persistent |
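For the 504 case, a temporary timeout bump might look like the sketch below. The `upstream-timeout-secs` key and the route block shape are purely illustrative assumptions; confirm the real setting name in the configuration reference before applying, reload with `kill -HUP $(cat /var/run/sentinel.pid)`, and revert once the upstream recovers.

```
// Illustrative only - key and block names are assumptions
route /api {
    upstream api
    upstream-timeout-secs 30   // temporarily raised while the upstream recovers; revert afterwards
}
```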
Memory Exhaustion
Symptoms: High memory usage, slow responses, potential OOM
Immediate Actions:
```bash
# 1. Check current memory usage of the proxy process
ps aux | grep '[s]entinel'

# 2. Check connection count
curl -s http://localhost:9090/metrics | grep sentinel_open_connections

# 3. Check request queue depth (queue metric name is not listed in this runbook - grep broadly)
curl -s http://localhost:9090/metrics | grep -i queue
```
Mitigation:
```
// Reduce connection limits immediately
limits {
    max-connections 5000
    max-connections-per-client 50
}
```

Then reload with `kill -HUP $(cat /var/run/sentinel.pid)`.
TLS/Certificate Issues
Symptoms: TLS handshake failures, certificate errors in logs
Diagnostic Commands:
```bash
# Check certificate expiration (paths are examples - use your configured cert paths)
openssl x509 -enddate -noout -in /etc/sentinel/certs/server.crt

# Verify certificate chain
openssl verify -CAfile /etc/sentinel/certs/ca.crt /etc/sentinel/certs/server.crt

# Check certificate matches key (the two digests must be identical)
openssl x509 -noout -modulus -in /etc/sentinel/certs/server.crt | openssl md5
openssl rsa -noout -modulus -in /etc/sentinel/certs/server.key | openssl md5

# Test TLS connection
openssl s_client -connect localhost:443 -servername your.domain.example </dev/null
```
Certificate Renewal:
```bash
# Deploy new certificate (paths are examples)
sudo cp new-server.crt /etc/sentinel/certs/server.crt
sudo cp new-server.key /etc/sentinel/certs/server.key

# Reload (zero-downtime)
kill -HUP $(cat /var/run/sentinel.pid)
```
DDoS/Attack Response
Symptoms: Massive traffic spike, resource exhaustion
Immediate Actions:
```bash
# Identify top client IPs (log field names are assumptions - adapt to your access-log format)
journalctl -u sentinel --since "5 minutes ago" \
  | grep -oE 'client=[0-9a-fA-F.:]+' | sort | uniq -c | sort -rn | head -20

# Check for attack patterns (repeated paths, user agents, etc.)
journalctl -u sentinel --since "5 minutes ago" \
  | grep -oE 'path=[^ ]+' | sort | uniq -c | sort -rn | head -20
```
Mitigation:
- Enable aggressive rate limiting in the config (see the sketch below)
- Block attacking IPs via the firewall: `iptables -A INPUT -s $ATTACKER_IP -j DROP`
- Reduce resource limits to preserve availability
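A sketch of the rate-limiting option in the same config style as the `limits` block above. The `rate-limit` block and its keys are assumptions (check the configuration reference); the values are deliberately aggressive for incident mitigation. Reload with `kill -HUP $(cat /var/run/sentinel.pid)` after editing.

```
// Illustrative only - the rate-limit block and key names are assumptions
rate-limit {
    requests-per-second 50   // per-client ceiling during the attack
    burst 100
    status-code 429          // reject excess requests explicitly
}

limits {
    max-connections 2000          // tightened from the normal ceiling
    max-connections-per-client 10
}
```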
Post-Incident
Immediate Actions (< 1 hour after resolution)
- Update status page to “Resolved”
- Send all-clear communication
- Document timeline in incident channel
- Preserve logs and metrics snapshots
- Schedule post-mortem (SEV1/SEV2: within 48 hours)
Log Preservation
```bash
# Incident ID format is an example - follow your tracker's convention
INCIDENT_ID="INC-$(date +%Y%m%d)-001"
mkdir -p "/var/tmp/${INCIDENT_ID}"

# Save logs
journalctl -u sentinel --since "2 hours ago" > "/var/tmp/${INCIDENT_ID}/sentinel.log"

# Save metrics snapshot
curl -s http://localhost:9090/metrics > "/var/tmp/${INCIDENT_ID}/metrics.txt"

# Save config at time of incident
cp /etc/sentinel/sentinel.conf "/var/tmp/${INCIDENT_ID}/sentinel.conf"
```
Post-Mortem Template
- Summary: one paragraph on what happened and the customer impact
- Timeline: detection, escalation, mitigation, resolution (timestamps in UTC)
- Root cause
- What went well / what went poorly
- Action items, each with an owner and a due date
Quick Reference
Critical Commands
```bash
# Health check
curl -fsS http://localhost:9090/health

# Reload config
kill -HUP $(cat /var/run/sentinel.pid)

# Graceful restart (assumes a systemd unit named sentinel)
sudo systemctl restart sentinel

# View errors
journalctl -u sentinel --since "15 minutes ago" | grep -iE 'error|panic'

# Check upstreams
curl -s http://localhost:9090/metrics | grep sentinel_upstream_health
```
Key Metrics to Check First
- `sentinel_requests_total{status="5xx"}` - Error count
- `sentinel_upstream_health` - Upstream availability
- `sentinel_request_duration_seconds` - Latency
- `sentinel_open_connections` - Connection count
- `sentinel_circuit_breaker_state` - Circuit breaker status
See Also
- Troubleshooting - Common issue resolution
- Health Monitoring - Health checks and alerting
- Metrics Reference - Available metrics