Incident Response

Procedures for responding to production incidents affecting Sentinel.

Incident Classification

Severity Levels

| Severity | Description | Response Time | Examples |
|----------|-------------|---------------|----------|
| SEV1 | Complete outage, all traffic affected | Immediate (< 5 min) | Proxy down, all upstreams unreachable |
| SEV2 | Partial outage, significant traffic affected | < 15 min | Multiple routes failing, > 10% error rate |
| SEV3 | Degraded performance, limited impact | < 1 hour | Elevated latency, single upstream unhealthy |
| SEV4 | Minor issue, minimal user impact | < 4 hours | Non-critical feature degraded |

Escalation Matrix

| Severity | Primary | Secondary | Management |
|----------|---------|-----------|------------|
| SEV1 | On-call engineer | Team lead | Director (if > 30 min) |
| SEV2 | On-call engineer | Team lead | - |
| SEV3 | On-call engineer | - | - |
| SEV4 | Next business day | - | - |

Initial Response

First 5 Minutes Checklist

  • Acknowledge incident in alerting system
  • Assess severity using classification above
  • Open incident channel
  • Declare incident commander if SEV1/SEV2
  • Begin gathering initial diagnostics

Quick Diagnostic Commands

# Check proxy health
curl -sf http://localhost:8080/health && echo "OK" || echo "UNHEALTHY"

# Check ready status
curl -sf http://localhost:8080/ready && echo "READY" || echo "NOT READY"

# Count 5xx responses (cumulative counter - sample twice, a few minutes apart, to estimate the rate)
curl -s localhost:9090/metrics | \
    awk '/requests_total.*status="5[0-9][0-9]"/ {sum+=$2} END {print "5xx total:", sum+0}'

# Check process status
systemctl status sentinel

# Check recent logs for errors
journalctl -u sentinel --since "5 minutes ago" | grep -i error | tail -20

# Check upstream health
curl -s localhost:9090/metrics | grep upstream_health

# Check circuit breakers
curl -s localhost:9090/metrics | grep circuit_breaker_state

Initial Triage Decision Tree

Is the proxy process running?
├─ NO → Go to: Process Crash Procedure
└─ YES
    └─ Is the health endpoint responding?
        ├─ NO → Go to: Health Check Failure Procedure
        └─ YES
            └─ Are all upstreams healthy?
                ├─ NO → Go to: Upstream Failure Procedure
                └─ YES
                    └─ Is error rate elevated?
                        ├─ YES → Go to: High Error Rate Procedure
                        └─ NO → Go to: Performance Degradation Procedure
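
The first few branches can also be scripted for a quick first pass. A minimal sketch, assuming the same endpoints used in the diagnostics above (health on :8080, metrics on :9090) and the systemd unit name sentinel; adjust to your deployment:

#!/usr/bin/env bash
# First-pass walk of the triage tree above (sketch only)

if ! systemctl is-active --quiet sentinel; then
    echo "Process not running -> Process Crash procedure"; exit 1
fi

if ! curl -sf --max-time 2 http://localhost:8080/health > /dev/null; then
    echo "Health endpoint not responding -> Health Check Failure procedure"; exit 1
fi

# Any upstream_health series at 0 means at least one upstream is down
if curl -s localhost:9090/metrics | grep 'upstream_health{' | grep -q ' 0$'; then
    echo "Unhealthy upstream(s) -> Upstream Failure procedure"; exit 1
fi

# Cumulative 5xx counter - compare two samples a few minutes apart to judge the rate
errors=$(curl -s localhost:9090/metrics | \
    awk '/requests_total.*status="5[0-9][0-9]"/ {sum+=$2} END {print sum+0}')
echo "Process, health, and upstreams look OK; cumulative 5xx count: ${errors}"
echo "Elevated error rate -> High Error Rate procedure; otherwise -> Performance Degradation procedure"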

Incident Procedures

Process Crash

Symptoms: Sentinel process not running, connections refused

Immediate Actions:

# 1. Attempt restart
systemctl restart sentinel

# 2. Check if it stays up
sleep 5 && systemctl status sentinel

# 3. If still failing, check logs for crash reason
journalctl -u sentinel --since "10 minutes ago" | tail -100

# 4. Check for resource exhaustion
dmesg | grep -i "oom\|killed" | tail -10
free -h
df -h /var /tmp

Common Causes & Fixes:

| Cause | Diagnostic | Fix |
|-------|------------|-----|
| OOM killed | dmesg shows sentinel killed by the OOM killer | Increase memory limits |
| Config error | Logs show parse/validation error | Restore previous config |
| Disk full | df -h shows 100% | Clear logs, increase disk |
| Port conflict | Logs show "address in use" | Kill conflicting process |
| Certificate expired | TLS handshake errors | Renew certificates |
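
For the port-conflict case, a quick way to see which process already owns the listener; the ports assume the :8080 / :9090 listeners used elsewhere in this guide:

# Show which process is bound to the proxy's ports
ss -tlnp | grep -E ':(8080|9090) '

# Alternative if lsof is available
lsof -iTCP:8080 -sTCP:LISTEN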

Rollback:

# Restore last known good config
cp /etc/sentinel/config.kdl.backup /etc/sentinel/config.kdl
systemctl restart sentinel

Upstream Failure

Symptoms: Specific routes returning 502/503, upstream health metrics showing 0

Immediate Actions:

# 1. Identify unhealthy upstreams
curl -s localhost:9090/metrics | grep 'upstream_health{' | grep ' 0$'

# 2. Check upstream connectivity from proxy host
nc -zv <upstream_host> <upstream_port>

# 3. Check DNS resolution
dig +short backend.service.internal

# 4. Check network path
traceroute -n <upstream_ip>

Mitigation Options:

  1. Remove unhealthy targets temporarily - edit the config to comment out the target and reload
  2. Adjust health check thresholds - increase unhealthy-threshold or timeout-secs
  3. Enable failover upstream - add fallback-upstream to the route (options 2 and 3 are sketched below)
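
A config sketch for options 2 and 3, using the knob names above; the surrounding block layout (upstream / health-check / route) is illustrative, not taken from a reference config:

// Illustrative structure - verify block and property names against your config reference
upstream "backend" {
    health-check {
        unhealthy-threshold 5   // tolerate more failed probes before marking the target down
        timeout-secs 10         // allow slower health-check responses
    }
}

route "api" {
    upstream "backend"
    fallback-upstream "backend-dr"   // serve from the fallback while "backend" is unhealthy
}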

High Error Rate

Symptoms: > 1% 5xx error rate, elevated latency

Immediate Actions:

# 1. Identify error distribution by route
curl -s localhost:9090/metrics | grep 'requests_total.*status="5' | sort -t'"' -k4

# 2. Check for specific error types
journalctl -u sentinel --since "5 minutes ago" | \
    grep -oP 'error[^,]*' | sort | uniq -c | sort -rn | head

# 3. Check upstream latency
curl -s localhost:9090/metrics | grep 'upstream_request_duration.*quantile="0.99"'

Error Type Actions:

| Error | Cause | Action |
|-------|-------|--------|
| 502 Bad Gateway | Upstream returning invalid response | Check upstream application logs |
| 503 Service Unavailable | All targets unhealthy or circuit breaker open | Follow Upstream Failure procedure |
| 504 Gateway Timeout | Upstream not responding in time | Increase timeout temporarily |
| 500 Internal Server Error | Proxy internal error | Check proxy logs, restart if persistent |

Memory Exhaustion

Symptoms: High memory usage, slow responses, potential OOM

Immediate Actions:

# 1. Check current memory usage
curl -s localhost:9090/metrics | grep process_resident_memory_bytes

# 2. Check connection count
curl -s localhost:9090/metrics | grep open_connections

# 3. Check request queue depth
curl -s localhost:9090/metrics | grep pending_requests

Mitigation:

// Reduce connection limits immediately
limits {
    max-connections 5000
    max-connections-per-client 50
}

Then reload with kill -HUP $(cat /var/run/sentinel.pid).
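
After the reload, it can help to confirm that the connection count is actually falling under the new limit, using the same metric as above:

# Watch connection count drain after the limits change
watch -n 5 'curl -s localhost:9090/metrics | grep open_connections'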

TLS/Certificate Issues

Symptoms: TLS handshake failures, certificate errors in logs

Diagnostic Commands:

# Check certificate expiration
openssl x509 -in /etc/sentinel/certs/server.crt -noout -dates

# Verify certificate chain
openssl verify -CAfile /etc/sentinel/certs/ca.crt /etc/sentinel/certs/server.crt

# Check certificate matches key
diff <(openssl x509 -in /etc/sentinel/certs/server.crt -noout -modulus) \
     <(openssl rsa -in /etc/sentinel/certs/server.key -noout -modulus)

# Test TLS connection
openssl s_client -connect localhost:443 -servername your.domain.com </dev/null

Certificate Renewal:

# Deploy new certificate
cp /path/to/new/cert.crt /etc/sentinel/certs/server.crt
cp /path/to/new/key.key /etc/sentinel/certs/server.key
chmod 600 /etc/sentinel/certs/server.key

# Reload (zero-downtime)
kill -HUP $(cat /var/run/sentinel.pid)

DDoS/Attack Response

Symptoms: Massive traffic spike, resource exhaustion

Immediate Actions:

# Identify top client IPs
journalctl -u sentinel --since "5 minutes ago" | \
    grep -oP 'client_ip="\K[^"]*' | sort | uniq -c | sort -rn | head -20

# Check for attack patterns
journalctl -u sentinel --since "5 minutes ago" | \
    grep -oP 'path="[^"]*"' | sort | uniq -c | sort -rn | head -20

Mitigation:

  1. Enable aggressive rate limiting in config
  2. Block attacking IPs via firewall: iptables -A INPUT -s $ATTACKER_IP -j DROP (an ipset variant for many IPs is sketched below)
  3. Reduce resource limits to preserve availability
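
When more than a handful of addresses are involved, a single iptables rule backed by an ipset set keeps the rule table small. A sketch, assuming iptables and ipset are available on the proxy host (the set name is arbitrary):

# One-time setup: hash set with a 1-hour automatic expiry per entry
ipset create sentinel-block hash:ip timeout 3600 2>/dev/null || true
iptables -C INPUT -m set --match-set sentinel-block src -j DROP 2>/dev/null || \
    iptables -I INPUT -m set --match-set sentinel-block src -j DROP

# Add offenders as they are identified
ipset add sentinel-block $ATTACKER_IP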

Post-Incident

Immediate Actions (< 1 hour after resolution)

  • Update status page to “Resolved”
  • Send all-clear communication
  • Document timeline in incident channel
  • Preserve logs and metrics snapshots
  • Schedule post-mortem (SEV1/SEV2: within 48 hours)

Log Preservation

INCIDENT_ID="INC-$(date +%Y%m%d)-001"
mkdir -p /var/log/sentinel/incidents/$INCIDENT_ID

# Save logs
journalctl -u sentinel --since "1 hour ago" > \
    /var/log/sentinel/incidents/$INCIDENT_ID/sentinel.log

# Save metrics snapshot
curl -s localhost:9090/metrics > \
    /var/log/sentinel/incidents/$INCIDENT_ID/metrics.txt

# Save config at time of incident
cp /etc/sentinel/config.kdl /var/log/sentinel/incidents/$INCIDENT_ID/

Post-Mortem Template

# Incident Post-Mortem: [INCIDENT_ID]

## Summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Severity**: SEVN
- **Impact**: [Brief description of user impact]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | First alert triggered |
| HH:MM | Incident declared |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Full resolution |

## Root Cause
[Detailed explanation]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action item] | [Owner] | YYYY-MM-DD | Open |

## Lessons Learned
- What went well:
- What could be improved:

Quick Reference

Critical Commands

# Health check
curl -sf localhost:8080/health

# Reload config
kill -HUP $(cat /var/run/sentinel.pid)

# Graceful restart
systemctl restart sentinel

# View errors
journalctl -u sentinel | grep ERROR | tail -20

# Check upstreams
curl -s localhost:9090/metrics | grep upstream_health

Key Metrics to Check First

  1. sentinel_requests_total{status=~"5.."} - 5xx error count
  2. sentinel_upstream_health - Upstream availability
  3. sentinel_request_duration_seconds - Latency
  4. sentinel_open_connections - Connection count
  5. sentinel_circuit_breaker_state - Circuit breaker status
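
To snapshot all five in one pass (metric names as listed above):

curl -s localhost:9090/metrics | \
    grep -E 'sentinel_(requests_total|upstream_health|request_duration_seconds|open_connections|circuit_breaker_state)'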

See Also