Monitoring Sentinel health, readiness, and upstream status.
Health Endpoints
Liveness Check
The /health endpoint returns 200 OK whenever Sentinel is running.
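A quick way to exercise it by hand (localhost and the 9090 admin port are placeholders that match the examples later on this page):

curl -i http://localhost:9090/health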
Configure the health route:
routes {
    route "health" {
        priority 1000
        matches {
            path "/health"
        }
        service-type "builtin"
        builtin-handler "health"
    }
}
Status Endpoint
The /status endpoint returns a more detailed status report than /health.
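For example (same placeholder host and admin port as above; jq is only used to pretty-print the JSON):

curl -s http://localhost:9090/status | jq .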
Upstream Health
Per-upstream health can also be queried; the response reports whether each configured upstream is currently considered healthy, based on the health checks described later on this page.
Kubernetes Probes
Liveness Probe
Detect if Sentinel needs a restart:
livenessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Readiness Probe
Detect if Sentinel is ready to receive traffic:
readinessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
Startup Probe
For slow-starting instances:
startupProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 30  # 150 seconds max startup
Complete Kubernetes Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentinel
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sentinel
  template:
    metadata:
      labels:
        app: sentinel
    spec:
      containers:
        - name: sentinel
          image: sentinel:latest
          ports:
            - name: http
              containerPort: 8080
            - name: admin
              containerPort: 9090
          livenessProbe:
            httpGet:
              path: /health
              port: admin
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: admin
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
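To roll this out and confirm the probes pass, a minimal sequence looks like the following (the manifest file name is a placeholder):

kubectl apply -f sentinel-deployment.yaml
kubectl rollout status deployment/sentinel
kubectl get pods -l app=sentinel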
Load Balancer Health Checks
AWS ALB/NLB
- Target type: instance or ip
- Health check path: /health
- Health check port: 9090
- Healthy threshold: 2
- Unhealthy threshold: 3
- Timeout: 5 seconds
- Interval: 10 seconds
- Success codes: 200
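These settings correspond to a target group such as the one sketched below with the AWS CLI (the target group name, VPC ID, and target type are placeholders):

aws elbv2 create-target-group \
  --name sentinel-tg \
  --protocol HTTP --port 8080 \
  --vpc-id vpc-0123456789abcdef0 \
  --target-type ip \
  --health-check-protocol HTTP \
  --health-check-path /health \
  --health-check-port 9090 \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200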
GCP Load Balancer
healthChecks:
  - name: sentinel-health
    type: HTTP
    httpHealthCheck:
      port: 9090
      requestPath: /health
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 3
HAProxy Backend Check
backend sentinel_backend
    option httpchk GET /health
    http-check expect status 200
    server sentinel1 10.0.1.1:8080 check port 9090
    server sentinel2 10.0.1.2:8080 check port 9090
Upstream Health Checks
HTTP Health Check
upstreams {
    upstream "backend" {
        health-check {
            type "http" {
                path "/health"
                expected-status 200
                host "backend.internal"
            }
            interval-secs 10
            timeout-secs 5
            healthy-threshold 2
            unhealthy-threshold 3
        }
    }
}
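To reproduce by hand roughly what this check sends, a request with the overridden Host header looks like the following (the backend address is a placeholder):

curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Host: backend.internal' \
  --max-time 5 \
  http://10.0.2.15:8080/health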
TCP Health Check
For non-HTTP services:
upstreams {
    upstream "database" {
        health-check {
            type "tcp"
            interval-secs 5
            timeout-secs 2
            healthy-threshold 2
            unhealthy-threshold 3
        }
    }
}
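The equivalent manual test is a plain TCP connect, for example with netcat (address and port are placeholders; -w 2 mirrors the 2-second timeout above):

nc -z -v -w 2 10.0.3.12 5432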
gRPC Health Check
upstreams {
    upstream "grpc-service" {
        health-check {
            type "grpc" {
                service "grpc.health.v1.Health"
            }
            interval-secs 10
            timeout-secs 5
        }
    }
}
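To probe the same service manually you can use grpcurl, assuming the backend implements the standard gRPC health service and has server reflection enabled (the address is a placeholder):

grpcurl -plaintext 10.0.4.10:50051 grpc.health.v1.Health/Check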
Inference Health Check
For LLM/AI inference backends, use the inference health check to verify specific models are loaded and available. This goes beyond a simple HTTP 200 check by parsing the /v1/models endpoint response and confirming expected models are present:
upstreams {
    upstream "gpu-cluster" {
        health-check {
            type "inference" {
                endpoint "/v1/models"
                expected-models "llama-3-70b" "codellama-34b"
            }
            interval-secs 30
            timeout-secs 10
            healthy-threshold 2
            unhealthy-threshold 3
        }
    }
}
The inference health check:
- Sends a GET request to the models endpoint (OpenAI-compatible /v1/models or Ollama /api/tags)
- Parses the JSON response to extract available model IDs
- Verifies all expected models are present (supports prefix matching for versioned models, e.g. gpt-4 matching gpt-4-turbo)
- Marks the backend unhealthy if any expected model is missing
This is particularly useful for GPU backends where models may need time to load after restart, or when running multiple model variants across a cluster.
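To see the same model list the check parses, you can query the endpoint directly (hosts and ports are placeholders; jq is assumed to be installed):

# OpenAI-compatible backend
curl -s http://10.0.5.20:8000/v1/models | jq -r '.data[].id'

# Ollama backend
curl -s http://10.0.5.21:11434/api/tags | jq -r '.models[].name'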
Health Check Tuning
| Scenario | Interval | Timeout | Healthy threshold | Unhealthy threshold |
|---|---|---|---|---|
| Fast failover | 5s | 2s | 2 | 2 |
| Default | 10s | 5s | 2 | 3 |
| Stable (reduce flapping) | 30s | 10s | 3 | 5 |
| Slow backends | 30s | 15s | 2 | 3 |
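As a concrete example, the fast-failover row maps onto a health-check block like the following sketch (upstream name and path are illustrative; the syntax follows the examples above):

upstreams {
    upstream "backend" {
        health-check {
            type "http" {
                path "/health"
                expected-status 200
            }
            interval-secs 5
            timeout-secs 2
            healthy-threshold 2
            unhealthy-threshold 2
        }
    }
}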
Monitoring Key Metrics
Request Metrics
# Request rate
rate(sentinel_requests_total[5m])
# Error rate
sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
/ sum(rate(sentinel_requests_total[5m]))
# P99 latency
histogram_quantile(0.99,
rate(sentinel_request_duration_seconds_bucket[5m]))
Upstream Metrics
# Upstream failure rate
sum(rate(sentinel_upstream_failures_total[5m])) by (upstream)
/ sum(rate(sentinel_upstream_attempts_total[5m])) by (upstream)
# Circuit breaker status (1 = open)
sentinel_circuit_breaker_state{component="upstream"}
# Connection pool utilization
(sentinel_connection_pool_size - sentinel_connection_pool_idle)
/ sentinel_connection_pool_size
System Metrics
# Memory usage
sentinel_memory_usage_bytes
# Active connections
sentinel_open_connections
# Active requests
sentinel_active_requests
Alerting
Critical Alerts
groups:
  - name: sentinel-critical
    rules:
      # High error rate
      - alert: SentinelHighErrorRate
        expr: |
          sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
            / sum(rate(sentinel_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Sentinel error rate above 5%"

      # All upstreams unhealthy
      - alert: SentinelNoHealthyUpstreams
        expr: |
          sum(sentinel_circuit_breaker_state{component="upstream"})
            == count(sentinel_circuit_breaker_state{component="upstream"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy upstream servers"

      # Sentinel down
      - alert: SentinelDown
        expr: up{job="sentinel"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sentinel instance down"
Warning Alerts
groups:
  - name: sentinel-warning
    rules:
      # High latency
      - alert: SentinelHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(sentinel_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1 second"

      # Circuit breaker open
      - alert: SentinelCircuitBreakerOpen
        expr: sentinel_circuit_breaker_state == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.component }}"

      # High memory usage
      - alert: SentinelHighMemory
        expr: |
          sentinel_memory_usage_bytes
            / on() node_memory_MemTotal_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 80%"
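Before loading these rule groups into Prometheus, they can be validated with promtool (the file name is a placeholder):

promtool check rules sentinel-alerts.yml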
Dashboards
Key Panels
- Traffic Overview
  - Request rate (RPS)
  - Error rate (%)
  - Active requests
- Latency
  - P50, P95, P99 latency
  - Latency by route
- Upstream Health
  - Upstream status (healthy/unhealthy)
  - Connection pool utilization
  - Circuit breaker states
- System Resources
  - Memory usage
  - CPU usage
  - Open connections
Grafana Variables
# Datasource
datasource: prometheus

# Variables
- name: instance
  query: label_values(sentinel_requests_total, instance)
- name: route
  query: label_values(sentinel_requests_total, route)
- name: upstream
  query: label_values(sentinel_upstream_attempts_total, upstream)
External Health Monitoring
Synthetic Monitoring
Use external monitors to verify end-to-end health. For example, with curl (the public URL and the latency threshold are placeholders):

# Simple availability check
curl -fsS -o /dev/null https://sentinel.example.com/health || echo "availability check failed"

# Response time check
response_time=$(curl -o /dev/null -sS -w '%{time_total}' https://sentinel.example.com/health)
if (( $(echo "$response_time > 1.0" | bc -l) )); then
  echo "health endpoint slow: ${response_time}s"
fi
Recommended Tools
- Uptime monitoring: Pingdom, UptimeRobot, Datadog Synthetics
- APM: Datadog, New Relic, Dynatrace
- Logs: Elasticsearch/Kibana, Loki/Grafana, Splunk
See Also
- Troubleshooting - Diagnosing issues
- Metrics Reference - All available metrics
- Deployment - Production deployment guides