Production monitoring and observability for Sentinel deployments.
Metrics Endpoint
Sentinel exposes Prometheus metrics on the configured address:
observability {
    metrics {
        enabled #true
        address "0.0.0.0:9090"
        path "/metrics"
    }
}
Verify:
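Assuming Sentinel is running with the configuration above, the endpoint should return Prometheus exposition text (the sample output lines are illustrative):

curl -s http://localhost:9090/metrics | head

# HELP sentinel_requests_total Total requests by route, method, status
# TYPE sentinel_requests_total counter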
Prometheus Setup
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Sentinel proxy
  - job_name: 'sentinel'
    static_configs:
      - targets:
          - 'sentinel:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  # Sentinel agents
  - job_name: 'sentinel-agents'
    static_configs:
      - targets:
          - 'sentinel-waf:9091'
          - 'sentinel-auth:9092'
          - 'sentinel-ratelimit:9093'
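Before (re)loading Prometheus, the scrape configuration can be validated with promtool, which ships with Prometheus:

promtool check config prometheus.yml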
Docker Compose
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'

volumes:
  prometheus-data:
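Start Prometheus and confirm both Sentinel jobs show as UP on the targets page (the UI is published on host port 9091 to avoid clashing with Sentinel's metrics port):

docker compose up -d prometheus
# then open http://localhost:9091/targets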
Kubernetes ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sentinel
  labels:
    app: sentinel
spec:
  selector:
    matchLabels:
      app: sentinel
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
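The ServiceMonitor selects a Service by the app: sentinel label and scrapes the port named metrics, so the Service in front of Sentinel must expose such a port. A minimal sketch (the Service name is an assumption; adjust to your manifests):

apiVersion: v1
kind: Service
metadata:
  name: sentinel
  labels:
    app: sentinel
spec:
  selector:
    app: sentinel
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090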
Key Metrics
Request Metrics
| Metric | Type | Description |
|---|---|---|
| sentinel_requests_total | Counter | Total requests by route, method, status |
| sentinel_request_duration_seconds | Histogram | Request latency distribution |
| sentinel_request_size_bytes | Histogram | Request body size |
| sentinel_response_size_bytes | Histogram | Response body size |
Upstream Metrics
| Metric | Type | Description |
|---|---|---|
| sentinel_upstream_requests_total | Counter | Requests per upstream target |
| sentinel_upstream_latency_seconds | Histogram | Upstream response time |
| sentinel_upstream_health | Gauge | Target health (1=healthy, 0=unhealthy) |
| sentinel_upstream_connections_active | Gauge | Active connections per upstream |
Agent Metrics
| Metric | Type | Description |
|---|---|---|
| sentinel_agent_duration_seconds | Histogram | Agent processing time |
| sentinel_agent_errors_total | Counter | Agent errors by type |
| sentinel_agent_decisions_total | Counter | Agent decisions (allow/block) |
System Metrics
| Metric | Type | Description |
|---|---|---|
| sentinel_connections_active | Gauge | Active client connections |
| sentinel_connections_total | Counter | Total connections |
| process_cpu_seconds_total | Counter | CPU usage |
| process_resident_memory_bytes | Gauge | Memory usage |
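For orientation, a labelled counter from the tables above appears in the raw /metrics output roughly like this (label values are illustrative; the exact label set may vary by version):

sentinel_requests_total{route="default",method="GET",status="200"} 1027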
Essential PromQL Queries
Request Rate
# Requests per second
rate(sentinel_requests_total[5m])
# By route
sum by (route) (rate(sentinel_requests_total[5m]))
# By status code
sum by (status) (rate(sentinel_requests_total[5m]))
Error Rate
# 5xx error rate
sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
/ sum(rate(sentinel_requests_total[5m])) * 100
# 4xx rate
sum(rate(sentinel_requests_total{status=~"4.."}[5m]))
/ sum(rate(sentinel_requests_total[5m])) * 100
Latency
# 50th percentile
histogram_quantile(0.50, rate(sentinel_request_duration_seconds_bucket[5m]))
# 95th percentile
histogram_quantile(0.95, rate(sentinel_request_duration_seconds_bucket[5m]))
# 99th percentile
histogram_quantile(0.99, rate(sentinel_request_duration_seconds_bucket[5m]))
Upstream Health
# Unhealthy upstreams
sentinel_upstream_health == 0
# Upstream latency p95
histogram_quantile(0.95, rate(sentinel_upstream_latency_seconds_bucket[5m]))
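If these queries back dashboards, precomputing them as Prometheus recording rules keeps panels cheap to render. A sketch, with rule names following the common level:metric:operation convention (the names themselves are assumptions):

groups:
  - name: sentinel-recording
    rules:
      - record: job:sentinel_request_error_ratio:rate5m
        expr: |
          sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
          / sum(rate(sentinel_requests_total[5m]))
      - record: job:sentinel_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(sentinel_request_duration_seconds_bucket[5m])))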
Alerting Rules
alerts.yml
groups:
  - name: sentinel
    rules:
      # High error rate
      - alert: SentinelHighErrorRate
        expr: |
          sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
          / sum(rate(sentinel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on Sentinel"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: SentinelHighLatency
        expr: |
          histogram_quantile(0.95, rate(sentinel_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on Sentinel"
          description: "p95 latency is {{ $value | humanizeDuration }}"

      # Upstream down
      - alert: SentinelUpstreamDown
        expr: sentinel_upstream_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Upstream target is down"
          description: "{{ $labels.upstream }}/{{ $labels.target }} is unhealthy"

      # No requests
      - alert: SentinelNoTraffic
        expr: |
          sum(rate(sentinel_requests_total[5m])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No traffic to Sentinel"
          description: "Sentinel has received no requests in 5 minutes"

      # Agent errors
      - alert: SentinelAgentErrors
        expr: rate(sentinel_agent_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent errors detected"
          description: "Agent {{ $labels.agent }} has errors"

      # High memory
      - alert: SentinelHighMemory
        expr: |
          process_resident_memory_bytes / 1024 / 1024 > 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Sentinel using {{ $value | humanize }}MB"
Grafana Dashboards
Dashboard JSON
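A minimal dashboard skeleton to import and extend, reusing the request-rate and p95 latency queries from above; the panel set and layout here are a sketch, not a full dashboard (a complete board would add error-rate, upstream, and agent panels):

{
  "title": "Sentinel Overview",
  "refresh": "30s",
  "panels": [
    {
      "type": "timeseries",
      "title": "Requests per second by route",
      "targets": [
        { "expr": "sum by (route) (rate(sentinel_requests_total[5m]))" }
      ]
    },
    {
      "type": "timeseries",
      "title": "p95 request latency",
      "targets": [
        { "expr": "histogram_quantile(0.95, rate(sentinel_request_duration_seconds_bucket[5m]))" }
      ]
    }
  ]
}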
Health Checks
Sentinel Health Endpoint
# Simple health check (port and path match the probe configuration below)
curl http://localhost:9090/health

# Example response (field names illustrative)
{
  "status": "healthy"
}
Detailed Health
curl http://localhost:9090/health/detailed

# Example response (fields illustrative; this is the endpoint the readiness probe below checks)
{
  "status": "healthy",
  "upstreams": { "backend": "healthy" },
  "agents": { "waf": "healthy", "auth": "healthy", "ratelimit": "healthy" }
}
Kubernetes Probes
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: sentinel
      livenessProbe:
        httpGet:
          path: /health
          port: 9090
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/detailed
          port: 9090
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 2
Logging
Structured Logging
system {
    worker-threads 0
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
    }
}

routes {
    route "default" {
        matches { path-prefix "/" }
        upstream "backend"
    }
}

upstreams {
    upstream "backend" {
        targets {
            target { address "127.0.0.1:3000" }
        }
    }
}
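The base configuration above omits the logging settings themselves. A sketch of what an observability logs block might look like, mirroring the metrics block earlier on this page; the key names (level, format, output) are assumptions, so check them against your Sentinel version:

observability {
    logs {
        level "info"
        format "json"
        output "stdout"
    }
}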
Log Output
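With JSON logging enabled, each request is expected to produce one structured line along these lines (field names are illustrative; level and status are the fields the Loki pipeline below extracts):

{"ts":"2024-01-15T10:23:45Z","level":"info","method":"GET","path":"/api/users","route":"default","upstream":"backend","status":200,"duration_ms":12}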
Log Aggregation with Loki
# promtail.yml
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: sentinel
    static_configs:
      - targets:
          - localhost
        labels:
          job: sentinel
          __path__: /var/log/sentinel/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            status: status
      - labels:
          level:
          status:
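Once logs are flowing, they can be queried in Grafana Explore with LogQL using the labels attached above, for example:

# all error-level log lines
{job="sentinel", level="error"}

# rate of 5xx responses derived from the extracted status label
sum by (status) (count_over_time({job="sentinel", status=~"5.."}[5m]))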
Distributed Tracing
OpenTelemetry Configuration
system {
    worker-threads 0
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
    }
}

routes {
    route "default" {
        matches { path-prefix "/" }
        upstream "backend"
    }
}

upstreams {
    upstream "backend" {
        targets {
            target { address "127.0.0.1:3000" }
        }
    }
}
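As with logging, the tracing settings are not shown in the base configuration. A sketch of an observability tracing block that exports OTLP to the Jaeger instance described below; the key names (enabled, endpoint, service-name, sample-rate) are assumptions to verify against your Sentinel version:

observability {
    tracing {
        enabled #true
        endpoint "http://jaeger:4317"
        service-name "sentinel"
        sample-rate 0.1
    }
}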
Jaeger Setup
services:
  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"   # UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
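Bring Jaeger up alongside Sentinel and traces should appear in the UI once requests flow through the proxy:

docker compose up -d jaeger
# UI: http://localhost:16686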
SLA Monitoring
SLI/SLO Dashboard
# Availability SLI (non-5xx responses)
sum(rate(sentinel_requests_total{status!~"5.."}[5m]))
/ sum(rate(sentinel_requests_total[5m]))
# Latency SLI (requests under 200ms)
sum(rate(sentinel_request_duration_seconds_bucket{le="0.2"}[5m]))
/ sum(rate(sentinel_request_duration_seconds_count[5m]))
# Error budget remaining (99.9% SLO)
1 - (
(1 - (sum(rate(sentinel_requests_total{status!~"5.."}[30d]))
/ sum(rate(sentinel_requests_total[30d]))))
/ (1 - 0.999)
)
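To page on the SLO rather than on raw error rate, a common pattern is a multi-window burn-rate alert. A sketch of a rule to add under the sentinel group in alerts.yml for the 99.9% availability SLO above, using the usual fast-burn factor of 14.4x the error budget (tune windows and thresholds to taste):

- alert: SentinelErrorBudgetFastBurn
  expr: |
    (
      sum(rate(sentinel_requests_total{status=~"5.."}[1h]))
      / sum(rate(sentinel_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
      / sum(rate(sentinel_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error-budget burn against the 99.9% availability SLO"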
Next Steps
- Rolling Updates - Zero-downtime updates
- Kubernetes - Cloud-native deployment
- Docker Compose - Container orchestration