Prometheus metrics exposed by Sentinel for monitoring and alerting.
Metrics Endpoint
Metrics are served at the /metrics endpoint on the admin listener.
Configure the admin listener and a route that maps /metrics to the built-in metrics handler:
listeners {
    listener "admin" {
        address "127.0.0.1:9090"
        protocol "http"
    }
}
routes {
    route "metrics" {
        matches {
            path "/metrics"
        }
        service-type "builtin"
        builtin-handler "metrics"
    }
}
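To confirm the endpoint is wired up, the response can be parsed by hand. The following is a minimal sketch of a parser for the Prometheus text exposition format. It assumes simple label values (no escaped quotes), and the sample payload is illustrative rather than captured Sentinel output.

```python
import re

# Matches lines such as: name{label="v",other="w"} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

# Illustrative payload (assumption); real text would come from the admin listener.
EXAMPLE = '''
# TYPE sentinel_requests_total counter
sentinel_requests_total{route="api",method="GET",status="200"} 1027
sentinel_active_requests 3
'''

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

With the listener configured as above, the text could be fetched with `urllib.request.urlopen("http://127.0.0.1:9090/metrics").read().decode()` and passed to `parse_metrics`.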
Request Metrics
sentinel_request_duration_seconds
Request latency histogram.
| Type | Labels | Description |
|---|---|---|
| Histogram | route, method | Request duration in seconds |
Buckets: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
Example queries:
# Average latency by route
sum(rate(sentinel_request_duration_seconds_sum[5m])) by (route)
/ sum(rate(sentinel_request_duration_seconds_count[5m])) by (route)
# P99 latency across all routes
histogram_quantile(0.99,
sum(rate(sentinel_request_duration_seconds_bucket[5m])) by (le))
# P95 latency by route
histogram_quantile(0.95,
sum(rate(sentinel_request_duration_seconds_bucket[5m])) by (le, route))
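Quantiles computed from a histogram are estimates: the result is interpolated inside whichever bucket contains the target rank, so accuracy depends on the bucket boundaries listed above. The sketch below approximates that interpolation for a single series. It is a simplified illustration, not Prometheus's implementation, and the bucket counts in the usage lines are made up.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for le=0.1, 0.25, 0.5 and +Inf.
observed = [(0.1, 50), (0.25, 90), (0.5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, observed)
```

Here the 95th-percentile rank lands halfway through the 0.25s to 0.5s bucket, so the estimate sits between those bounds even if every real request in that bucket took 0.26s.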
sentinel_requests_total
Total request counter.
| Type | Labels | Description |
|---|---|---|
| Counter | route, method, status | Total requests |
Example queries:
# Requests per second
rate(sentinel_requests_total[5m])
# Error rate (5xx)
sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
/ sum(rate(sentinel_requests_total[5m]))
# Success rate (2xx) by route
sum(rate(sentinel_requests_total{status=~"2.."}[5m])) by (route)
/ sum(rate(sentinel_requests_total[5m])) by (route)
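The error-rate query divides the summed rate of 5xx series by the summed rate of all series. A minimal sketch of the same arithmetic, assuming the per-series rates have already been computed and are passed in as (labels, rate) pairs:

```python
import re

def error_ratio(samples, status_pattern=r"5.."):
    """samples: iterable of (labels, per_second_rate). Returns the matching share of traffic."""
    pattern = re.compile(status_pattern)
    total = 0.0
    errors = 0.0
    for labels, rate in samples:
        total += rate
        # PromQL regex matchers are fully anchored, hence fullmatch rather than search.
        if pattern.fullmatch(labels.get("status", "")):
            errors += rate
    # With no traffic the ratio is undefined; NaN mirrors floating-point 0/0.
    return errors / total if total else float("nan")

# Hypothetical per-series request rates.
rates = [({"status": "200"}, 90.0), ({"status": "503"}, 10.0), ({"status": "404"}, 0.0)]
ratio = error_ratio(rates)
```

Because the matcher is anchored, `5..` matches exactly three-character codes starting with 5 and nothing longer.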
sentinel_active_requests
Currently active requests.
| Type | Labels | Description |
|---|---|---|
| Gauge | - | Number of in-flight requests |
Example queries:
# Current active requests
sentinel_active_requests
# Alert if too high
sentinel_active_requests > 1000
sentinel_request_body_size_bytes
Request body size histogram.
| Type | Labels | Description |
|---|---|---|
| Histogram | route | Request body size in bytes |
Buckets: 100B, 1KB, 10KB, 100KB, 1MB, 10MB, 100MB
sentinel_response_body_size_bytes
Response body size histogram.
| Type | Labels | Description |
|---|---|---|
| Histogram | route | Response body size in bytes |
Scoped Metrics
When using namespaces and services, additional metrics with scope labels are available.
sentinel_scoped_request_duration_seconds
Request latency with namespace/service labels.
| Type | Labels | Description |
|---|---|---|
| Histogram | namespace, service, route, method | Request duration in seconds |
Example queries:
# P99 latency by namespace
histogram_quantile(0.99,
sum(rate(sentinel_scoped_request_duration_seconds_bucket[5m])) by (le, namespace))
# Compare latency across services
histogram_quantile(0.95,
sum(rate(sentinel_scoped_request_duration_seconds_bucket[5m])) by (le, namespace, service))
sentinel_scoped_requests_total
Request counter with namespace/service labels.
| Type | Labels | Description |
|---|---|---|
| Counter | namespace, service, route, method, status | Total requests |
Example queries:
# Request rate by namespace
sum(rate(sentinel_scoped_requests_total[5m])) by (namespace)
# Error rate by service
sum(rate(sentinel_scoped_requests_total{status=~"5.."}[5m])) by (namespace, service)
/ sum(rate(sentinel_scoped_requests_total[5m])) by (namespace, service)
# Top 10 busiest services
topk(10, sum(rate(sentinel_scoped_requests_total[5m])) by (namespace, service))
sentinel_scoped_active_requests
Active requests gauge with namespace/service labels.
| Type | Labels | Description |
|---|---|---|
| Gauge | namespace, service | In-flight requests per scope |
sentinel_scoped_upstream_attempts_total
Upstream attempts with scope labels.
| Type | Labels | Description |
|---|---|---|
| Counter | namespace, service, upstream, route | Connection attempts |
sentinel_scoped_upstream_failures_total
Upstream failures with scope labels.
| Type | Labels | Description |
|---|---|---|
| Counter | namespace, service, upstream, route, reason | Connection failures |
Example queries:
# Failure rate by namespace
sum(rate(sentinel_scoped_upstream_failures_total[5m])) by (namespace)
/ sum(rate(sentinel_scoped_upstream_attempts_total[5m])) by (namespace)
sentinel_scoped_rate_limit_hits_total
Rate limit hits with scope labels.
| Type | Labels | Description |
|---|---|---|
| Counter | namespace, service, route, policy | Rate limit violations |
Example queries:
# Rate limit hits by namespace
sum(rate(sentinel_scoped_rate_limit_hits_total[5m])) by (namespace)
# Services hitting rate limits
sum(rate(sentinel_scoped_rate_limit_hits_total[5m])) by (namespace, service) > 0
sentinel_scoped_circuit_breaker_state
Circuit breaker state with scope labels.
| Type | Labels | Description |
|---|---|---|
| Gauge | namespace, service, upstream | State: 0=closed, 1=open |
Example queries:
# All open circuit breakers (one series per namespace, service, and upstream)
sentinel_scoped_circuit_breaker_state == 1
# Count of open circuit breakers per namespace
count(sentinel_scoped_circuit_breaker_state == 1) by (namespace)
Upstream Metrics
sentinel_upstream_attempts_total
Upstream connection attempts.
| Type | Labels | Description |
|---|---|---|
| Counter | upstream, route | Total connection attempts |
sentinel_upstream_failures_total
Upstream connection failures.
| Type | Labels | Description |
|---|---|---|
| Counter | upstream, route, reason | Total failures |
Reason values:
- connection_refused - TCP connection refused
- connection_timeout - Connection timed out
- read_timeout - Read timeout
- write_timeout - Write timeout
- tls_error - TLS handshake failed
- dns_error - DNS resolution failed
Example queries:
# Failure rate by upstream
sum(rate(sentinel_upstream_failures_total[5m])) by (upstream)
/ sum(rate(sentinel_upstream_attempts_total[5m])) by (upstream)
# Connection refused errors
sum(rate(sentinel_upstream_failures_total{reason="connection_refused"}[5m])) by (upstream)
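The failure-rate query aggregates both counters by upstream before dividing, and the division only yields a result for upstreams present on both sides. The sketch below mimics that behavior on precomputed per-series rates. The input shape and upstream names are assumptions for illustration.

```python
from collections import defaultdict

def sum_by(samples, label):
    """Sum values of (labels, value) pairs grouped by one label."""
    grouped = defaultdict(float)
    for labels, value in samples:
        grouped[labels.get(label, "")] += value
    return dict(grouped)

def failure_ratio_by_upstream(failures, attempts):
    failed = sum_by(failures, "upstream")
    attempted = sum_by(attempts, "upstream")
    # Only upstreams that appear on both sides produce a result, as with PromQL
    # vector matching. Zero-attempt upstreams are skipped here for simplicity;
    # PromQL itself would return +Inf or NaN for them.
    return {
        name: failed[name] / attempted[name]
        for name in failed
        if attempted.get(name, 0) > 0
    }

# Hypothetical per-series rates.
failure_rates = [
    ({"upstream": "a", "reason": "dns_error"}, 1.0),
    ({"upstream": "a", "reason": "tls_error"}, 1.0),
]
attempt_rates = [({"upstream": "a"}, 20.0), ({"upstream": "b"}, 5.0)]
ratios = failure_ratio_by_upstream(failure_rates, attempt_rates)
```

Upstream `b` is missing from the result because it has no failure series yet, which is why a dashboard built on this query can show no data for a perfectly healthy upstream.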
sentinel_circuit_breaker_state
Circuit breaker state.
| Type | Labels | Description |
|---|---|---|
| Gauge | component, route | State: 0=closed, 1=open |
Example queries:
# Open circuit breakers
sentinel_circuit_breaker_state == 1
# Alert on circuit breaker open
sentinel_circuit_breaker_state{component="upstream"} == 1
Agent Metrics
sentinel_agent_latency_seconds
Agent call latency histogram.
| Type | Labels | Description |
|---|---|---|
| Histogram | agent, event | Agent call duration |
Event values:
- on_request_headers
- on_request_body
- on_response_headers
- on_response_body
Example queries:
# P99 agent latency across all agents
histogram_quantile(0.99,
sum(rate(sentinel_agent_latency_seconds_bucket[5m])) by (le))
# Average latency by agent
sum(rate(sentinel_agent_latency_seconds_sum[5m])) by (agent)
/ sum(rate(sentinel_agent_latency_seconds_count[5m])) by (agent)
sentinel_agent_timeouts_total
Agent call timeouts.
| Type | Labels | Description |
|---|---|---|
| Counter | agent, event | Total timeouts |
Example queries:
# Timeout rate by agent
sum(rate(sentinel_agent_timeouts_total[5m])) by (agent)
# Alert on high timeout rate
rate(sentinel_agent_timeouts_total[5m]) > 0.1
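The 0.1 threshold is a per-second rate, so it corresponds to roughly one timeout every ten seconds. As a rough sketch of how a per-second rate is derived from raw counter samples, the function below treats any decrease as a counter reset (for example after a restart). Unlike PromQL's rate(), it does not extrapolate to the edges of the window, and the sample values are invented.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.

    Returns the average per-second increase, or None if it cannot be computed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # After a reset the counter restarts from zero, so the new value
            # is itself the increase since the reset.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

# Hypothetical scrapes 15 seconds apart, with a reset before the third one.
scrapes = [(0, 100), (15, 103), (30, 2), (45, 5)]
timeouts_per_second = simple_rate(scrapes)
```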
sentinel_blocked_requests_total
Requests blocked by agents/WAF.
| Type | Labels | Description |
|---|---|---|
| Counter | reason | Total blocked requests |
Reason values:
- waf - Blocked by WAF
- auth - Authentication failed
- rate_limit - Rate limited
- policy - Policy violation
Connection Pool Metrics
sentinel_connection_pool_size
Total connections in pool.
| Type | Labels | Description |
|---|---|---|
| Gauge | upstream | Total connections |
sentinel_connection_pool_idle
Idle connections in pool.
| Type | Labels | Description |
|---|---|---|
| Gauge | upstream | Idle connections |
sentinel_connection_pool_acquired_total
Connections acquired from pool.
| Type | Labels | Description |
|---|---|---|
| Counter | upstream | Total acquisitions |
Example queries:
# Pool utilization
(sentinel_connection_pool_size - sentinel_connection_pool_idle)
/ sentinel_connection_pool_size
# Connection acquisition rate
rate(sentinel_connection_pool_acquired_total[5m])
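The utilization expression is undefined when a pool has no connections, which shows up as NaN in PromQL. A small sketch of the same formula with that case handled explicitly (the function name is illustrative):

```python
def pool_utilization(size, idle):
    """Fraction of pooled connections currently in use, or None for an empty pool."""
    if size <= 0:
        return None
    return (size - idle) / size

# A pool of 20 connections with 5 idle is 75% utilized.
utilization = pool_utilization(20, 5)
```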
TLS Metrics
sentinel_tls_handshake_duration_seconds
TLS handshake duration.
| Type | Labels | Description |
|---|---|---|
| Histogram | version | Handshake duration |
Version values: TLS1.2, TLS1.3
System Metrics
sentinel_memory_usage_bytes
Process memory usage.
| Type | Labels | Description |
|---|---|---|
| Gauge | - | Memory usage in bytes |
sentinel_cpu_usage_percent
CPU usage percentage.
| Type | Labels | Description |
|---|---|---|
| Gauge | - | CPU usage 0-100 |
sentinel_open_connections
Open connections count.
| Type | Labels | Description |
|---|---|---|
| Gauge | - | Number of open connections |
Prometheus Configuration
Basic Scrape Config
scrape_configs:
  - job_name: 'sentinel'
    static_configs:
      - targets:
    scrape_interval: 15s
    metrics_path: /metrics
With Service Discovery
scrape_configs:
  - job_name: 'sentinel'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
        regex: sentinel
        action: keep
      - source_labels:
        regex: metrics
        action: keep
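Each keep rule drops every discovered target whose source label values do not match the regex, and relabel regexes are fully anchored. The sketch below illustrates that filtering. The source_labels values are not shown above, so the label names `app` and `port_name` are hypothetical stand-ins, as are the pod entries.

```python
import re

def keep(targets, source_labels, regex, separator=";"):
    """Apply a relabel 'keep' action: retain targets whose joined label values match."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Values of all source labels are joined with the separator before matching.
        joined = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(joined):  # anchored: "sentinel" does not match "sentinel-canary"
            kept.append(labels)
    return kept

# Hypothetical discovered pods.
pods = [
    {"app": "sentinel", "port_name": "metrics"},
    {"app": "sentinel", "port_name": "http"},
    {"app": "sentinel-canary", "port_name": "metrics"},
]
scraped = keep(keep(pods, ["app"], "sentinel"), ["port_name"], "metrics")
```

Rules are applied in order, so only targets that pass both filters remain.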
Alerting Rules
Example Alerts
groups:
  - name: sentinel
    rules:
      # High error rate
      - alert: SentinelHighErrorRate
        expr: |
          sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
          / sum(rate(sentinel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on Sentinel"
          description: "Error rate is {{ $value | humanizePercentage }}"
      # Circuit breaker open
      - alert: SentinelCircuitBreakerOpen
        expr: sentinel_circuit_breaker_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open"
          description: "Circuit breaker open for {{ $labels.component }}"
      # High latency
      - alert: SentinelHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(sentinel_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency"
          description: "P99 latency is {{ $value }}s"
      # Agent timeouts
      - alert: SentinelAgentTimeouts
        expr: rate(sentinel_agent_timeouts_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent timeouts detected"
          description: "Agent {{ $labels.agent }} timing out"
      # No healthy upstreams
      - alert: SentinelNoHealthyUpstreams
        expr: |
          sum(sentinel_circuit_breaker_state{component="upstream"})
          == count(sentinel_circuit_breaker_state{component="upstream"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy upstreams"
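Every rule above uses a `for` clause, which means the expression must stay true across consecutive evaluations for that long before the alert fires. Until then the alert is pending, and a single false evaluation resets the timer. A minimal sketch of that state machine, assuming evaluations arrive as (timestamp, condition) pairs:

```python
def alert_states(evaluations, for_seconds):
    """evaluations: list of (timestamp_seconds, condition_bool) in time order.

    Returns 'inactive', 'pending', or 'firing' for each evaluation.
    """
    active_since = None
    states = []
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# Hypothetical evaluations every 60 seconds against a rule with `for: 1m`.
timeline = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
states = alert_states(timeline, 60)
```

A brief recovery at 120 seconds sends the alert back to inactive, so it has to hold for a full minute again before firing.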
Grafana Dashboard
Key panels for a Sentinel dashboard:
- Request Rate - rate(sentinel_requests_total[5m])
- Error Rate - 5xx / total
- Latency P50/P95/P99 - histogram_quantile
- Active Requests - sentinel_active_requests
- Upstream Health - circuit breaker states
- Agent Latency - agent_latency histogram
- Connection Pool - size vs idle
- Memory/CPU - system metrics
See Also
- Observability - Logging and tracing
- Error Codes - Error types and codes