Complete observability setup with Prometheus metrics, Grafana dashboards, and Jaeger distributed tracing.
Use Case
- Monitor request rates, latencies, and errors
- Visualize traffic patterns and health status
- Trace requests across services
- Alert on anomalies
Architecture
┌─────────────────┐
│ Sentinel │
│ :8080/:9090 │
└────────┬────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│Prometheus │ │ Grafana │ │ Jaeger │
│ :9091 │◄─────│ :3000 │ │ :16686 │
└───────────┘ └───────────┘ └───────────┘
Configuration
Create sentinel.kdl:
// Observability Configuration
// Metrics, logging, and distributed tracing
server {
worker-threads 0
graceful-shutdown-timeout-secs 30
}
listeners {
listener "http" {
address "0.0.0.0:8080"
protocol "http"
}
}
routes {
route "api" {
matches {
path-prefix "/api/"
}
upstream "backend"
}
}
upstreams {
upstream "backend" {
targets {
target { address "127.0.0.1:3000" }
}
health-check {
type "http" {
path "/health"
}
interval-secs 10
}
}
}
observability {
// Prometheus metrics endpoint
metrics {
enabled true
address "0.0.0.0:9090"
path "/metrics"
}
// Structured JSON logging
logging {
level "info"
format "json"
access-log {
enabled true
fields ["method" "path" "status" "latency" "upstream" "client_ip"]
}
}
// OpenTelemetry tracing
tracing {
enabled true
service-name "sentinel"
endpoint "http://jaeger:4317"
protocol "grpc" // or "http"
sample-rate 1.0 // Sample all requests (reduce in production)
propagation "w3c" // W3C Trace Context
}
}
Prometheus Setup
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'sentinel'
    static_configs:
      - targets:
          - 'sentinel:9090'
    metrics_path: /metrics

  - job_name: 'sentinel-agents'
    static_configs:
      - targets:
          - 'sentinel-waf:9091'
          - 'sentinel-auth:9092'
          - 'sentinel-ratelimit:9093'
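Once the stack is running (the docker-compose file below maps Prometheus to host port 9091), you can confirm that the Sentinel targets are actually being scraped. A quick check, assuming `jq` is installed:
# List scrape targets and their health
curl -s http://localhost:9091/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'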
Key Metrics
| Metric | Type | Description |
|---|---|---|
| `sentinel_requests_total` | Counter | Total requests by route, method, status |
| `sentinel_request_duration_seconds` | Histogram | Request latency distribution |
| `sentinel_upstream_requests_total` | Counter | Requests per upstream target |
| `sentinel_upstream_latency_seconds` | Histogram | Upstream response times |
| `sentinel_upstream_health` | Gauge | Upstream health (1=healthy, 0=unhealthy) |
| `sentinel_connections_active` | Gauge | Active client connections |
| `sentinel_agent_duration_seconds` | Histogram | Agent processing time |
| `sentinel_agent_errors_total` | Counter | Agent errors by type |
Useful PromQL Queries
# Request rate (requests per second)
rate(sentinel_requests_total[5m])
# Error rate (5xx responses)
rate(sentinel_requests_total{status=~"5.."}[5m]) / rate(sentinel_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(sentinel_request_duration_seconds_bucket[5m]))
# Upstream health status
sentinel_upstream_health
# Requests by route
sum by (route) (rate(sentinel_requests_total[5m]))
Grafana Dashboard
dashboard.json
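The full dashboard JSON is not reproduced here. As a minimal starting point, the sketch below can be dropped into ./grafana/dashboards/ (the directory mounted in the docker-compose file) and extended; it assumes the default Prometheus data source and reuses the PromQL queries above:
{
  "title": "Sentinel Overview",
  "uid": "sentinel-overview",
  "schemaVersion": 39,
  "time": { "from": "now-1h", "to": "now" },
  "panels": [
    {
      "type": "timeseries",
      "title": "Request rate by route",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        { "expr": "sum by (route) (rate(sentinel_requests_total[5m]))" }
      ]
    },
    {
      "type": "timeseries",
      "title": "p95 request latency",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        { "expr": "histogram_quantile(0.95, rate(sentinel_request_duration_seconds_bucket[5m]))" }
      ]
    }
  ]
}
File-based dashboards also need a dashboard provider under ./grafana/provisioning/dashboards pointing at /var/lib/grafana/dashboards.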
Jaeger Tracing
docker-compose.yml
version: '3.8'
services:
sentinel:
image: ghcr.io/raskell-io/sentinel:latest
ports:
- "8080:8080"
- "9090:9090"
volumes:
- ./sentinel.kdl:/etc/sentinel/sentinel.kdl
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
jaeger:
image: jaegertracing/all-in-one:1.50
ports:
- "16686:16686" # UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
prometheus:
image: prom/prometheus:v2.47.0
ports:
- "9091:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
grafana:
image: grafana/grafana:10.1.0
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
- ./grafana/dashboards:/var/lib/grafana/dashboards
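Bring the stack up and open the UIs on the ports mapped above:
docker compose up -d
# Grafana:    http://localhost:3000  (admin / admin, from GF_SECURITY_ADMIN_PASSWORD)
# Prometheus: http://localhost:9091
# Jaeger:     http://localhost:16686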
Trace Context Propagation
Sentinel propagates trace context through requests. The backend receives these headers:
- `traceparent` - W3C Trace Context
- `tracestate` - Vendor-specific trace state
- `X-Request-Id` - Sentinel request ID
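A client can also start the trace itself by sending a `traceparent` header in the W3C format (`version-traceid-parentid-flags`); with W3C propagation enabled, Sentinel should continue that trace rather than start a new one. For example (trace and parent IDs taken from the W3C spec example):
curl http://localhost:8080/api/ \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"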
Viewing Traces
- Open Jaeger UI: http://localhost:16686
- Select service: `sentinel`
- Find traces by:
  - Operation (route name)
  - Tags (status, method, path)
  - Duration
  - Request ID
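Traces can also be pulled from the HTTP API that backs the Jaeger UI (an internal, unversioned API, so treat it as a convenience for ad-hoc checks rather than a stable interface):
# List operation names from the five most recent sentinel traces
curl -s "http://localhost:16686/api/traces?service=sentinel&limit=5" \
  | jq '.data[].spans[].operationName'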
Alerting
Prometheus Alerting Rules
Create alerts.yml:
groups:
- name: sentinel
rules:
- alert: HighErrorRate
expr: |
sum(rate(sentinel_requests_total{status=~"5.."}[5m]))
/ sum(rate(sentinel_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95, rate(sentinel_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "95th percentile latency is {{ $value }}s"
- alert: UpstreamDown
expr: sentinel_upstream_health == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Upstream target is down"
description: "{{ $labels.upstream }}/{{ $labels.target }} is unhealthy"
- alert: AgentErrors
expr: rate(sentinel_agent_errors_total[5m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Agent errors detected"
description: "Agent {{ $labels.agent }} has errors"
Log Aggregation
Structured Log Output
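With `format "json"` and the access-log fields configured above, each request is written as one JSON object per line. Exact field names can vary between versions; a representative line might look like:
{"timestamp":"2024-01-15T10:23:45.123Z","level":"info","method":"GET","path":"/api/users","status":200,"latency":0.042,"upstream":"backend","client_ip":"203.0.113.10"}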
Loki Integration
# promtail scrape configuration (ships Sentinel logs to Loki)
scrape_configs:
  - job_name: sentinel
    static_configs:
      - targets:
          - localhost
        labels:
          job: sentinel
          __path__: /var/log/sentinel/*.log
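Note that the snippet above is the Promtail scrape side; Promtail also needs a `clients` entry pointing at Loki's push endpoint (for example `http://loki:3100/loki/api/v1/push`). Once logs are flowing, the JSON fields can be queried directly with LogQL, e.g. all 5xx responses:
{job="sentinel"} | json | status >= 500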
Testing
Verify Metrics
# Fetch the Prometheus metrics endpoint exposed on :9090
curl -s http://localhost:9090/metrics | grep sentinel_requests_total
Generate Test Traffic
# Install hey (HTTP load generator)
go install github.com/rakyll/hey@latest   # or: brew install hey

# Generate load against the configured route
hey -n 1000 -c 10 http://localhost:8080/api/
Check Traces
# Make a traced request
curl -H "X-Request-Id: test-trace-123" http://localhost:8080/api/

# Find in Jaeger by tag: request_id=test-trace-123
Next Steps
- Security - Add WAF and auth monitoring
- Microservices - Trace across services
- Load Balancer - Monitor upstream distribution