Monitoring & Observability¶
The project implements the three pillars of observability: metrics (Prometheus), logs (Loki), and real-time dashboards (Grafana). Every service exposes a /metrics endpoint, and all container logs are collected automatically.
Stack¶
| Tool | Role | Retention |
|---|---|---|
| Prometheus | Time-series metrics, scrapes | 3 days / 256 MB |
| Grafana | Visualization, dashboards, log exploration | – |
| Loki | Log aggregation (like Prometheus, but for logs) | 72 hours |
| Promtail | Log shipper, Docker service discovery | – |
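For orientation, a stack like this is commonly wired together in Docker Compose along the following lines. This is an illustrative sketch only — image tags, ports, and volume paths are assumptions, not the project's actual compose file:

```yaml
# Hypothetical sketch of the monitoring stack -- not the project's real file.
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro  # read container logs
      - /var/run/docker.sock:/var/run/docker.sock                 # Docker service discovery
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```

Promtail mounts the Docker socket so it can discover containers dynamically, which is what enables the per-container filtering described below.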
Grafana Dashboards¶
Three pre-provisioned, read-only dashboards ship with the project. They’re automatically loaded on startup (Docker Compose: volume mounts, Kubernetes: ConfigMap sidecar).
HTTP Metrics (/grafana/d/http-metrics)¶
Answers: “Is the API healthy? Where are the bottlenecks?”
Request rate per service (req/s)
Error rate (5xx) per service
Latency percentiles (p50, p95, p99)
Requests by status code (stacked bars)
Top 10 endpoints by request count
Application Logs (/grafana/d/application-logs)¶
Answers: “What happened? What errors are occurring?”
Log volume per container (stacked bars)
Error log count (error/exception/traceback keywords)
Live log stream with full-text search
Filterable by container name
Event Pipeline (/grafana/d/event-pipeline)¶
Answers: “Is the event bus healthy? Are messages being processed?”
Message throughput per stream (msg/s)
Processing error rate by stream and consumer group
Processing latency (p50, p95, p99)
Dead-letter queue rate and cumulative count
Success rate gauge (green > 99%, yellow > 95%, red below)
Messages by consumer group (stacked bars)
Prometheus Metrics¶
HTTP Metrics (all services)¶
Defined in shared/src/shared/http_metrics.py. Middleware automatically instruments every request.
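The real middleware in `shared/src/shared/http_metrics.py` presumably uses `prometheus_client`; the core idea can be shown with a dependency-free sketch, where a wrapper counts every request by `(method, path, status)` and records its latency:

```python
import time
from collections import defaultdict

# Dependency-free stand-ins for a labelled Prometheus Counter and Histogram;
# the project's real middleware would use prometheus_client equivalents.
request_count = defaultdict(int)     # (method, path, status) -> count
request_latency = defaultdict(list)  # (method, path) -> [seconds, ...]

def instrument(handler):
    """Wrap a request handler so every call is counted and timed."""
    def wrapped(method, path):
        start = time.perf_counter()
        status = handler(method, path)
        elapsed = time.perf_counter() - start
        request_count[(method, path, status)] += 1
        request_latency[(method, path)].append(elapsed)
        return status
    return wrapped

@instrument
def handle(method, path):
    return 200  # pretend every request succeeds

handle("GET", "/orders")
handle("GET", "/orders")
print(request_count[("GET", "/orders", 200)])  # -> 2
```

Because the wrapper sits around every handler, no endpoint needs to opt in — the same property the middleware gives the real services.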
| Metric | Type | Labels |
|---|---|---|
| | Counter | |
| | Histogram | |
Note
The GZip middleware is configured to skip /metrics to prevent Prometheus from receiving compressed responses it can’t parse.
Stream Metrics (all consumers)¶
Defined in shared/src/shared/redis/metrics.py. Consumer group processing is automatically instrumented.
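The instrumentation in `shared/src/shared/redis/metrics.py` presumably wraps each consumer's handler so that every processed message increments a counter labelled by stream, consumer group, and outcome. A minimal dependency-free sketch of that pattern:

```python
from collections import defaultdict

# Stand-in for a counter like stream_messages_processed_total{stream, group, status};
# the real code would use a prometheus_client Counter.
processed_total = defaultdict(int)  # (stream, group, status) -> count

def process_with_metrics(stream, group, handler, message):
    """Run a consumer handler, recording success or error per stream and group."""
    try:
        handler(message)
        processed_total[(stream, group, "success")] += 1
    except Exception:
        processed_total[(stream, group, "error")] += 1
        raise  # count the failure, then let the consumer's retry logic see it

process_with_metrics("orders", "delivery", lambda m: None, {"id": 1})
try:
    process_with_metrics("orders", "delivery", lambda m: 1 / 0, {"id": 2})
except ZeroDivisionError:
    pass
print(processed_total[("orders", "delivery", "success")],
      processed_total[("orders", "delivery", "error")])  # -> 1 1
```

Re-raising after counting keeps the metric honest without swallowing errors the consumer group needs for redelivery.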
| Metric | Type | Labels |
|---|---|---|
| | Counter | |
| | Histogram | |
| | Counter | |
The `status` label on `stream_messages_processed_total` distinguishes success from error, enabling per-stream error-rate calculations.
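A per-stream error-rate panel can then be built from a PromQL query along these lines (a sketch — the label names must match what the code actually exports):

```promql
sum by (stream) (rate(stream_messages_processed_total{status="error"}[5m]))
/
sum by (stream) (rate(stream_messages_processed_total[5m]))
```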
Quick Links (Docker Compose)¶
| Tool | URL | Access |
|---|---|---|
| Grafana (admin) | | admin / admin |
| HTTP Metrics dashboard | /grafana/d/http-metrics | anonymous viewer |
| Application Logs dashboard | /grafana/d/application-logs | anonymous viewer |
| Event Pipeline dashboard | /grafana/d/event-pipeline | anonymous viewer |
| Prometheus | | read-only (GET only) |
Access Control¶
Demo-safe by default:

- Prometheus is proxied through NGINX with `limit_except GET { deny all; }` – users can query metrics but cannot modify configuration or delete data
- Grafana anonymous users get the Viewer role – dashboards are visible without login, but editing and deletion are blocked
- Provisioned dashboards are marked `editable: false` and `disableDeletion: true`
Data Retention¶
Lightweight retention policies keep resource usage bounded, which matters on a Raspberry Pi cluster:
Prometheus: 3-day time retention + 256 MB size cap (whichever triggers first)
Loki: 72-hour retention with compactor auto-cleanup, 4 MB/s ingestion rate limit, 8 MB burst
Promtail: Only collects logs from application containers (order-service, delivery-service, notifications-service, order-simulator, nginx-proxy, frontend) – monitoring stack logs are excluded to avoid feedback loops
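These policies are typically expressed as Prometheus startup flags and Loki configuration. A sketch mirroring the values above (the surrounding file layout is an assumption; the flag and key names are standard for both tools):

```yaml
# Prometheus startup flags (e.g. in the container's command):
#   --storage.tsdb.retention.time=3d
#   --storage.tsdb.retention.size=256MB
# (whichever limit is hit first triggers cleanup)

# Loki config sketch:
limits_config:
  retention_period: 72h
  ingestion_rate_mb: 4        # 4 MB/s ingestion rate limit
  ingestion_burst_size_mb: 8  # 8 MB burst
compactor:
  retention_enabled: true     # compactor performs the auto-cleanup
```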
Correlation & Tracing¶
Instead of a dedicated tracing system (Jaeger, Zipkin), the project uses `correlation_id` propagation:

1. The order service generates a UUID `correlation_id` when an order is created
2. The ID is included in every event envelope published to Redis Streams
3. Every service logs the `correlation_id` with each action
4. In Loki/Grafana, you can filter by `correlation_id` to see the full lifecycle of an order across all services
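The propagation pattern can be sketched as follows. The envelope fields and function names here are illustrative assumptions, not the project's actual event schema:

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("order-service")

def make_envelope(event_type, payload, correlation_id=None):
    """Build an event envelope. A new correlation_id is minted at the start
    of a flow; every downstream service propagates it unchanged."""
    return {
        "type": event_type,
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "payload": payload,
    }

# The order service starts the chain:
created = make_envelope("order.created", {"order_id": 42})

# A downstream service reuses the same ID in its own events and log lines:
dispatched = make_envelope(
    "delivery.dispatched", {"order_id": 42},
    correlation_id=created["correlation_id"],
)
log.info(json.dumps({"correlation_id": dispatched["correlation_id"],
                     "action": "dispatched"}))
```

Because every log line carries the same ID as structured output, a single Loki label or full-text filter on that value reconstructs the order's path through all services.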