Monitoring & Observability¶

The project covers observability with metrics (Prometheus), logs (Loki), and real-time dashboards (Grafana); in place of distributed tracing, it propagates a `correlation_id` across services. Every service exposes a `/metrics` endpoint, and logs from all application containers are collected automatically.

Stack¶

Tool	Role	Retention
Prometheus	Time-series metrics, scrapes `/metrics` every 15s	3 days / 256 MB
Grafana	Visualization, dashboards, log exploration	–
Loki	Log aggregation (like Prometheus, but for logs)	72 hours
Promtail	Log shipper, Docker service discovery	–

Grafana Dashboards¶

Three pre-provisioned, read-only dashboards ship with the project. They’re automatically loaded on startup (Docker Compose: volume mounts, Kubernetes: ConfigMap sidecar).

HTTP Metrics (`/grafana/d/http-metrics`)¶

Answers: “Is the API healthy? Where are the bottlenecks?”

Request rate per service (req/s)
Error rate (5xx) per service
Latency percentiles (p50, p95, p99)
Requests by status code (stacked bars)
Top 10 endpoints by request count
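
Percentiles like these are normally estimated from the cumulative buckets of a histogram such as `http_request_duration_seconds`, not from raw samples. The following is a simplified sketch of that estimation, assuming linear interpolation inside a bucket (the general approach behind Prometheus's `histogram_quantile`). The bucket bounds and counts are invented for illustration and are not taken from the project.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, ending with the math.inf ("+Inf") bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: 100 requests in total.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)} ~ {estimate_quantile(q, example_buckets):.3f}s")
```

Because the estimate depends on bucket boundaries, a p99 shown on the dashboard can never be more precise than the width of the bucket it falls into.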

Application Logs (`/grafana/d/application-logs`)¶

Answers: “What happened? What errors are occurring?”

Log volume per container (stacked bars)
Error log count (error/exception/traceback keywords)
Live log stream with full-text search
Filterable by container name
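
The keyword-based error count can be pictured as a simple pattern match over log lines, grouped by container. This is a minimal sketch in Python, not the dashboard's actual query; case-insensitive matching is an assumption, and the sample log lines are invented.

```python
import re
from collections import Counter

# Keywords named in the panel description; case-insensitivity is assumed.
ERROR_PATTERN = re.compile(r"error|exception|traceback", re.IGNORECASE)


def count_error_lines(entries):
    """Count matching log lines per container.

    entries: iterable of (container_name, log_line) pairs.
    """
    counts = Counter()
    for container, line in entries:
        if ERROR_PATTERN.search(line):
            counts[container] += 1
    return counts


sample = [
    ("order-service", "INFO order accepted"),
    ("order-service", "ERROR failed to publish event"),
    ("delivery-service", "Traceback (most recent call last):"),
    ("delivery-service", "ValueError: unexpected payload"),
]
print(count_error_lines(sample))
```

Keyword matching is cheap but imprecise: a line such as "0 errors found" would also be counted, so the panel is best read as a trend indicator.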

Event Pipeline (`/grafana/d/event-pipeline`)¶

Answers: “Is the event bus healthy? Are messages being processed?”

Message throughput per stream (msg/s)
Processing error rate by stream and consumer group
Processing latency (p50, p95, p99)
Dead-letter queue rate and cumulative count
Success rate gauge (green above 99%, yellow above 95%, red otherwise)
Messages by consumer group (stacked bars)
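
The gauge's colour bands can be expressed as a small mapping from success rate to colour. The sketch below assumes the rate is computed from success and error counts and treats both thresholds as strict comparisons; how the real panel handles values exactly on a boundary is not stated in this document.

```python
def success_rate(success_count, error_count):
    """Return the success rate in percent, or None when nothing was processed."""
    total = success_count + error_count
    if total == 0:
        return None
    return 100.0 * success_count / total


def gauge_color(rate_percent):
    """Map a success rate (percent) to the gauge colour bands."""
    if rate_percent > 99:
        return "green"
    if rate_percent > 95:
        return "yellow"
    return "red"


rate = success_rate(success_count=970, error_count=30)
print(rate, gauge_color(rate))
```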

Prometheus Metrics¶

HTTP Metrics (all services)¶

Defined in `shared/src/shared/http_metrics.py`. Middleware automatically instruments every request.

Metric	Type	Labels
`http_requests_total`	Counter	`method`, `path`, `status_code`
`http_request_duration_seconds`	Histogram	`method`, `path`

Note

The GZip middleware is configured to skip /metrics to prevent Prometheus from receiving compressed responses it can’t parse.

Stream Metrics (all consumers)¶

Defined in `shared/src/shared/redis/metrics.py`. Consumer group processing is automatically instrumented.

Metric	Type	Labels
`stream_messages_processed_total`	Counter	`stream`, `group`, `status`
`stream_message_duration_seconds`	Histogram	`stream`, `group`
`stream_dlq_messages_total`	Counter	`stream`, `group`

The `status` label on `stream_messages_processed_total` distinguishes success from error, enabling per-stream error rate calculations.
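
As a worked example of that calculation, the sketch below derives a per-stream error ratio from two snapshots of the counter. It assumes the label values are literally `success` and `error`; the stream and group names are hypothetical. Because counters only ever increase, the ratio is computed from the increase between snapshots, and a drop is treated as a counter reset.

```python
from collections import defaultdict


def error_rates(before, after):
    """Per-(stream, group) error ratio between two counter snapshots.

    Each snapshot maps (stream, group, status) -> cumulative counter value.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for (stream, group, status), value in after.items():
        increase = value - before.get((stream, group, status), 0.0)
        if increase < 0:
            # Counter went backwards: the process restarted, so count from zero.
            increase = value
        totals[(stream, group)][status] += increase

    rates = {}
    for key, by_status in totals.items():
        total = sum(by_status.values())
        if total > 0:
            rates[key] = by_status.get("error", 0.0) / total
    return rates


# Hypothetical snapshots taken one scrape interval apart.
earlier = {("orders", "delivery", "success"): 100, ("orders", "delivery", "error"): 5}
later = {("orders", "delivery", "success"): 190, ("orders", "delivery", "error"): 15}
print(error_rates(earlier, later))
```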

Quick Links (Docker Compose)¶

Tool	URL	Access
Grafana (admin)	http://localhost/grafana/	admin / admin
HTTP Metrics dashboard	http://localhost/grafana/d/http-metrics	anonymous viewer
Application Logs dashboard	http://localhost/grafana/d/application-logs	anonymous viewer
Event Pipeline dashboard	http://localhost/grafana/d/event-pipeline	anonymous viewer
Prometheus	http://localhost/prometheus/	read-only (GET only)

Access Control¶

Demo-safe by default:

Prometheus is proxied through NGINX with `limit_except GET { deny all; }` – users can query metrics but cannot modify configuration or delete data
Grafana anonymous users get the Viewer role – dashboards are visible without login, but editing and deletion are blocked
Provisioned dashboards are marked `editable: false` and `disableDeletion: true`

Data Retention¶

Lightweight retention policies keep resource usage bounded, which matters on a Raspberry Pi cluster:

Prometheus: 3-day time retention + 256 MB size cap (whichever triggers first)
Loki: 72-hour retention with compactor auto-cleanup, 4 MB/s ingestion rate limit, 8 MB burst
Promtail: Only collects logs from application containers (order-service, delivery-service, notifications-service, order-simulator, nginx-proxy, frontend) – monitoring stack logs are excluded to avoid feedback loops
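
The relationship between the Loki ingestion rate limit and the burst size is easiest to see as a token bucket: the bucket refills at the sustained rate and can hold at most the burst size. The class below is a conceptual model of those semantics only, not Loki's limiter; `TokenBucket` and its parameters are illustrative names.

```python
MB = 1024 * 1024


class TokenBucket:
    """Token bucket that refills at `rate` bytes/s and holds at most `burst` bytes."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, size, now):
        """Return True and consume tokens if `size` bytes fit at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


limiter = TokenBucket(rate=4 * MB, burst=8 * MB)
print(limiter.allow(8 * MB, now=0.0))  # a full burst fits immediately
print(limiter.allow(1 * MB, now=0.0))  # bucket is empty right after the burst
print(limiter.allow(4 * MB, now=1.0))  # one second later, 4 MB has refilled
```

Under this model a noisy container can briefly push up to 8 MB at once, but sustained volume above 4 MB/s is rejected, which keeps memory and disk usage predictable on small nodes.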

Correlation & Tracing¶

Instead of a dedicated tracing system (Jaeger, Zipkin), the project propagates a `correlation_id`:

The order service generates a UUID `correlation_id` when an order is created
The ID is included in every event envelope published to Redis Streams
Every service logs the `correlation_id` with each action
In Loki/Grafana, you can filter by `correlation_id` to see the full lifecycle of an order across all services
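
The propagation steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the envelope fields other than `correlation_id`, the event type names, the JSON log format, and the function names are all hypothetical, and publishing to Redis Streams is left out.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")


def log_action(service, action, correlation_id):
    """Emit one JSON log line tagged with the correlation_id and return it."""
    line = json.dumps(
        {"service": service, "action": action, "correlation_id": correlation_id}
    )
    logging.getLogger(service).info(line)
    return line


def create_order(order_id):
    """Order service: mint the correlation_id and put it in the envelope."""
    correlation_id = str(uuid.uuid4())
    log_action("order-service", "order_created", correlation_id)
    return {
        "type": "order.created",  # hypothetical event type
        "correlation_id": correlation_id,
        "payload": {"order_id": order_id},
    }


def handle_order_created(envelope):
    """Downstream service: reuse the incoming ID instead of minting a new one."""
    correlation_id = envelope["correlation_id"]
    log_action("delivery-service", "delivery_scheduled", correlation_id)
    return {
        "type": "delivery.scheduled",  # hypothetical event type
        "correlation_id": correlation_id,
        "payload": envelope["payload"],
    }


created = create_order("order-123")
scheduled = handle_order_created(created)
```

Both log lines carry the same ID, so a text filter on that value in the log explorer returns the order's path through both services.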
