# Monitoring & Observability
The project covers the **pillars of observability**: metrics (Prometheus), logs (Loki), and visualization (Grafana). Distributed tracing, the third classic pillar, is replaced by lightweight `correlation_id` propagation (see *Correlation & Tracing* below). Every service exposes a `/metrics` endpoint, and all container logs are collected automatically.
## Stack
| Tool | Role | Retention |
|------|------|-----------|
| **Prometheus** | Time-series metrics, scrapes `/metrics` every 15s | 3 days / 256 MB |
| **Grafana** | Visualization, dashboards, log exploration | -- |
| **Loki** | Log aggregation (like Prometheus, but for logs) | 72 hours |
| **Promtail** | Log shipper, Docker service discovery | -- |
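The 15-second scrape cadence in the table above is set in Prometheus's config. A minimal sketch of what that looks like (job name, port, and target address are illustrative, not the project's actual values):

```yaml
global:
  scrape_interval: 15s            # matches the cadence in the table above

scrape_configs:
  - job_name: order-service       # illustrative job name
    metrics_path: /metrics        # the endpoint every service exposes
    static_configs:
      - targets: ["order-service:8000"]   # hypothetical host:port
```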
## Grafana Dashboards
Three pre-provisioned, read-only dashboards ship with the project. They're automatically loaded on startup (Docker Compose: volume mounts, Kubernetes: ConfigMap sidecar).
### HTTP Metrics (`/grafana/d/http-metrics`)
Answers: *"Is the API healthy? Where are the bottlenecks?"*
- Request rate per service (req/s)
- Error rate (5xx) per service
- Latency percentiles (p50, p95, p99)
- Requests by status code (stacked bars)
- Top 10 endpoints by request count
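The panels above are driven by PromQL over the HTTP metrics defined later in this page; representative queries might look like this (the `job` label is an assumption about how targets are named):

```promql
# Request rate per service (req/s)
sum by (job) (rate(http_requests_total[5m]))

# 5xx error rate per service
sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))

# p95 latency
histogram_quantile(0.95,
  sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
```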
### Application Logs (`/grafana/d/application-logs`)
Answers: *"What happened? What errors are occurring?"*
- Log volume per container (stacked bars)
- Error log count (error/exception/traceback keywords)
- Live log stream with full-text search
- Filterable by container name
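Panels like these map to LogQL queries against Loki; a sketch, assuming Promtail attaches a `container` label (the exact label name depends on the Promtail relabeling config):

```logql
# Live log stream for one container, filtered by full-text search
{container="order-service"} |= "error"

# Error log count per container (the keyword set from the panel above)
sum by (container) (
  count_over_time({container=~".+"} |~ "(?i)(error|exception|traceback)" [5m]))
```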
### Event Pipeline (`/grafana/d/event-pipeline`)
Answers: *"Is the event bus healthy? Are messages being processed?"*
- Message throughput per stream (msg/s)
- Processing error rate by stream and consumer group
- Processing latency (p50, p95, p99)
- Dead-letter queue rate and cumulative count
- Success rate gauge (green > 99%, yellow > 95%, red below)
- Messages by consumer group (stacked bars)
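These panels are built from the stream metrics documented below; illustrative PromQL:

```promql
# Message throughput per stream (msg/s)
sum by (stream) (rate(stream_messages_processed_total[5m]))

# Processing error rate by stream and consumer group
sum by (stream, group) (rate(stream_messages_processed_total{status="error"}[5m]))

# Success rate (feeds the green/yellow/red gauge thresholds)
sum(rate(stream_messages_processed_total{status="success"}[5m]))
  / sum(rate(stream_messages_processed_total[5m]))
```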
## Prometheus Metrics
### HTTP Metrics (all services)
Defined in `shared/src/shared/http_metrics.py`. Middleware automatically instruments every request.
| Metric | Type | Labels |
|--------|------|--------|
| `http_requests_total` | Counter | `method`, `path`, `status_code` |
| `http_request_duration_seconds` | Histogram | `method`, `path` |
:::{note}
The GZip middleware is configured to skip `/metrics` to prevent Prometheus from receiving compressed responses it can't parse.
:::
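Conceptually, the middleware wraps each request, times it, and records the method/path/status labels. The sketch below shows the mechanics with plain dicts standing in for the Prometheus `Counter`/`Histogram` objects (the real implementation in `shared/src/shared/http_metrics.py` uses `prometheus_client`; all names here are illustrative):

```python
import time
from collections import defaultdict

# Stand-ins for http_requests_total and http_request_duration_seconds
REQUESTS_TOTAL = defaultdict(int)      # (method, path, status_code) -> count
REQUEST_DURATIONS = defaultdict(list)  # (method, path) -> [seconds, ...]


class MetricsMiddleware:
    """ASGI middleware: times each HTTP request and records its labels."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        method, path = scope["method"], scope["path"]
        start = time.perf_counter()
        captured = {}

        async def send_wrapper(message):
            # The status code rides on the "http.response.start" message
            if message["type"] == "http.response.start":
                captured["status"] = message["status"]
            await send(message)

        await self.app(scope, receive, send_wrapper)
        REQUESTS_TOTAL[(method, path, captured.get("status", 0))] += 1
        REQUEST_DURATIONS[(method, path)].append(time.perf_counter() - start)
```

In the real code the two dicts would be `prometheus_client.Counter(...).labels(...)` and `Histogram(...).labels(...)` calls, exported via the `/metrics` endpoint.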
### Stream Metrics (all consumers)
Defined in `shared/src/shared/redis/metrics.py`. Consumer group processing is automatically instrumented.
| Metric | Type | Labels |
|--------|------|--------|
| `stream_messages_processed_total` | Counter | `stream`, `group`, `status` |
| `stream_message_duration_seconds` | Histogram | `stream`, `group` |
| `stream_dlq_messages_total` | Counter | `stream`, `group` |
The `status` label on `stream_messages_processed_total` distinguishes `success` from `error`, enabling per-stream error rate calculations.
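The instrumentation pattern amounts to wrapping each handler so that outcome and duration are recorded per stream and consumer group. A minimal sketch with plain dicts standing in for the Prometheus objects (the real code in `shared/src/shared/redis/metrics.py` uses `prometheus_client`; names here are illustrative):

```python
import time
from collections import defaultdict

# Stand-ins for stream_messages_processed_total and stream_message_duration_seconds
PROCESSED = defaultdict(int)   # (stream, group, status) -> count
DURATIONS = defaultdict(list)  # (stream, group) -> [seconds, ...]


def instrument(stream, group):
    """Wrap a message handler, recording success/error counts and duration."""
    def decorator(handler):
        def wrapper(message):
            start = time.perf_counter()
            try:
                result = handler(message)
                PROCESSED[(stream, group, "success")] += 1
                return result
            except Exception:
                # Errors are counted separately, which is what makes
                # per-stream error-rate queries possible
                PROCESSED[(stream, group, "error")] += 1
                raise
            finally:
                DURATIONS[(stream, group)].append(time.perf_counter() - start)
        return wrapper
    return decorator
```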
## Quick Links (Docker Compose)
| Tool | URL | Access |
|------|-----|--------|
| Grafana (admin) | | admin / admin |
| HTTP Metrics dashboard | `/grafana/d/http-metrics` | anonymous viewer |
| Application Logs dashboard | `/grafana/d/application-logs` | anonymous viewer |
| Event Pipeline dashboard | `/grafana/d/event-pipeline` | anonymous viewer |
| Prometheus | | read-only (GET only) |
## Access Control
Demo-safe by default:
- **Prometheus** is proxied through NGINX with `limit_except GET { deny all; }` -- users can query metrics but cannot modify configuration or delete data
- **Grafana** anonymous users get the Viewer role -- dashboards are visible without login, but editing and deletion are blocked
- Provisioned dashboards are marked `editable: false` and `disableDeletion: true`
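For the Prometheus rule, the NGINX directive quoted above sits in a `location` block along these lines (the location path and upstream name are illustrative):

```nginx
# Read-only Prometheus proxy: anything other than GET is rejected,
# which blocks the admin and delete-series APIs
location /prometheus/ {
    limit_except GET { deny all; }
    proxy_pass http://prometheus:9090/;
}
```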
## Data Retention
Lightweight retention policies keep resource usage bounded, which matters on a Raspberry Pi cluster:
- **Prometheus**: 3-day time retention + 256 MB size cap (whichever triggers first)
- **Loki**: 72-hour retention with compactor auto-cleanup, 4 MB/s ingestion rate limit, 8 MB burst
- **Promtail**: Only collects logs from application containers (order-service, delivery-service, notifications-service, order-simulator, nginx-proxy, frontend) -- monitoring stack logs are excluded to avoid feedback loops
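In configuration terms, the Prometheus limits correspond to its standard retention flags, and the Loki limits to `limits_config`/`compactor` settings (a sketch; exact file layout depends on the Loki version):

```yaml
# Prometheus (command-line flags):
#   --storage.tsdb.retention.time=3d
#   --storage.tsdb.retention.size=256MB

# Loki config fragment:
limits_config:
  retention_period: 72h
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8
compactor:
  retention_enabled: true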
## Correlation & Tracing
Instead of a dedicated tracing system (Jaeger, Zipkin), the project uses `correlation_id` propagation:
1. Order service generates a UUID `correlation_id` when an order is created
2. The ID is included in every event envelope published to Redis Streams
3. Every service logs the `correlation_id` with each action
4. In Loki/Grafana, you can filter by `correlation_id` to see the full lifecycle of an order across all services
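The four steps above can be sketched as follows (envelope field names and event types are illustrative, not the project's actual schema):

```python
import json
import uuid


def make_envelope(event_type, payload, correlation_id=None):
    """Build an event envelope; mints a correlation_id if none is given."""
    return {
        "event_type": event_type,
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "payload": payload,
    }


# Step 1+2: order-service mints the ID and publishes it in the envelope
created = make_envelope("order.created", {"order_id": 42})

# Downstream services reuse the same ID in their own events
dispatched = make_envelope(
    "delivery.dispatched", {"order_id": 42},
    correlation_id=created["correlation_id"],
)

# Step 3: every log line carries the ID, so step 4 (filtering in
# Loki/Grafana by correlation_id) reconstructs the order's lifecycle
print(json.dumps({"msg": "order created",
                  "correlation_id": created["correlation_id"]}))
```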