Observability
Both Pulse planes are observable by default. Metrics are scraped by Prometheus, logs flow to Loki, and traces go to Tempo over OTLP — all visualized in Grafana. Nothing here requires an external service.
The stack
| Concern | Tool | Local URL |
|---|---|---|
| Dashboards | Grafana | http://localhost:3000 (admin / admin) |
| Metrics | Prometheus | http://localhost:9090 |
| Logs | Loki | :3100 |
| Traces | Tempo | OTLP gRPC :4317, HTTP :4318 |
Metrics endpoints
- API: Prometheus metrics are mounted at `/metrics` (Starlette canonicalizes to `/metrics/`, which is the path Prometheus scrapes). FastAPI and asyncpg are auto-instrumented.
- Worker: exposes metrics on `:9100` via a daemon-thread WSGI server.
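To make the worker side concrete, here is a minimal standard-library sketch of a daemon-thread WSGI server that serves counters in the Prometheus text format. It is an illustration under stated assumptions, not the worker's actual code: the `inc`, `render`, `start_metrics_server`, and `scrape` helpers are hypothetical, and a real worker would more likely rely on a Prometheus client library for registration and rendering.

```python
import threading
from urllib.request import urlopen
from wsgiref.simple_server import WSGIRequestHandler, make_server

# Hypothetical in-process counter store: {(metric_name, sorted_labels): value}.
_COUNTERS = {}
_LOCK = threading.Lock()


def inc(name, **labels):
    """Increment a labelled counter by one."""
    key = (name, tuple(sorted(labels.items())))
    with _LOCK:
        _COUNTERS[key] = _COUNTERS.get(key, 0) + 1


def render():
    """Render counters as Prometheus text exposition lines (no label escaping)."""
    lines = []
    with _LOCK:
        for (name, labels), value in sorted(_COUNTERS.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def metrics_app(environ, start_response):
    """WSGI app that answers every path with the current metrics."""
    body = render().encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; version=0.0.4; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


class _QuietHandler(WSGIRequestHandler):
    def log_message(self, format, *args):
        pass  # keep scrape requests out of the worker's stderr


def start_metrics_server(port=9100, host="127.0.0.1"):
    """Serve metrics from a daemon thread so it never blocks worker shutdown."""
    server = make_server(host, port, metrics_app, handler_class=_QuietHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(port, host="127.0.0.1"):
    """Fetch the metrics page the way a scraper would."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Because the thread is a daemon, the process can exit without joining it; the main consume loop only needs to call `inc(...)` as events are processed.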
Metrics catalog
| Metric | Type | Labels | Plane |
|---|---|---|---|
| `pulse_api_requests_total` | counter | route, status | API |
| `pulse_api_request_seconds` | histogram | route | API |
| `pulse_worker_events_consumed_total` | counter | tenant, dataset | worker |
| `pulse_worker_duplicates_dropped_total` | counter | tenant | worker |
| `pulse_worker_events_dlq_total` | counter | tenant, reason | worker |
| `pulse_worker_iceberg_append_errors_total` | counter | tenant, dataset | worker |
| `pulse_worker_batch_flush_seconds` | histogram | tenant, dataset | worker |
| `pulse_worker_buffer_rows` | gauge | tenant, dataset | worker |
The API also publishes a catalog of `pulse_api_collect_*` and `pulse_api_dataset_preview_*` series; these are wired up in a later phase.
Per-tenant label cardinality
The tenant label is fine for hundreds of tenants but will not scale to tens of thousands (Prometheus guidance: avoid labeling by unbounded sets). The planned mitigation is recording rules / top-N bucketing (tenant → tenant_bucket).
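One way the tenant → tenant_bucket mapping could work is sketched below. The mitigation is only planned, so everything here is an assumption: the `top_tenants` and `tenant_bucket` helpers are hypothetical, as is the choice to fold all tenants outside the top N into a single `other` bucket.

```python
from collections import Counter


def top_tenants(event_counts, top_n):
    """Return the set of the top_n busiest tenants by event count."""
    return {tenant for tenant, _ in Counter(event_counts).most_common(top_n)}


def tenant_bucket(tenant, top, overflow="other"):
    """Keep high-volume tenants as their own label value; fold the rest together."""
    return tenant if tenant in top else overflow


# Hypothetical per-tenant event counts taken from a recent window.
counts = {"acme": 900, "globex": 500, "initech": 3, "umbrella": 1}
top = top_tenants(counts, top_n=2)

labels = {tenant: tenant_bucket(tenant, top) for tenant in counts}
```

With this scheme the label takes at most `top_n + 1` distinct values regardless of how many tenants exist, at the cost of losing per-tenant detail for the long tail.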
Traces
Spans are exported over OTLP gRPC to tempo:4317 using a BatchSpanProcessor (which drops spans on queue overflow rather than blocking). aiokafka, asyncpg, and FastAPI are auto-instrumented, so a collect → publish → consume → append flow is traceable end to end. Configure the endpoint with PULSE_OTLP_ENDPOINT.
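The drop-on-overflow behavior mentioned above can be pictured with a toy bounded buffer. This is a simplified model written for illustration, not OpenTelemetry's implementation: the `DroppingBuffer` class and its method names are hypothetical, and the real processor also exports batches from a background thread on a timer.

```python
import queue


class DroppingBuffer:
    """Toy model of a span queue that drops new spans when full."""

    def __init__(self, max_queue_size):
        self._queue = queue.Queue(maxsize=max_queue_size)
        self.dropped = 0

    def on_end(self, span):
        """Enqueue a finished span without ever blocking the caller."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            # The request path is never stalled; the span is simply lost.
            self.dropped += 1
            return False

    def drain(self, max_batch_size):
        """Remove up to max_batch_size spans for export."""
        batch = []
        while len(batch) < max_batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

The practical consequence is that a slow or unreachable Tempo shows up as gaps in traces rather than as added request latency.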
Health checks
- API: `GET /ready` returns `{"status":"ready","checks":{"postgres":true,"kafka":true,"minio":true}}` when all dependencies are reachable; `GET /health` is a plain liveness probe.
- Worker: `GET :9101/healthz` returns 200 `healthy` only when the Kafka consumer holds ≥1 partition assignment and Postgres answers `SELECT 1` within 1s; otherwise 503. There's a brief false-negative window between consumer start and first heartbeat; the Compose healthcheck's `start_period` and `retries` absorb it.
See also
- Local Development — ports and make targets.
- System Overview — where telemetry fits in the topology.