
Observability

Both Pulse planes are observable by default. Metrics are scraped by Prometheus, logs flow to Loki, and traces go to Tempo over OTLP — all visualized in Grafana. Nothing here requires an external service.

The stack

| Concern | Tool | Local URL |
| --- | --- | --- |
| Dashboards | Grafana | http://localhost:3000 (admin / admin) |
| Metrics | Prometheus | http://localhost:9090 |
| Logs | Loki | :3100 |
| Traces | Tempo | OTLP gRPC :4317, HTTP :4318 |

Metrics endpoints

  • API: Prometheus metrics are mounted at /metrics (Starlette canonicalizes to /metrics/, which is the path Prometheus scrapes). FastAPI and asyncpg are auto-instrumented.
  • Worker: exposes metrics on :9100 via a daemon-thread WSGI server.
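The worker-side pattern (a WSGI app served from a daemon thread so it never blocks the consume loop or shutdown) can be sketched with the standard library alone. This is an illustration, not the worker's actual code: `render_metrics`, `metrics_app`, and `start_metrics_server` are hypothetical names, and the sample value is hard-coded. A real worker would more likely delegate both rendering and serving to a Prometheus client library.

```python
import threading
from wsgiref.simple_server import WSGIRequestHandler, make_server


class QuietHandler(WSGIRequestHandler):
    """Suppress per-request access logs so scrapes do not flood stderr."""

    def log_message(self, format, *args):
        pass


def render_metrics() -> bytes:
    # Hypothetical renderer: returns a hard-coded sample in the Prometheus
    # text exposition format. Real values would come from a metrics registry.
    lines = [
        "# TYPE pulse_worker_buffer_rows gauge",
        'pulse_worker_buffer_rows{tenant="acme",dataset="events"} 42',
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def metrics_app(environ, start_response):
    # Minimal WSGI app: every path returns the current metrics snapshot.
    body = render_metrics()
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; version=0.0.4; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]


def start_metrics_server(port: int = 9100, host: str = "0.0.0.0"):
    # daemon=True means the thread dies with the process instead of
    # keeping the worker alive after its main loop exits.
    server = make_server(host, port, metrics_app, handler_class=QuietHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling `start_metrics_server()` once at worker startup is enough; the returned server object can be used to call `shutdown()` in tests.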

Metrics catalog

| Metric | Type | Labels | Plane |
| --- | --- | --- | --- |
| pulse_api_requests_total | counter | route, status | API |
| pulse_api_request_seconds | histogram | route | API |
| pulse_worker_events_consumed_total | counter | tenant, dataset | worker |
| pulse_worker_duplicates_dropped_total | counter | tenant | worker |
| pulse_worker_events_dlq_total | counter | tenant, reason | worker |
| pulse_worker_iceberg_append_errors_total | counter | tenant, dataset | worker |
| pulse_worker_batch_flush_seconds | histogram | tenant, dataset | worker |
| pulse_worker_buffer_rows | gauge | tenant, dataset | worker |

The API also declares a catalog of pulse_api_collect_* and pulse_api_dataset_preview_* series; these will be wired up in a later phase.
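To make the catalog concrete, here is how three of the listed series could be declared with the `prometheus_client` library. This is a sketch, not Pulse's actual module: it assumes `prometheus_client` is the client library in use, and the help strings, the `record_request` helper, and the dedicated registry are illustrative choices. The metric names, types, and label sets come from the catalog above.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A dedicated registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

API_REQUESTS = Counter(
    "pulse_api_requests_total",
    "API requests handled.",  # help text is illustrative
    ["route", "status"],
    registry=registry,
)
API_REQUEST_SECONDS = Histogram(
    "pulse_api_request_seconds",
    "API request latency in seconds.",
    ["route"],
    registry=registry,
)
WORKER_BUFFER_ROWS = Gauge(
    "pulse_worker_buffer_rows",
    "Rows currently buffered before the next flush.",
    ["tenant", "dataset"],
    registry=registry,
)


def record_request(route: str, status: int, seconds: float) -> None:
    # Hypothetical helper: bump the counter and observe latency for one request.
    API_REQUESTS.labels(route=route, status=str(status)).inc()
    API_REQUEST_SECONDS.labels(route=route).observe(seconds)
```

Label values are always strings in the exposition format, which is why the status code is converted before use.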

Per-tenant label cardinality

The tenant label is fine for hundreds of tenants but will not scale to tens of thousands, because every distinct label value creates its own time series (Prometheus guidance: avoid labels with unbounded value sets). The planned mitigation is recording rules or top-N bucketing (tenant → tenant_bucket).
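One way to picture the tenant → tenant_bucket mapping is sketched below: the N busiest tenants keep their own label value and everyone else collapses into a shared bucket. The mitigation is only planned, so this is an assumption about its shape; the function name, the `"other"` bucket label, and the use of event counts as the ranking signal are all hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping


def build_tenant_buckets(event_counts: Mapping[str, int], top_n: int) -> Dict[str, str]:
    """Map each tenant to a bounded label value.

    The top_n tenants by event count keep their own name; all others
    share the "other" bucket, so the label has at most top_n + 1 values.
    """
    top = {tenant for tenant, _ in Counter(event_counts).most_common(top_n)}
    return {
        tenant: (tenant if tenant in top else "other")
        for tenant in event_counts
    }


counts = {"acme": 900, "globex": 500, "initech": 3, "hooli": 1}
buckets = build_tenant_buckets(counts, top_n=2)
# acme and globex keep their names; initech and hooli map to "other".
```

The cardinality bound holds regardless of how many tenants exist, at the cost of losing per-tenant resolution for the long tail.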

Traces

Spans are exported over OTLP gRPC to tempo:4317 using a BatchSpanProcessor (which drops spans on queue overflow rather than blocking). aiokafka, asyncpg, and FastAPI are auto-instrumented, so a collect → publish → consume → append flow is traceable end to end. Configure the endpoint with PULSE_OTLP_ENDPOINT.
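The drop-on-overflow behavior mentioned above is a deliberate trade-off: request handling never stalls on a slow or unreachable trace backend, at the price of losing spans under pressure. The sketch below illustrates that policy with a bounded standard-library queue. It is not the OpenTelemetry SDK's implementation, and `DroppingSpanQueue` is a hypothetical name.

```python
import queue
from typing import Any, List


class DroppingSpanQueue:
    """Bounded buffer that rejects new items when full instead of blocking."""

    def __init__(self, max_queue_size: int) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=max_queue_size)
        self.dropped = 0  # count of spans rejected because the buffer was full

    def offer(self, span: Any) -> bool:
        # put_nowait raises queue.Full immediately rather than waiting,
        # so the caller (the request path) is never blocked.
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch_size: int) -> List[Any]:
        # Pull up to max_batch_size spans for one export call.
        batch: List[Any] = []
        while len(batch) < max_batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

A background exporter would call `drain` periodically; a growing `dropped` count is the signal that the queue size or export interval needs tuning.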

Health checks

  • API: GET /ready returns {"status":"ready","checks":{"postgres":true,"kafka":true,"minio":true}} when all dependencies are reachable; GET /health is a plain liveness probe.
  • Worker: GET :9101/healthz returns 200 healthy only when the Kafka consumer holds ≥1 partition assignment and Postgres answers SELECT 1 within 1s; otherwise 503. There's a brief false-negative window between consumer start and first heartbeat — the Compose healthcheck's start_period and retries absorb it.
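The worker's health rule (at least one partition assigned, and Postgres answering within 1s) can be expressed as a small async function. This is a sketch under assumptions: `worker_health` and `ping_postgres` are hypothetical names, the partition count is passed in rather than read from a real consumer, and the 503 body strings are invented for illustration. Only the 200 "healthy" response and the 503 status come from the description above.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


async def worker_health(
    assigned_partitions: int,
    ping_postgres: Callable[[], Awaitable[None]],
    timeout_s: float = 1.0,
) -> Tuple[int, str]:
    """Return (status_code, body) for the worker health probe."""
    # No assignment means the consumer is not receiving any events.
    if assigned_partitions < 1:
        return 503, "unhealthy: no partition assignment"
    try:
        # ping_postgres stands in for running SELECT 1; the probe fails
        # if it does not complete within timeout_s.
        await asyncio.wait_for(ping_postgres(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return 503, "unhealthy: postgres timeout"
    except Exception:
        return 503, "unhealthy: postgres error"
    return 200, "healthy"
```

Because a freshly started consumer has no assignment yet, this check reports 503 until the first rebalance completes, which is the false-negative window that the Compose start_period and retries are there to absorb.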

See also