Skip to content

Datasets and the Data Lake

A dataset is where ingested events land and become queryable. Each dataset materializes one Apache Iceberg table on object storage, bound to a published PDM schema. The lake is vendor-neutral and multi-engine — Iceberg tables are readable by Spark, Trino, and DuckDB — rather than a proprietary store.

Anatomy of a dataset

Creating a dataset (POST /v1/datasets) does three things:

  1. Records a pdm_datasets row (tenant-scoped, RLS-isolated) with a slug, name, and the schema_uri it's bound to.
  2. Ensures an Iceberg namespace for the tenant — pulse_{tenant_uuid} — and a table named after the dataset slug.
  3. Partitions the table by day(timestamp) so time-range scans prune efficiently.
flowchart LR
    api[POST /v1/datasets] -->|row| pg[(Postgres<br/>pdm_datasets, RLS)]
    api -->|ensure_table| cat[pyiceberg SqlCatalog]
    cat -->|catalog rows| icebergpg[(iceberg_catalog schema)]
    cat -->|table metadata + data| minio[(MinIO<br/>s3://pulse/iceberg/)]

Storage layout

  • Object storage: MinIO locally (s3://pulse/iceberg/), S3-compatible in production.
  • Catalog: a pyiceberg SqlCatalog reusing the existing Postgres instance, but confined to a dedicated iceberg_catalog schema (it owns only iceberg_tables and iceberg_namespace_properties). Catalog plumbing never touches the application's public schema.
  • One table per (tenant, schema): namespace pulse_{tenant_uuid}, table = dataset slug.

How data arrives

The write path is asynchronous. POST /v1/collect returns 202 after publishing to Kafka; the ingestion worker consumes per-tenant topics, micro-batches by (tenant, dataset), and appends to Iceberg:

  • Micro-batch flush: every 30 seconds from the first buffered record, or at 10,000 rows, whichever comes first.
  • Exactly-once into Iceberg: a Postgres dedup gate (processed_event_ids) ensures a re-delivered event never produces a duplicate row. See the architecture overview.
  • Dead-letter routing: events that fail validation or can't be resolved go to the tenant's events.dlq topic instead of corrupting the table.

Querying: dataset preview

GET /v1/datasets/{slug}/preview?limit=N reads straight from the lake. The dataset row is first resolved under your tenant's RLS-scoped session, so a slug another tenant owns yields 404 and DuckDB never sees a foreign table. The API then runs a per-request DuckDB iceberg_scan over the table's current metadata in MinIO.

Tenant isolation precedes the query engine

Even when two tenants both have a dataset called orders, each preview resolves to that tenant's own pulse_{tenant} namespace. Isolation is enforced before the scan, not by it. See Multi-Tenancy and RLS.

Schema evolution caveats (v0.1)

  • A published schema's definition is immutable — evolving an event's shape means a new schema version, not an in-place edit.
  • Iceberg supports column add/rename/reorder, but Pulse does not yet expose dataset-level schema evolution operations; treat v0.1 datasets as append-only against a fixed shape.
  • Compaction/maintenance that rewrites manifests is not run in v0.1, which is why preview reads use allow_moved_paths = false.

See also