Datasets and the Data Lake¶
A dataset is where ingested events land and become queryable. Each dataset materializes one Apache Iceberg table on object storage, bound to a published PDM schema. The lake is vendor-neutral and multi-engine rather than a proprietary store: Iceberg tables are readable by Spark, Trino, and DuckDB.
Anatomy of a dataset¶
Creating a dataset (POST /v1/datasets) does three things:
- Records a pdm_datasets row (tenant-scoped, RLS-isolated) with a slug, name, and the schema_uri it's bound to.
- Ensures an Iceberg namespace for the tenant (pulse_{tenant_uuid}) and a table named after the dataset slug.
- Partitions the table by day(timestamp) so time-range scans prune efficiently.
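The naming and partitioning rules above can be illustrated with a small standard-library sketch. It is not Pulse's implementation: the function names are hypothetical, and it assumes the tenant UUID is interpolated into the namespace as an already-formatted string. The day calculation follows the Iceberg day transform, which maps a timestamp to whole days since 1970-01-01 UTC.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def table_identifier(tenant_uuid: str, dataset_slug: str) -> tuple[str, str]:
    """Return (namespace, table) using the pulse_{tenant_uuid} convention.

    Hypothetical helper: how the UUID is formatted inside the namespace
    is an assumption, so it is passed in as a ready-made string.
    """
    return (f"pulse_{tenant_uuid}", dataset_slug)


def day_partition(ts: datetime) -> int:
    """Whole days since 1970-01-01 UTC, as in Iceberg's day transform.

    Expects a timezone-aware datetime; every event from the same UTC day
    maps to the same value, which is what lets time-range scans skip files.
    """
    return (ts.astimezone(timezone.utc) - EPOCH).days


if __name__ == "__main__":
    print(table_identifier("TENANT_UUID", "orders"))
    print(day_partition(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)))
```

Two events with timestamps on the same UTC day share a partition value, so a query filtered to a single day only needs the files written for that value.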
flowchart LR
api[POST /v1/datasets] -->|row| pg[(Postgres<br/>pdm_datasets, RLS)]
api -->|ensure_table| cat[pyiceberg SqlCatalog]
cat -->|catalog rows| icebergpg[(iceberg_catalog schema)]
cat -->|table metadata + data| minio[(MinIO<br/>s3://pulse/iceberg/)]
Storage layout¶
- Object storage: MinIO locally (s3://pulse/iceberg/), S3-compatible in production.
- Catalog: a pyiceberg SqlCatalog reusing the existing Postgres instance, but confined to a dedicated iceberg_catalog schema (it owns only iceberg_tables and iceberg_namespace_properties). Catalog plumbing never touches the application's public schema.
- One table per (tenant, schema): namespace pulse_{tenant_uuid}, table = dataset slug.
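A catalog wired up along these lines might be configured as in the sketch below. The property keys (uri, warehouse, s3.endpoint, s3.access-key-id, s3.secret-access-key) are standard pyiceberg settings. Everything else is an assumption: the catalog name "pulse", the helper names, and in particular the use of a Postgres search_path option to confine the catalog tables to iceberg_catalog, which is one plausible mechanism and not necessarily the one Pulse uses.

```python
from urllib.parse import quote

WAREHOUSE = "s3://pulse/iceberg/"      # from the storage layout above
CATALOG_SCHEMA = "iceberg_catalog"     # dedicated Postgres schema


def catalog_properties(pg_dsn: str, s3_endpoint: str,
                       access_key: str, secret_key: str) -> dict[str, str]:
    """Build SqlCatalog properties (hypothetical helper).

    Assumption: the catalog is confined to its schema by setting
    search_path through the connection's `options` parameter.
    """
    sep = "&" if "?" in pg_dsn else "?"
    options = quote(f"-csearch_path={CATALOG_SCHEMA}")
    return {
        "uri": f"{pg_dsn}{sep}options={options}",
        "warehouse": WAREHOUSE,
        "s3.endpoint": s3_endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


def open_catalog(props: dict[str, str]):
    """Open the catalog; pyiceberg is imported lazily so the
    property builder above stays usable without it installed."""
    from pyiceberg.catalog.sql import SqlCatalog
    return SqlCatalog("pulse", **props)  # "pulse" is an illustrative name


if __name__ == "__main__":
    props = catalog_properties(
        "postgresql+psycopg2://user:pw@localhost/app",  # placeholder DSN
        "http://localhost:9000", "minio", "minio-secret",
    )
    print(props["uri"])
```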
How data arrives¶
The write path is asynchronous. POST /v1/collect returns 202 after publishing to Kafka; the ingestion worker consumes per-tenant topics, micro-batches by (tenant, dataset), and appends to Iceberg:
- Micro-batch flush: every 30 seconds from the first buffered record, or at 10,000 rows, whichever comes first.
- Exactly-once into Iceberg: a Postgres dedup gate (processed_event_ids) ensures a re-delivered event never produces a duplicate row. See the architecture overview.
- Dead-letter routing: events that fail validation or can't be resolved go to the tenant's events.dlq topic instead of corrupting the table.
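The flush policy and the dedup gate can be sketched together in a few lines. This is an illustration under stated assumptions, not the worker's code: MicroBatcher and its append callback are hypothetical names, an in-memory set stands in for the processed_event_ids table, and the sketch ignores the transactional coordination a real exactly-once path needs between the dedup record and the Iceberg commit.

```python
import time
from collections import defaultdict

FLUSH_SECONDS = 30.0   # from the text: 30 s after the first buffered record
FLUSH_ROWS = 10_000    # from the text: or at 10,000 rows


class MicroBatcher:
    """Buffers rows per (tenant, dataset); flushes on batch age or size."""

    def __init__(self, append, clock=time.monotonic,
                 max_age=FLUSH_SECONDS, max_rows=FLUSH_ROWS):
        self._append = append          # hypothetical sink: append(key, rows)
        self._clock = clock
        self._max_age = max_age
        self._max_rows = max_rows
        self._rows = defaultdict(list)
        self._first_at = {}            # key -> time of first buffered row
        self._seen = set()             # stand-in for processed_event_ids

    def add(self, tenant, dataset, event_id, row):
        if event_id in self._seen:     # re-delivered event: drop it
            return
        self._seen.add(event_id)
        key = (tenant, dataset)
        if not self._rows[key]:
            self._first_at[key] = self._clock()
        self._rows[key].append(row)
        if len(self._rows[key]) >= self._max_rows:
            self._flush(key)

    def tick(self):
        """Flush every batch whose first row is at least max_age old."""
        now = self._clock()
        due = [k for k, t in self._first_at.items() if now - t >= self._max_age]
        for key in due:
            self._flush(key)

    def _flush(self, key):
        rows = self._rows.pop(key, [])
        self._first_at.pop(key, None)
        if rows:
            self._append(key, rows)
```

The age is measured from the first buffered record and not from the last flush, so a quiet dataset does not hold a lone event longer than the flush interval once tick is called periodically.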
Querying: dataset preview¶
GET /v1/datasets/{slug}/preview?limit=N reads straight from the lake. The dataset row is first resolved under your tenant's RLS-scoped session, so a slug another tenant owns yields 404 and DuckDB never sees a foreign table. The API then runs a per-request DuckDB iceberg_scan over the table's current metadata in MinIO.
Tenant isolation precedes the query engine
Even when two tenants both have a dataset called orders, each preview resolves to that tenant's own pulse_{tenant_uuid} namespace. Isolation is enforced before the scan, not by it. See Multi-Tenancy and RLS.
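The resolve-then-scan order can be sketched as follows. The names here are hypothetical, a dictionary keyed by (tenant, slug) stands in for the RLS-scoped Postgres lookup, and the row cap is an invented value. The generated statement uses DuckDB's iceberg_scan table function with its allow_moved_paths option; the sketch only builds the SQL and does not connect to DuckDB or MinIO.

```python
from dataclasses import dataclass

MAX_PREVIEW_ROWS = 1000  # hypothetical cap, not taken from the API


class DatasetNotFound(LookupError):
    """Raised when the slug is not visible to the tenant (maps to 404)."""


@dataclass(frozen=True)
class Dataset:
    tenant_uuid: str
    slug: str
    metadata_location: str  # current metadata file in object storage


def resolve(visible: dict, tenant_uuid: str, slug: str) -> Dataset:
    """Stand-in for the RLS-scoped lookup: only the caller's own
    rows are reachable, so a foreign slug looks like a missing one."""
    ds = visible.get((tenant_uuid, slug))
    if ds is None:
        raise DatasetNotFound(slug)
    return ds


def preview_sql(ds: Dataset, limit: int) -> str:
    """Build the per-request scan; runs only after resolve() succeeded."""
    if type(limit) is not int or not 1 <= limit <= MAX_PREVIEW_ROWS:
        raise ValueError("limit out of range")
    path = ds.metadata_location.replace("'", "''")  # escape SQL quotes
    return (
        f"SELECT * FROM iceberg_scan('{path}', allow_moved_paths = false) "
        f"LIMIT {limit}"
    )
```

Because the path handed to the scan comes from the resolved dataset row and never from request input, a caller cannot steer the query engine toward another tenant's table.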
Schema evolution caveats (v0.1)¶
- A published schema's definition is immutable — evolving an event's shape means a new schema version, not an in-place edit.
- Iceberg supports column add/rename/reorder, but Pulse does not yet expose dataset-level schema evolution operations; treat v0.1 datasets as append-only against a fixed shape.
- Compaction/maintenance that rewrites manifests is not run in v0.1, which is why preview reads use allow_moved_paths = false.