A Field Guide · ~10 minute read

Thinking in Parquet

A primer on columnar storage — and the pipelines built around it.

Parquet exists because the physical layout of data on disk should match the access pattern of the workload. That sentence is the entire skill tree. Everything below is unpacking it.

§ 01

The pivot: row-major versus column-major

Consider a tiny table of events. Each row is a transaction; each column is an attribute. There are two fundamental ways to lay this out as bytes on disk: store one row at a time (all attributes of row 1, then all of row 2…) or store one column at a time (all values of id, then all values of user…).

This single choice — applied at the byte level — cascades into nearly every property that distinguishes analytical from transactional systems. The interactive below makes the consequence visible. Pick a query, then flip between row- and column-major storage and watch which cells actually need to be read.

Interactive 01
Storage layout vs query access

Toggle the storage mode and the query. The active cells are what the disk must read; faded cells are skipped. Notice how the same query reads dramatically different amounts depending on layout.

[Widget: the 5-row × 4-column logical table (id, user, country, amount) rendered as a contiguous storage tape, with a running bytes-read counter (e.g. 4 / 20 cells, 20%) for the selected layout and query.]

The most important case to internalize is the asymmetry. Analytical queries touch few columns across many rows — they thrive in column-major layout, where unneeded columns are simply never touched. Transactional queries (point lookups, single-record updates) touch many columns of one row — they thrive in row-major layout, where one row is one contiguous read. Parquet is built for the first kind. Postgres-style row storage is built for the second. Neither is "better"; they're optimized for opposite ends of the same axis.
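To make the byte-level picture concrete, here is a minimal sketch in plain Python (the table and values are invented for illustration): the same 5 × 4 table flattened both ways, and a count of how many cells a single-column aggregate must touch under each layout.

# The logical table: one tuple per transaction.
rows = [
    (1, "ana", "DE", 12.0),
    (2, "bo",  "US",  7.5),
    (3, "cy",  "US", 99.0),
    (4, "dee", "FR",  3.2),
    (5, "eli", "DE", 41.0),
]
columns = list(zip(*rows))  # column-major view: one tuple per column

# Row-major tape: all attributes of row 1, then row 2, ...
row_tape = [v for r in rows for v in r]
# Column-major tape: all ids, then all users, then countries, then amounts.
col_tape = [v for c in columns for v in c]

# SELECT SUM(amount): row-major must step across the whole 20-cell tape
# to pick out every 4th value; column-major reads 5 contiguous cells.
amounts_row_major = row_tape[3::4]     # strided over all 20 cells
amounts_col_major = col_tape[15:20]    # one contiguous slice
assert sum(amounts_row_major) == sum(amounts_col_major)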

§ 02

Four compounding wins

The column-major decision unlocks four advantages that stack multiplicatively. Each is a consequence of the same insight: bytes adjacent on disk are now homogeneous in type and meaning.

1. I/O locality

You only read the columns the query needs. On a 100-column table, a query touching one column fetches roughly 1/100th of the data. This is the dominant win. The other three amplify it.
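A quick way to feel this is column projection in pyarrow (a sketch; the file name and column are assumptions): the reader fetches only the byte ranges of the columns you name.

import pyarrow.parquet as pq

# Hypothetical file; any wide Parquet table behaves the same way.
everything = pq.read_table("events.parquet")                      # all columns
one_column = pq.read_table("events.parquet", columns=["amount"])  # just one

# In-memory sizes; the gap in bytes actually read from disk is similar.
print(everything.nbytes, one_column.nbytes)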

2. Compression

Compression algorithms (Snappy, ZSTD, gzip) work by finding repetition. A column of country codes has perhaps 200 distinct values; a column mixing int64, string, decimal, timestamp looks like noise to a compressor. Parquet routinely achieves 5–20× smaller files than CSV.
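A rough way to see the effect with pyarrow (a sketch; the exact ratios depend entirely on your data): write the same highly repetitive table as CSV and as Parquet with two codecs, then compare sizes on disk.

import os
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# A deliberately repetitive table: few distinct country codes, few amounts.
table = pa.table({
    "country": ["DE", "US", "US", "FR"] * 250_000,
    "amount":  [12.0, 7.5, 99.0, 3.2] * 250_000,
})

pacsv.write_csv(table, "events.csv")
pq.write_table(table, "events_snappy.parquet", compression="snappy")
pq.write_table(table, "events_zstd.parquet", compression="zstd")

for path in ("events.csv", "events_snappy.parquet", "events_zstd.parquet"):
    print(path, os.path.getsize(path), "bytes")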

3. Type-aware encodings

Before general compression, Parquet applies encodings that exploit a column's distribution. This is where the format goes beyond "just sort similar bytes together."

Interactive 02
Encoding explorer

Each encoding exploits a different property of a column. Pick one to see how it transforms the data, why it saves space, and what kind of column it suits.

[Widget: a source column shown next to its encoded form, with byte counts and percent saved; the baseline, raw storage, keeps each value as-is at 160 bytes and 0% saved.]

These encodings are typically stacked: a dictionary turns strings into integer indices, bit-packing shrinks those integers to the minimum bit-width that fits the dictionary range, and a general-purpose compressor squeezes whatever's left. A column with 200 distinct values across a billion rows can end up at roughly one bit per row.
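The arithmetic behind that last claim, as a back-of-the-envelope sketch (the row count, string length, and clustering assumption are all invented):

rows, distinct, avg_len = 1_000_000_000, 200, 10   # 1B rows, 200 ten-byte strings

raw_bytes  = rows * avg_len                        # plain strings: ~10 GB
dict_bytes = distinct * avg_len                    # dictionary page: 2 KB
index_bits = (distinct - 1).bit_length()           # 8 bits are enough for 200 values
packed     = dict_bytes + rows * index_bits // 8   # bit-packed indices: ~1 GB

# Run-length encoding on clustered data plus a general compressor (ZSTD)
# is what pushes ~8 bits/row toward ~1 bit/row, i.e. ~125 MB for 1B rows.
print(f"raw ≈ {raw_bytes / 1e9:.1f} GB, dict + packed ≈ {packed / 1e9:.1f} GB")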

4. Predicate pushdown

Parquet stores min/max statistics for every column at every level of the file (page, column chunk, row group). Before reading data, the query engine consults these statistics and skips entire chunks that can't possibly match the filter. This is shown in §03.

§ 03

The file's anatomy

Parquet isn't "columns on disk" — it's a layered container designed so each level enables a different optimization.

File
├─ Row Group 1 (~128 MB — the unit of parallelism)
│  ├─ Column Chunk: id → Page 1, Page 2, …
│  ├─ Column Chunk: user → Page 1, Page 2, …
│  └─ Column Chunk: amount → Page 1, Page 2, …
├─ Row Group 2
└─ Footer (schema + statistics + offsets — read first)

The footer is the file's table of contents. Readers seek to the end first, parse the schema and statistics, then make targeted reads. This is why Parquet is self-describing: no sidecar schema file, no guessing types.
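You can see the footer-first behavior directly with pyarrow (a sketch; events.parquet is a stand-in for any file). Everything printed below comes from the footer alone, before a single data page is read:

import pyarrow.parquet as pq

f = pq.ParquetFile("events.parquet")
print(f.schema_arrow)                  # full schema: self-describing, no sidecar
print(f.metadata.num_rows, f.metadata.num_row_groups)
print(f.metadata.row_group(0).column(0).statistics)   # min, max, null count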

The statistics in the footer are where the magic of predicate pushdown lives. For each row group and page, Parquet stores the min and max of every column. A query with a filter consults these stats before reading anything else — and skips whatever can't possibly match.

Interactive 03
Predicate pushdown simulator

A 6-row-group Parquet file, partitioned by event date. Drag the threshold; row groups whose max date falls below it are skipped entirely — no decompression, no read. Watch how much data the engine doesn't touch.

[Widget: the query SELECT * FROM events WHERE event_date > DATE '2025-04-15', a draggable date threshold (default 2025-04-15), and a readout of how much of the file is scanned (50% at the default).]

The effectiveness of pushdown depends on clustering: if a column is sorted (or files are partitioned by it), min/max stats tightly bound each chunk and most queries skip most of the file. If a column is randomly distributed, the min and max of every chunk span the full range and pushdown does nothing. This is why partitioning by date is so common — date filters become near-free.
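In code, pushdown usually just means handing the filter to the reader; engines like Trino, Spark and DuckDB do it automatically from the SQL WHERE clause. A pyarrow sketch (file and column names are assumptions):

import datetime
import pyarrow.parquet as pq

# The reader consults row-group min/max statistics first and skips any
# row group whose event_date range cannot satisfy the predicate.
recent = pq.read_table(
    "events.parquet",
    filters=[("event_date", ">", datetime.date(2025, 4, 15))],
)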

§ 04

Where Parquet breaks

Understanding failure modes is how mental models prove their worth. Parquet has clear ones.

Failure mode · Why · What to use instead
Single-row reads · Must decompress at least one page per column to retrieve one row · Postgres, key-value DB
Updates & deletes · Files are immutable; in-place edits impossible · Iceberg, Delta, Hudi atop Parquet
Streaming ingest · Row groups are written as complete units; tiny batches → terrible compression · Avro for the wire, batch into Parquet
Small files · Many tiny Parquet files destroy throughput (footer overhead, metadata stat()s, no compression locality) · Compaction jobs targeting 128 MB–1 GB
Schema evolution · Limited compared to Avro; renames are awkward · Iceberg manages this above Parquet

The deepest of these is immutability. It's why the modern stack split into two layers: Parquet is the storage substrate, and Iceberg / Delta / Hudi are the transactional metadata above it. The metadata tracks which Parquet files constitute the current table, what their schema is, and a transaction log of changes. This is the same trick LSM trees use, that Git uses, that functional persistent data structures use: immutable storage + a thin mutable metadata layer = mutability with the read-time performance of immutability.
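A toy illustration of that layering (not Iceberg's or Delta's actual metadata format, just the shape of the idea): the Parquet files never change; a "write" appends a new snapshot that points at a different set of files, and readers always see the file list of exactly one snapshot.

# Each snapshot is an immutable list of immutable files; only the log grows.
snapshots = [
    {"id": 1, "files": ["part-000.parquet", "part-001.parquet"]},
]

def rewrite_file(old_file: str, new_file: str) -> None:
    """'Update' or 'delete' = write a new file elsewhere, then publish a
    new snapshot that references it instead of the old one."""
    current = snapshots[-1]
    files = [new_file if f == old_file else f for f in current["files"]]
    snapshots.append({"id": current["id"] + 1, "files": files})

rewrite_file("part-001.parquet", "part-001-rewritten.parquet")
print(snapshots[-1])   # the "current table" is whatever the latest snapshot says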

§ 05

The pipeline around Parquet

Parquet rarely stands alone. It sits inside a pattern that recurs across data systems: match the format to the access pattern at each layer. A useful mnemonic:

Avro on the wire. Parquet on disk. Arrow in RAM.

Each format is row- or column-oriented to match the dominant access pattern at its layer. Streaming systems pass single records — row-oriented Avro is right. Analytical scans hit many rows across few columns — columnar Parquet is right. In-process compute needs zero-copy interchange between engines — columnar Arrow is right.
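The RAM and disk halves of the mnemonic are two views of the same pyarrow objects (a sketch; the wire half would be an Avro record stream and is only noted in a comment):

import pyarrow as pa
import pyarrow.parquet as pq

# Arrow in RAM: a columnar table that DuckDB, Polars, pandas, etc. can
# exchange without re-serializing. (On the wire, these two records would
# have arrived as row-oriented Avro messages.)
in_memory = pa.table({"user": ["ana", "bo"], "amount": [12.0, 7.5]})

# Parquet on disk: the same columns, now encoded and compressed for scans.
pq.write_table(in_memory, "events.parquet")
back = pq.read_table("events.parquet", columns=["amount"])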

The standard streaming-to-analytics pipeline composes these:

Producers → Kafka (durable log, Avro records) → Stream processor
        ↓  compaction every 5–60 min
Parquet files (128 MB–1 GB) on object storage
        ↓
Iceberg / Delta metadata layer
        ↓
Query engines (Trino, Spark, DuckDB, Flink)

Why two stages? Because writing each event straight to its own Parquet file would be a disaster: 100,000 events/sec becomes 100,000 tiny files/sec, each with footer overhead larger than its payload and compression dictionaries built on samples of size one. So the pipeline switches formats at the boundary: you write Avro into Kafka because Kafka is built for the streaming access pattern, and you compact into Parquet because Parquet is built for the analytical access pattern.
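A compaction loop, reduced to its skeleton (names and the row threshold are illustrative, and a real job would also flush on a timer and write to object storage):

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_ROWS = 5_000_000        # stand-in for a ~128 MB output file
buffer: list[dict] = []
file_index = 0

def ingest(record: dict) -> None:
    """Called once per event pulled off the stream (e.g. a Kafka consumer)."""
    global file_index
    buffer.append(record)
    if len(buffer) >= TARGET_ROWS:
        pq.write_table(pa.Table.from_pylist(buffer),
                       f"events-{file_index:05d}.parquet")
        buffer.clear()
        file_index += 1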

A brief history · Lambda & Kappa

You'll hear two architecture names. Lambda (Nathan Marz, 2011) ran two pipelines: an accurate-but-slow batch pipeline and a fast-but-approximate streaming pipeline, with results merged at query time. The shape resembled the Greek letter λ; the name also nodded to functional programming's commitment to pure recomputation from immutable data.

Kappa (Jay Kreps, 2014) argued the split was unnecessary: a sufficiently durable log plus a sufficiently capable stream processor lets "batch" become "replay the log from the beginning." One codebase, two execution modes.

Most production systems in 2026 are hybrid: a Kafka log as system of record (Kappa-style code reuse), but with Parquet/Iceberg as a cold-storage projection optimized for analytics (Lambda-style separation). The terminology survives because the underlying tradeoff — one pipeline or two — is still a live design question.

§ 06

Scale tiers in 2026

No single architecture covers all scales. The conventional patterns stratify by data volume, freshness requirements, and team size. The single most common mistake is adopting an architecture two tiers above where you actually are — paying operational cost for problems you don't have.

Interactive 04
Pipeline architectures by scale

Pick a tier. The diagram and stats update to reflect the conventional stack at that scale. Read the descriptions to feel how complexity compounds.

The transitions between tiers matter more than the tiers themselves. Tier 0 → 1 is triggered by multiple people needing to query the same data, or by needing fresher data than ad-hoc runs provide. Tier 1 → 2 is triggered by warehouse costs growing faster than the business, or genuine real-time requirements, or ML workloads that need open compute. Tier 2 → 3 happens when no single storage system can serve every workload, and you accept federated specialized systems with the lakehouse as system of record.

The architectural distinction worth carrying forward: warehouse-centric (Tier 1) keeps storage and compute coupled in a vendor's system; lakehouse-centric (Tier 2+) decouples them via open formats (Parquet + Iceberg) on object storage, with engine choice as a runtime decision. The 2024–2026 stretch saw the lakehouse pattern win as the strategic default for new platforms above ~10TB, with Iceberg emerging as the dominant table format after Databricks' acquisition of Tabular consolidated the space.

§ 07

The transferable principle

The deepest reason to study Parquet is that it's a clean instance of a principle that appears everywhere in systems design:

Physical layout should match access pattern. Heterogeneity at the granularity of access is the enemy of performance.

Once you see this, you see it everywhere:

Domain · Heterogeneous layout (slow) · Homogeneous layout (fast)
CPU cache · Array of structs · Struct of arrays (SoA)
GPU memory · Threads read scattered fields · Coalesced columnar access
Databases · Full table scan · Covering index (columnar projection)
ML training · Random-access filesystem · TFRecord / WebDataset (sequential)
Time series · General-purpose storage · ClickHouse / Druid (time-clustered)
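The CPU-cache row is the easiest to try at home. A numpy sketch (sizes and field names are invented), using a structured array as the array-of-structs and a plain array as the struct-of-arrays:

import numpy as np

n = 1_000_000
# Array of structs: four 8-byte fields interleaved, 32 bytes per record.
aos = np.zeros(n, dtype=[("id", "i8"), ("user", "i8"),
                         ("country", "i8"), ("amount", "f8")])
# Struct of arrays: the amount column alone, one contiguous 8 MB buffer.
soa_amount = np.zeros(n, dtype="f8")

# Summing amount from the AoS walks a strided view, dragging the other
# three fields through the cache; the SoA sum streams straight through.
total_aos = aos["amount"].sum()
total_soa = soa_amount.sum()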

When you encounter a new format or storage system, the five questions to ask before reading the docs:

  1. What is the unit of write? (record, batch, file)
  2. What is the unit of read? (row, column, range)
  3. What is the unit of mutation? (cell, row, file, snapshot)
  4. What is immutable, and what's the mutable layer that compensates?
  5. Where does it sit in the hot-path → cold-path pipeline?

Five answers, eighty percent of what matters. Parquet is one expression of this thinking at one layer — durable, immutable, analytical storage. The thinking is what transfers.