A cognitive tour of the in-process analytics database that broke a decade-old assumption — and what it means for the way you build.
Database paradigms divide along two orthogonal axes: workload type (transactional vs analytical) and process topology (in-process vs networked). Each cell in the resulting 2×2 was filled by mature, well-known systems — except one. DuckDB occupies that empty quadrant, and almost everything about its design and value follows from that fact.
Click a quadrant below to see what lives there, and why DuckDB's position is structural rather than accidental.
SQLite. Embedded transactional. The world's most-deployed database.
Postgres, MySQL. The default for application backends. Networked, row-oriented.
DuckDB. In-process columnar analytics. The structurally empty quadrant.
Snowflake, BigQuery, ClickHouse. Networked warehouses. Built for scale, paid for in dollars.
Transferable move. When any "X-for-Y" technology appears, ask which conventional axis it is collapsing. That's almost always where the value is. The empty quadrant of a 2×2 is more interesting than a fully populated one.
DuckDB is an in-process, column-oriented SQL OLAP database, written in dependency-free C++17, with first-class bindings for nearly every language, a single-file storage format, and a WebAssembly build that runs in any modern browser.
It was started by Mark Raasveldt and Hannes Mühleisen at CWI Amsterdam in 2019, is governed by the non-profit DuckDB Foundation, and is commercially supported by DuckDB Labs — which has refused venture capital on the grounds that it would distort the project's direction. A separate company, MotherDuck, builds a cloud platform on top of DuckDB and has raised conventionally.
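To make "in-process" and "single-file" concrete, here is a minimal sketch using the Python bindings; the file name is a placeholder, and the table is invented for illustration.

```python
# A minimal sketch: the engine is a library import, and the entire database
# lives in one ordinary file on disk ("analytics.duckdb" is a placeholder).
import duckdb

con = duckdb.connect("analytics.duckdb")  # no server process, just a file

con.execute("CREATE TABLE IF NOT EXISTS events (ts TIMESTAMP, user_id BIGINT, amount DOUBLE)")
con.execute("INSERT INTO events VALUES (now()::TIMESTAMP, 1, 9.99)")

# Results land directly in the host process; there is no socket in between.
print(con.execute("SELECT count(*), sum(amount) FROM events").fetchall())
con.close()
```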
The single most important physical fact about DuckDB is that it stores data by column, not by row. For analytical queries — which typically touch a few columns across many rows — this changes how much data must be read from disk by an order of magnitude or more.
The visualization below makes the difference visible. Pick a query, then toggle between row-stored and column-stored layout. Watch which bytes get read.
The lesson generalizes: data layout should match dominant access pattern. Row stores favor "give me the whole record for one user." Column stores favor "give me one attribute across all users." OLTP is the first; OLAP is the second.
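A quick sketch of the OLAP shape of the problem, assuming a hypothetical sales.parquet with many columns: because both Parquet and DuckDB's engine are columnar, a query like this reads only the region, amount, and year columns and never touches the rest of the file.

```python
# Illustrative only: an analytical query over a wide Parquet file.
# Only the columns the query references are read from disk.
import duckdb

result = duckdb.sql("""
    SELECT region, avg(amount) AS avg_amount
    FROM read_parquet('sales.parquet')   -- hypothetical wide file
    WHERE year = 2024
    GROUP BY region
    ORDER BY avg_amount DESC
""").df()
print(result)
```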
Columnar storage is the headline, but four other engineering choices compound it. Each carries a transferable principle.
Operators process batches of ~1024 values, not one tuple. Amortizes interpretation cost, exploits SIMD, stays in L1 cache.
Batch size is a tuning knob everywhere in computing — between latency and throughput, between cache lines and main memory.
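This is not DuckDB's internals, but the principle is easy to feel from any interpreted environment: paying dispatch overhead once per batch of roughly a thousand values, instead of once per value, is the same trade the vectorized operators make.

```python
# Not DuckDB internals; just the batching principle. The per-value loop pays
# interpreter overhead on every element, the batched loop pays it once per
# 1024-value slice and lets the array library do vectorized work underneath.
import time
import numpy as np

values = np.random.rand(1_000_000)

t0 = time.perf_counter()
total = 0.0
for v in values:                        # tuple-at-a-time: overhead on every value
    total += v * 2.0
t1 = time.perf_counter()

batched = 0.0
for start in range(0, len(values), 1024):
    batched += (values[start:start + 1024] * 2.0).sum()  # overhead once per batch
t2 = time.perf_counter()

print(f"per-value: {t1 - t0:.3f}s  batched: {t2 - t1:.3f}s")
```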
Builds with a C++17 compiler and nothing else. This is what makes embedding everywhere — including WebAssembly — actually possible.
Portability is a function of dependency closure, not source code purity.
Zero-copy exchange with pandas, Polars, R, NumPy. In heterogeneous pipelines, the bottleneck is rarely compute — it's marshalling.
In mixed-language stacks, serialization is the hidden cost. Eliminate it and the system feels different.
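A sketch of what that looks like with pandas: DuckDB can reference a DataFrame that already lives in the process by its variable name (a replacement scan) and hand results straight back as a DataFrame. The data and column names here are invented; exchange is essentially copy-free for Arrow-backed data and avoids any serialization hop either way.

```python
# Query an in-memory pandas DataFrame with SQL, no server, no export step.
import duckdb
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "b", "a", "c"],
    "amount": [10.0, 25.0, 5.0, 40.0],
})

result = duckdb.sql("""
    SELECT customer, sum(amount) AS total
    FROM orders            -- the in-memory DataFrame, found by name
    GROUP BY customer
    ORDER BY total DESC
""").df()
print(result)
```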
The SQL parser descends from Postgres (via libpg_query), so familiarity cost is near-zero for Postgres-literate engineers. Plus ergonomic extensions: EXCLUDE, COLUMNS(*), QUALIFY.
Adoption cost is dominated by mental-model carryover, not feature count.
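A sketch of those three extensions against a small invented table: EXCLUDE drops columns from a star-expansion, COLUMNS(*) applies one expression across many columns, and QUALIFY filters on a window function without a subquery.

```python
# Ergonomic dialect extensions, shown on a throwaway in-memory table.
import duckdb

duckdb.sql("""
    CREATE TABLE metrics AS
    SELECT * FROM (VALUES
        ('a', 1.0, 2.0, 10),
        ('a', 3.0, 4.0, 20),
        ('b', 5.0, 6.0, 30)
    ) t(name, x, y, ts)
""")

# All columns except ts
duckdb.sql("SELECT * EXCLUDE (ts) FROM metrics").show()

# max() applied to every column except name in one expression
duckdb.sql("SELECT max(COLUMNS(* EXCLUDE (name))) FROM metrics").show()

# Latest row per name, filtered on a window result without a subquery
duckdb.sql("""
    SELECT * FROM metrics
    QUALIFY row_number() OVER (PARTITION BY name ORDER BY ts DESC) = 1
""").show()
```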
Modern NVMe drives push 7+ GB/s; consumer machines have hundreds of GB of RAM. A single node is more capable than 2015's data center.
Many architectural assumptions encode hardware that no longer exists. Re-examine them every five years.
Because DuckDB is in-process and dependency-free, it can run in many places that a traditional analytical database cannot. The architectural question stops being "do I provision a warehouse?" and becomes "where in this system should the analytical engine live?" Below are four canonical placements.
The deep observation: DuckDB is deployment-neutral. The same SQL runs on a laptop, in a browser tab, inside a serverless function, on a fat EC2 instance, against a 50-TB Parquet lake. The query is the contract; where it runs is configurable per workload. Most database choices lock you into a topology. DuckDB does not.
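A sketch of that neutrality, with placeholder paths and column names: the same query text runs over a local Parquet file on a laptop and over a Parquet lake in object storage (S3 credentials and bucket layout are omitted and assumed configured).

```python
# One query, two deployments: local file vs. object storage via httpfs.
import duckdb

def monthly_revenue(source):
    return duckdb.sql(f"""
        SELECT date_trunc('month', order_ts) AS month, sum(amount) AS revenue
        FROM read_parquet('{source}')
        GROUP BY month
        ORDER BY month
    """)

# Laptop: a local file
print(monthly_revenue("orders.parquet").df())

# Server, serverless function, or fat EC2 box: the same SQL over S3
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
print(monthly_revenue("s3://my-bucket/orders/*.parquet").df())
```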
The honest question isn't "is DuckDB cool?" — it is "do I have a workload that my current stack handles badly, which DuckDB handles natively?" Five diagnostics, each worth one signal. Check the ones that apply.
DuckDB lacks per-row OLTP optimizations and (until Quack stabilizes) multi-writer concurrency. Don't put user accounts, sessions, or active business records here. Postgres or MySQL remains correct.
If a dashboard must reflect a write made five seconds ago, batched Parquet exports won't cut it. Use logical replication into DuckDB, query the OLTP database directly, or accept the lag — depending on tolerance.
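One way the "query the OLTP database directly" option can look, using DuckDB's postgres extension; the connection string and the orders table are placeholders. Freshness is then bounded by Postgres itself rather than by an export schedule.

```python
# Attach a live Postgres database read-only and query it from DuckDB.
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Hypothetical connection string; READ_ONLY keeps the OLTP side safe.
con.execute("ATTACH 'dbname=app host=localhost user=readonly' AS pg (TYPE postgres, READ_ONLY)")

fresh = con.sql("""
    SELECT status, count(*) AS n
    FROM pg.public.orders
    WHERE updated_at > now() - INTERVAL '5 minutes'
    GROUP BY status
""").df()
print(fresh)
```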
Multiple processes cannot write to one DuckDB file: the engine enforces a single writer per file and refuses a second. The fix is architectural: single-writer-per-file, or wait for the Quack client–server protocol to mature.
Snowflake, BigQuery, Databricks, and ClickHouse remain dominant at tens of terabytes with hundreds of concurrent analysts. DuckDB scales up, not out; past a certain size, distributed wins.
If your "analytics" is one COUNT(*) per page and Postgres handles it in 5ms, adding infrastructure adds risk without return. New tech earns its place; doesn't inherit it.
A 50 MB WASM blob over a slow connection is a UX failure. Mitigations: brotli, lazy load, pre-warm. DuckDB-WASM is also largely single-threaded today unless cross-origin isolation is configured.
Each item below is flagged speculative — these are bets on direction, not statements of fact. Track them through 2026–2027.
Announced May 2026, the HTTP-based Quack protocol lets multiple DuckDB instances communicate — including browser DuckDB-WASM talking to a server DuckDB. If it gains traction, DuckDB becomes the first analytical database that doesn't force a choice between in-process and networked. You'd choose per deployment, possibly per query.
Hardware curves favor single-node analytics. For organizations whose working set is under ~10 TB, the case for a multi-node warehouse is eroding. Teams report cutting Snowflake costs by ~79% using DuckDB-based caching. DuckDB is the technical embodiment of that thesis.
When the analytical engine ships to the user's device — searchable, offline, private — products become qualitatively different. Linear and Obsidian point at this in OLTP-shaped tools; DuckDB-WASM extends it to data-heavy ones.
Frontier LLMs now hit ~95% accuracy on text-to-SQL with schema-only context. MotherDuck's MCP server lets agents query data conversationally. The pedagogical implication is counter-intuitive: SQL literacy is becoming more valuable, not less, because it's the stable interface between humans, models, and data.
The pg_duckdb extension puts DuckDB's columnar engine inside Postgres. The endgame may be that "what database do I use" stops mattering — the row engine and the columnar engine coexist behind one SQL dialect.
When evaluating any data system, separate workload (read/write ratio, batch size, latency) from topology (where compute can live). DuckDB matters because it decouples them. Most tools force-couple, and you inherit the coupling.
The deep idea behind in-process analytics, embedded ML, edge functions, local-first apps. The network is the slowest, costliest, riskiest layer — minimize crossings. This is the unifying thread of most interesting infrastructure work right now.
SQL survived hierarchical DBs, object DBs, NoSQL, NewSQL, the big-data era, and now the LLM era. Tools that take SQL seriously as a portable contract — DuckDB, dbt, Postgres extensions — accrete value with time. Bet on stable interfaces.