Design and benchmark a DuckDB-based analytical layer for genotype data, complementing the existing LMDB canonical storage. The goal is to evaluate whether a columnar relational backend can improve QTL analysis, correlation queries, and ad-hoc exploration without sacrificing the integrity guarantees provided by LMDB's versioned, hash-chained store.
This work is driven by the three-layer architecture proposed in the DuckDB design document.
LMDB excels at keyed matrix storage, versioning, and cryptographic verification, but it is not optimized for analytical queries such as range scans by chromosome and position, per-marker aggregations, or joins against phenotype tables.
DuckDB is an in-process, columnar OLAP engine that can query Parquet files directly with full SQL support.
Compare DuckDB + Parquet against LMDB for the following access patterns:
| Workload | LMDB approach | DuckDB approach | Success criterion |
|----------|---------------|-----------------|-------------------|
| Random marker lookup by ID | key → matrix row | SQL SELECT on indexed marker_id | comparable latency |
| Range scan by chromosome & Mb | reconstruct full matrix, filter in Python | SQL WHERE chr = … AND Mb BETWEEN … | DuckDB faster |
| Compute mean dosage per marker | reconstruct, NumPy mean | SQL GROUP BY + AVG | DuckDB faster |
| Join with phenotype table | export to CSV/DataFrame, merge | SQL JOIN on sample_id | DuckDB significantly faster |
| Export slice for R/qtl2 | reconstruct matrix, write .csv | COPY (SELECT …) TO 'slice.parquet' | DuckDB faster |
| Storage size | LMDB delta-compressed | Parquet + SNAPPY/ZSTD | comparable or smaller |
| Ingestion time | .geno → LMDB | .geno → Parquet → DuckDB view | acceptable overhead |
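For the latency-oriented rows of this table, a small timing harness is enough to get comparable numbers from both backends. The sketch below is illustrative; the `bench` helper and its parameters are not part of the existing codebase:

```python
import statistics
import time

def bench(fn, repeats=5):
    """Run fn repeatedly and return the median wall-clock latency in ms.

    Median rather than mean, so a single cold-cache outlier does not
    dominate the comparison between the LMDB and DuckDB code paths.
    """
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Usage: pass the same logical workload wrapped for each backend, e.g.
#   bench(lambda: lmdb_lookup("rs3668922"))
#   bench(lambda: con.execute(query).fetchall())
median_ms = bench(lambda: sum(range(10_000)))
```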
This normalized layout is closest to how high-performance QTL systems work. Marker metadata is stored once, eliminating the redundancy that would bloat a long-format table in which every genotype row repeats chr, cM, and Mb. DuckDB's columnar engine handles joins efficiently, so normalization does not hurt query performance while improving both storage density and cache locality.
```sql
CREATE TABLE markers (
    marker_id TEXT PRIMARY KEY,
    chr       TEXT,
    cM        DOUBLE,
    Mb        DOUBLE
);

CREATE TABLE genotypes (
    dataset_id TEXT,
    marker_id  TEXT REFERENCES markers(marker_id),
    sample_id  TEXT,
    genotype   INTEGER,  -- encoded: 0=B, 1=D, 2=H, etc.
    allele_raw TEXT      -- optional: original symbol for traceability
);
```
The alternative is a single denormalized table. This is useful when the priority is simplicity (e.g., a one-shot export to an external tool that expects a flat CSV), but it duplicates marker metadata on every row and therefore consumes more space.
```sql
CREATE TABLE genotypes_long (
    dataset_id TEXT,
    chr        TEXT,
    marker_id  TEXT,
    cM         DOUBLE,
    Mb         DOUBLE,
    sample_id  TEXT,
    genotype   INTEGER,
    allele_raw TEXT
);
```
Avoid the wide format (one column per strain): it requires a DDL change every time a sample is added to a panel.
Step 1 — Parse raw .geno (same as LMDB importer):
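As a sketch of what this parsing step could look like, assuming a GeneNetwork-style tab-separated .geno layout ('@'/'#' metadata and comment lines, then a header of Chr, Locus, cM, Mb followed by one column per sample; your files may differ in detail):

```python
import io

def parse_geno(text):
    """Parse an assumed .geno layout into (sample_names, marker_rows)."""
    samples = None
    rows = []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line or line.startswith(("@", "#")):
            continue  # skip metadata and comment lines
        fields = line.split("\t")
        if samples is None:
            samples = fields[4:]  # columns after Chr, Locus, cM, Mb
            continue
        chr_, locus, cm, mb = fields[:4]
        rows.append({
            "chr": chr_, "marker_id": locus,
            "cM": float(cm), "Mb": float(mb),
            # one call per sample, keyed by sample name
            "calls": dict(zip(samples, fields[4:])),
        })
    return samples, rows

example = (
    "@type:riset\n"
    "Chr\tLocus\tcM\tMb\tBXD1\tBXD2\n"
    "1\trs3668922\t1.0\t3.48\tB\tD\n"
)
samples, rows = parse_geno(example)
```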
Step 2 — Light cleaning:
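One possible cleaning pass, using the integer encoding from the genotype schema (0=B, 1=D, 2=H) and mapping unknown symbols such as 'U', '?', or empty cells to NULL. The exact rules are a sketch, not a fixed specification:

```python
# Hypothetical cleaning rules: normalize whitespace and case, encode
# known allele symbols as integers, treat everything else as NULL.
ENCODING = {"B": 0, "D": 1, "H": 2}

def clean_call(raw):
    """Return (genotype, allele_raw) for one cell of the .geno matrix."""
    sym = raw.strip().upper()
    return ENCODING.get(sym), (sym or None)  # unknown symbol -> (None, sym)
```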
Step 3 — Split into normalized tables:
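The split itself is a straightforward reshaping: one markers row per marker, and one genotypes row per (marker, sample) cell. A sketch over parsed rows of the shape produced in Step 1 (the sample data and dataset_id here are made up):

```python
# Toy input mirroring the parsed .geno rows from Step 1.
parsed = [
    {"chr": "1", "marker_id": "rs3668922", "cM": 1.0, "Mb": 3.48,
     "calls": {"BXD1": "B", "BXD2": "D"}},
]
encode = {"B": 0, "D": 1, "H": 2}

# markers: metadata stored exactly once per marker.
markers = [
    {"marker_id": r["marker_id"], "chr": r["chr"], "cM": r["cM"], "Mb": r["Mb"]}
    for r in parsed
]
# genotypes: long format, one row per (marker, sample) cell.
genotypes = [
    {"dataset_id": "BXD", "marker_id": r["marker_id"], "sample_id": s,
     "genotype": encode.get(sym), "allele_raw": sym}
    for r in parsed
    for s, sym in r["calls"].items()
]
```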
Step 4 — Export to Parquet:
```python
import duckdb

con = duckdb.connect()
# Convert each cleaned CSV to ZSTD-compressed Parquet in a single pass.
con.execute("""
    COPY (SELECT * FROM read_csv_auto('markers.csv'))
    TO 'markers.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD')
""")
con.execute("""
    COPY (SELECT * FROM read_csv_auto('genotypes.csv'))
    TO 'genotypes.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD')
""")
```
Step 5 — Create DuckDB views:
```sql
CREATE VIEW markers_latest AS
    SELECT * FROM read_parquet('markers.parquet');

CREATE VIEW genotypes_latest AS
    SELECT * FROM read_parquet('genotypes.parquet');
```
Step 6 — Analysis example (range scan + aggregation):
```sql
SELECT g.marker_id, m.chr, m.Mb, AVG(g.genotype) AS mean_dosage
FROM genotypes_latest g
JOIN markers_latest m USING (marker_id)
WHERE m.chr = '1' AND m.Mb BETWEEN 50 AND 60
GROUP BY g.marker_id, m.chr, m.Mb;
```
LMDB remains the canonical source of truth. The DuckDB layer is a one-way, idempotent export:
```
.geno ──► LMDB (canonical, versioned)
              │
              └── reconstruct version N ──► split markers + genotypes ──► Parquet
                                                                             │
                                                                             ▼
                                                                 DuckDB analytical views
```
When a dataset is updated in LMDB, the export pipeline is re-run to regenerate the Parquet files. DuckDB never writes back to LMDB.
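Idempotent regeneration is simplest to guarantee if each Parquet file is replaced atomically, so concurrent readers never observe a partially written file. A stdlib-only sketch (the `write_fn` callback that produces the file is hypothetical):

```python
import os
import tempfile

def atomic_export(path, write_fn):
    """Re-export sketch: write to a temp file, then atomically replace.

    Re-running the export overwrites the previous file cleanly, which is
    what makes the LMDB -> Parquet pipeline safe to repeat on updates.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp)          # hypothetical callback: writes the data to tmp
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)         # never leave half-written temp files behind
        raise
```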
1. Should we maintain a persistent DuckDB database file, or rely on on-the-fly Parquet scanning?
2. Can we attach LMDB directly via a DuckDB custom scanner, avoiding the intermediate Parquet step?
3. How does DuckDB perform when multiple panels are queried together (e.g., UNION ALL across datasets)?
4. What is the memory footprint for large panels (>10k markers × >1k samples) in DuckDB versus LMDB reconstruction?