
Implementation of storing genotypes in LMDB

Tags

Description

Implement LMDB-based storage for genotype matrices to support fast, scalable access to large genotype datasets.

The system should support efficient storage and retrieval of genotype data along with metadata (samples, markers, positions, chromosomes). It should also include versioning of datasets and cryptographic verification to ensure data integrity across updates.

TL;DR (Quick Reference)

For quick setup and usage:

# Enter environment

guix shell python-wrapper python-click python-lmdb python-numpy

# Import dataset

python lmdb_matrix.py import-genotype file.geno ./lmdb_store

# Force overwrite

python lmdb_matrix.py import-genotype file.geno ./lmdb_store --force

# Update dataset (new version)

python lmdb_matrix.py update-genotype DATASET file.geno ./lmdb_store

# List datasets

python lmdb_matrix.py list-datasets ./lmdb_store

# List versions

python lmdb_matrix.py list-versions DATASET ./lmdb_store

# Verify integrity

python lmdb_matrix.py verify DATASET ./lmdb_store

# Reconstruct latest

python lmdb_matrix.py reconstruct DATASET ./lmdb_store

# Reconstruct specific version

python lmdb_matrix.py reconstruct DATASET ./lmdb_store --version 2

For detailed explanations, examples, and expected output, see the sections below.

Prerequisites

You need Guix installed.

Quick Start

1. Launch Guix Shell with Dependencies

guix shell python-wrapper python-click python-lmdb python-numpy python-pytest

2. Container (Isolated)

guix shell --container \
  --manifest=manifest.scm \
  --network \
  --share=$HOME/genotyping \
  -- \
  bash

CLI Tool Usage

Importing a Genotype File

#### Basic import

guix shell python-wrapper python-click python-lmdb python-numpy -- \
  python lmdb_matrix.py import-genotype \
  ~/genotype_files/genotype/BXD.geno \
  ./lmdb_store

#### With custom dataset ID and metadata

guix shell python-wrapper python-click python-lmdb python-numpy -- \
  python lmdb_matrix.py import-genotype \
  ~/genotype_files/genotype/BXD.geno \
  ./lmdb_store \
  --dataset-id "BXD_2018" \
  --author "Arthur Centeno" \
  --reason "Initial import from GeneNetwork"

#### Expected output

Parsing BXD.geno...
Dataset: BXD
Type: riset
Founders: ['B', 'D']
Matrix: 7343 markers x 198 samples

✓ Stored as version 1
Dataset ID: BXD_2018
Hash: a3f7b2c8d9e1f456...
Storage type: full
Timestamp: 2024-01-15T10:30:00Z
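The parsing step reported above can be illustrated with a minimal sketch. It assumes the common .geno layout: `#` comment lines, `@key:value` metadata lines, then a tab-separated header (`Chr`, `Locus`, `cM`, `Mb`, followed by sample names) and one row per marker. The function name `parse_geno` and the toy data are hypothetical, and the real parser in lmdb_matrix.py may handle more cases (for example, files without an `Mb` column).

```python
import io

def parse_geno(stream):
    """Parse a .geno-style text stream into metadata, samples, markers, and a matrix."""
    meta, samples, markers, matrix = {}, [], [], []
    header_seen = False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("@"):
            # Metadata such as @name:BXD or @type:riset
            key, _, value = line[1:].partition(":")
            meta[key] = value
            continue
        fields = line.split("\t")
        if not header_seen:
            # Assumed header layout: Chr, Locus, cM, Mb, then sample names
            samples = fields[4:]
            header_seen = True
            continue
        chrom, locus, cm, mb = fields[:4]
        markers.append({"chr": chrom, "locus": locus, "cM": float(cm), "Mb": float(mb)})
        matrix.append(fields[4:])
    return meta, samples, markers, matrix

# Toy input for illustration only, not real BXD data
example = io.StringIO(
    "# toy file\n"
    "@name:TOY\n@type:riset\n@mat:B\n@pat:D\n"
    "Chr\tLocus\tcM\tMb\tS1\tS2\tS3\n"
    "1\trs1\t1.5\t3.0\tB\tD\tH\n"
    "1\trs2\t2.0\t4.1\tD\tD\tU\n"
)
meta, samples, markers, matrix = parse_geno(example)
print(f"Dataset: {meta['name']}")
print(f"Matrix: {len(markers)} markers x {len(samples)} samples")
```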

Importing an Existing Dataset

guix shell python-wrapper python-click python-lmdb python-numpy -- \
  python lmdb_matrix.py import-genotype \
  ~/genotype_files/genotype/BXD.geno \
  ./lmdb_store

#### Output when dataset exists

Parsing BXD.geno...
Dataset: BXD
Type: riset
Founders: ['B', 'D']
Matrix: 7343 markers x 198 samples

⚠ Dataset 'BXD' already exists!
Current version: 3
Hash: b8e4c1d2a5f3e789...

update-genotype to create a new version.

Updating (create new version)

python lmdb_matrix.py update-genotype BXD file.geno ./lmdb_store \
  --author "qc_pipeline" \
  --reason "QC correction"

import-genotype vs update-genotype

| Command         | Creates     | Version     | Storage              | History      |
|-----------------|-------------|-------------|----------------------|--------------|
| import-genotype | New dataset | v1 (always) | Full matrix          | Starts fresh |
| update-genotype | New version | v2, v3...   | Delta (changes only) | Preserved    |
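To make the "Delta (changes only)" column concrete, here is a minimal sketch of a sparse delta between two genotype matrices. It records only the cells that differ and can be applied to the base matrix to recover the new one. The helper names `compute_delta` and `apply_delta` and the dictionary layout are assumptions for illustration; the actual on-disk delta encoding used by lmdb_matrix.py may differ.

```python
def compute_delta(old, new):
    """Return {(row, col): new_value} for every cell that differs."""
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        # A shape change cannot be expressed as a cell-wise delta
        raise ValueError("shape changed; store a full snapshot instead")
    return {
        (i, j): new_val
        for i, (old_row, new_row) in enumerate(zip(old, new))
        for j, (old_val, new_val) in enumerate(zip(old_row, new_row))
        if old_val != new_val
    }

def apply_delta(base, delta):
    """Return a copy of base with the delta's cells overwritten."""
    out = [list(row) for row in base]
    for (i, j), value in delta.items():
        out[i][j] = value
    return out

# Toy matrices: rows are markers, columns are samples
v1 = [["B", "D", "H"], ["D", "D", "U"]]
v2 = [["B", "D", "B"], ["D", "D", "D"]]

delta = compute_delta(v1, v2)
print(delta)                        # {(0, 2): 'B', (1, 2): 'D'}
print(apply_delta(v1, delta) == v2)  # True
```

When only a few genotype calls change between versions, a delta like this is much smaller than the full matrix.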

Updating a Dataset (Creating New Versions)

guix shell python-wrapper python-click python-lmdb python-numpy -- \
  python lmdb_matrix.py update-genotype \
  BXD_2018 \
  ~/genotype_files/genotype/BXD_corrected.geno \
  ./lmdb_store \
  --author "qc_pipeline" \
  --reason "QC correction"

Verify

# Full verify: recomputes all hashes from payloads (thorough but slow)
python lmdb_matrix.py verify BXD ./lmdb_store

# Fast verify: checks chain linkage only (quick routine check)
python lmdb_matrix.py verify BXD ./lmdb_store --fast
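The difference between the two modes can be sketched with a toy hash chain. This is an illustration only: it assumes SHA-256 and a scheme where each version's hash covers the previous hash plus the payload, which may not match what lmdb_matrix.py actually does. The names `entry_hash`, `append_version`, and `verify` are hypothetical.

```python
import hashlib

def entry_hash(prev_hash, payload):
    """Hash of one version: covers the previous hash and this payload."""
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def append_version(chain, payload):
    prev = chain[-1]["hash"] if chain else ""
    chain.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})

def verify(chain, fast=False):
    """Fast mode checks linkage only; full mode also recomputes each hash."""
    prev = ""
    for entry in chain:
        if entry["prev"] != prev:
            return False  # broken link between versions
        if not fast and entry_hash(prev, entry["payload"]) != entry["hash"]:
            return False  # payload no longer matches its recorded hash
        prev = entry["hash"]
    return True

chain = []
for payload in (b"full snapshot v1", b"delta v2", b"delta v3"):
    append_version(chain, payload)

print(verify(chain), verify(chain, fast=True))  # True True

# Corrupt a payload without touching the recorded hashes
chain[1]["payload"] = b"tampered"
print(verify(chain, fast=True))  # True: linkage is still intact
print(verify(chain))             # False: recomputed hash does not match
```

In this sketch the fast check does not notice a modified payload, which is why the full check is the thorough one.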


Export to .geno file

# Default: no header comments, clean output
python lmdb_matrix.py export-genotype BXD ./lmdb_store BXD_exported.geno

# With header comments
python lmdb_matrix.py export-genotype BXD ./lmdb_store BXD_exported.geno --comments

# Specific version
python lmdb_matrix.py export-genotype BXD ./lmdb_store BXD_v2.geno --version 2

Stats (storage analysis)

python lmdb_matrix.py stats BXD ./lmdb_store

Expected output:

Dataset: BXD
--------------------------------------------------
Total versions:       12
Full snapshots:       2
Deltas:               10

Total payload:        5662.66 KB
  Full snapshots:     2831.19 KB
  Deltas:             2831.48 KB

Without deltas:       16987.12 KB
Storage savings:      66.7%
--------------------------------------------------
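The savings figure follows from the two totals in the report. The short check below reproduces it from the numbers shown above. It also tests one reading of "Without deltas", namely that all 12 versions would be stored as full snapshots of the average snapshot size; that interpretation is an inference from the numbers, not something the tool documents.

```python
total_payload_kb = 5662.66    # "Total payload" from the report
without_deltas_kb = 16987.12  # "Without deltas" from the report

savings = 1 - total_payload_kb / without_deltas_kb
print(f"Storage savings: {savings:.1%}")  # Storage savings: 66.7%

# Assumed interpretation: every version stored as a full snapshot
versions = 12
mean_snapshot_kb = 2831.19 / 2  # two full snapshots in the report
estimated = versions * mean_snapshot_kb
print(f"Estimated without deltas: {estimated:.2f} KB")
```

The estimate lands within a few hundredths of a KB of the reported value, so the small differences are likely rounding.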

Running Tests

All Tests

guix shell python-wrapper python-lmdb python-numpy python-pytest -- \
  python -m pytest tests/ -v

Specific Tests

guix shell python-wrapper python-numpy python-pytest -- \
  python -m pytest tests/test_hashing.py -v

TODOs

  • [x] Import a .geno file → stored as version 1 with full snapshot
  • [x] Update with a new .geno → creates version 2, 3, ...
  • [x] Changes → stored as sparse delta or full snapshot
  • [x] Reconstruct any version exactly as it existed
  • [x] Verify hash chain integrity (full or fast mode)
  • [x] Diff two versions to see what changed
  • [x] Export back to .geno format (round-trip verified)
  • [x] Detect tampering — any corrupted payload or broken chain is caught
  • [ ] Add compression for stored payloads
  • [ ] Rework handling of genotypes with multiple founders
  • [ ] User testing (cc @flisso)
  • [ ] CD integration
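The "reconstruct any version" item above can be sketched as follows: find the nearest full snapshot at or before the requested version, then replay the deltas after it. The record layout (`type` and `data` fields, deltas as `{(row, col): value}`) and the function name `reconstruct` are assumptions for illustration, not the tool's actual storage format.

```python
def reconstruct(versions, target):
    """Rebuild version `target` (1-based) from snapshots and deltas."""
    if not 1 <= target <= len(versions):
        raise ValueError(f"no such version: {target}")
    # Walk back to the nearest full snapshot at or before the target
    start = target - 1
    while start >= 0 and versions[start]["type"] != "full":
        start -= 1
    if start < 0:
        raise ValueError("no full snapshot found before target version")
    matrix = [list(row) for row in versions[start]["data"]]
    # Replay every delta between that snapshot and the target
    for entry in versions[start + 1:target]:
        for (i, j), value in entry["data"].items():
            matrix[i][j] = value
    return matrix

# Toy history: version 1 is a full snapshot, as with import-genotype
versions = [
    {"type": "full", "data": [["B", "D"], ["H", "U"]]},
    {"type": "delta", "data": {(0, 1): "B"}},
    {"type": "full", "data": [["D", "D"], ["D", "D"]]},
    {"type": "delta", "data": {(1, 0): "B"}},
]

print(reconstruct(versions, 2))  # [['B', 'B'], ['H', 'U']]
print(reconstruct(versions, 4))  # [['D', 'D'], ['B', 'D']]
```

Periodic full snapshots bound how many deltas must be replayed for any one version.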

References

See the repo for dumping genotypes to LMDB for more options:
