Implement LMDB-based storage for genotype matrices to support fast, scalable access to large genotype datasets.
The system should support efficient storage and retrieval of genotype data along with metadata (samples, markers, positions, chromosomes). It should also include versioning of datasets and cryptographic verification to ensure data integrity across updates.
Quick usage setup:

    # Enter environment
    guix shell python-wrapper python-click python-lmdb python-numpy

    # Import dataset
    python lmdb_matrix.py import-genotype file.geno ./lmdb_store

    # Force overwrite
    python lmdb_matrix.py import-genotype file.geno ./lmdb_store --force

    # Update dataset (new version)
    python lmdb_matrix.py update-genotype DATASET file.geno ./lmdb_store

    # List datasets
    python lmdb_matrix.py list-datasets ./lmdb_store

    # List versions
    python lmdb_matrix.py list-versions DATASET ./lmdb_store

    # Verify integrity
    python lmdb_matrix.py verify DATASET ./lmdb_store

    # Reconstruct latest
    python lmdb_matrix.py reconstruct DATASET ./lmdb_store

    # Reconstruct specific version
    python lmdb_matrix.py reconstruct DATASET ./lmdb_store --version 2
For detailed explanations, examples, and outputs, see sections below.
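You need Guix installed. To enter a shell with the runtime and test dependencies:

    guix shell python-wrapper python-click python-lmdb python-numpy python-pytest

Alternatively, start a container from the manifest:

    guix shell --container --manifest=manifest.scm --network --share=$HOME/genotyping -- bash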
#### Basic import
    guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py import-genotype ~/genotype_files/genotype/BXD.geno ./lmdb_store
#### With custom dataset ID and metadata
    guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py import-genotype ~/genotype_files/genotype/BXD.geno ./lmdb_store --dataset-id "BXD_2018" --author "Arthur Centeno" --reason "Initial import from GeneNetwork"
#### Expected output
    Parsing BXD.geno...
    Dataset: BXD
    Type: riset
    Founders: ['B', 'D']
    Matrix: 7343 markers x 198 samples
    ✓ Stored as version 1
    Dataset ID: BXD_2018
    Hash: a3f7b2c8d9e1f456...
    Storage type: full
    Timestamp: 2024-01-15T10:30:00Z
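The header fields in this output (dataset name, type, founders) and the matrix dimensions come from the .geno file itself. The following is a minimal sketch of how such a file could be parsed, assuming the usual GeneNetwork layout: `#` comment lines, `@key:value` metadata lines, then a tab-separated table whose first four columns are Chr, Locus, cM, and Mb. It is not the parser used by `lmdb_matrix.py`; the function name `parse_geno` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_geno(text: str) -> Tuple[Dict[str, str], List[str], List[str], List[List[str]]]:
    """Split .geno text into metadata, sample names, marker names, and genotype rows."""
    metadata: Dict[str, str] = {}
    samples: List[str] = []
    markers: List[str] = []
    matrix: List[List[str]] = []
    header_seen = False
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("@"):
            # Metadata line such as "@type:riset"
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
            continue
        fields = line.split("\t")
        if not header_seen:
            # Assumed column order: Chr, Locus, cM, Mb, then one column per sample
            samples = fields[4:]
            header_seen = True
            continue
        markers.append(fields[1])
        matrix.append(fields[4:])
    return metadata, samples, markers, matrix


# Toy input, not real BXD data.
example = (
    "# toy file\n"
    "@name:BXD\n"
    "@type:riset\n"
    "@mat:B\n"
    "@pat:D\n"
    "Chr\tLocus\tcM\tMb\tBXD1\tBXD2\n"
    "1\trs0001\t1.5\t3.0\tB\tD\n"
    "1\trs0002\t2.0\t4.1\tD\tD\n"
)

metadata, samples, markers, matrix = parse_geno(example)
print(metadata["name"], metadata["type"], [metadata["mat"], metadata["pat"]])
print(f"Matrix: {len(markers)} markers x {len(samples)} samples")
```

Keeping metadata, sample names, marker names, and the genotype rows as separate values mirrors the way the tool is described as storing the matrix alongside its metadata.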
    guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py import-genotype ~/genotype_files/genotype/BXD.geno ./lmdb_store
#### Output when dataset exists
    Parsing BXD.geno...
    Dataset: BXD
    Type: riset
    Founders: ['B', 'D']
    Matrix: 7343 markers x 198 samples
    ⚠ Dataset 'BXD' already exists!
    Current version: 3
    Hash: b8e4c1d2a5f3e789...
    update-genotype to create a new version.
    python lmdb_matrix.py update-genotype BXD file.geno ./lmdb_store \
      --author "qc_pipeline" \
      --reason "QC correction"
| Command          | Creates     | Version     | Storage              | History      |
|------------------|-------------|-------------|----------------------|--------------|
| import-genotype  | New dataset | v1 (always) | Full matrix          | Starts fresh |
| update-genotype  | New version | v2, v3...   | Delta (changes only) | Preserved    |
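The difference between a full matrix and a delta can be illustrated with a small sketch. It assumes a delta is simply a list of `(row, column, new value)` triples for the cells that changed. The actual delta encoding used by `lmdb_matrix.py` is not described here, and the names `compute_delta` and `apply_delta` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[str]]
Delta = List[Tuple[int, int, str]]


def compute_delta(old: Matrix, new: Matrix) -> Delta:
    """Return (row, col, value) for every cell that differs between old and new."""
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        raise ValueError("this sketch assumes matrices of identical shape")
    return [
        (i, j, new_val)
        for i, (old_row, new_row) in enumerate(zip(old, new))
        for j, (old_val, new_val) in enumerate(zip(old_row, new_row))
        if old_val != new_val
    ]


def apply_delta(base: Matrix, delta: Delta) -> Matrix:
    """Rebuild the newer matrix from a base matrix plus its delta."""
    result = [row[:] for row in base]  # copy so the base version stays untouched
    for i, j, value in delta:
        result[i][j] = value
    return result


# Two markers x three samples; one genotype call is corrected in v2.
v1 = [["B", "D", "H"], ["D", "D", "B"]]
v2 = [["B", "D", "B"], ["D", "D", "B"]]

delta = compute_delta(v1, v2)
print(delta)                         # [(0, 2, 'B')]
print(apply_delta(v1, delta) == v2)  # True
```

When only a few calls change between versions, the delta is far smaller than the matrix, which is why an update can be stored cheaply while earlier versions remain reconstructable.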
    guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py update-genotype BXD_2018 ~/genotype_files/genotype/BXD_corrected.geno ./lmdb_store --author "qc_pipeline" --reason "QC correction"
    # Full verify: recomputes all hashes from payloads (thorough but slow)
    python lmdb_matrix.py verify BXD ./lmdb_store

    # Fast verify: checks chain linkage only (quick routine check)
    python lmdb_matrix.py verify BXD ./lmdb_store --fast
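To show why the two modes differ in cost and coverage, here is a minimal hash-chain sketch. It assumes each version record stores its predecessor's hash and a SHA-256 digest over that hash plus the payload. The record layout, the `GENESIS` placeholder, and the function names are illustrative assumptions, not the tool's actual scheme.

```python
import hashlib
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder "previous hash" for version 1


def record_hash(prev_hash: str, payload: bytes) -> str:
    """Hash a version's payload together with its predecessor's hash."""
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_version(chain: List[Dict], payload: bytes) -> None:
    """Add a new version record linked to the current chain tip."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": record_hash(prev_hash, payload),
    })


def verify(chain: List[Dict], fast: bool = False) -> bool:
    """Fast mode checks linkage only; full mode also recomputes every hash."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev_hash"] != prev_hash:
            return False  # broken link between versions
        if not fast and record_hash(prev_hash, record["payload"]) != record["hash"]:
            return False  # payload no longer matches its stored hash
        prev_hash = record["hash"]
    return True


chain: List[Dict] = []
append_version(chain, b"full matrix v1")
append_version(chain, b"delta v2")
print(verify(chain), verify(chain, fast=True))  # True True

# Corrupt a payload without touching the stored hashes.
chain[1]["payload"] = b"tampered"
print(verify(chain, fast=True), verify(chain))  # True False
```

In this sketch the fast check never reads payloads, so it cannot detect a corrupted payload; only the full check does. That trade-off matches the "thorough but slow" versus "quick routine check" distinction above.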
    # Default: no header comments, clean output
    python lmdb_matrix.py export-genotype BXD ./lmdb_store BXD_exported.geno

    # With header comments
    python lmdb_matrix.py export-genotype BXD ./lmdb_store BXD_exported.geno --comments

    # Specific version
    python lmdb_matrix.py export-genotype BXD ./lmdb_store BXD_v2.geno --version 2
    python lmdb_matrix.py stats BXD ./lmdb_store

Expected results:

    Dataset: BXD
    --------------------------------------------------
    Total versions: 12
    Full snapshots: 2
    Deltas: 10
    Total payload: 5662.66 KB
    Full snapshots: 2831.19 KB
    Deltas: 2831.48 KB
    Without deltas: 16987.12 KB
    Storage savings: 66.7%
    --------------------------------------------------
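The savings figure can be checked by hand from the numbers in this output. The sketch below assumes that savings are computed as one minus the ratio of the actual payload to the "Without deltas" size, and that "Without deltas" means every version stored as a full snapshot. Both are inferred from the reported figures, not taken from the tool's source.

```python
# Figures copied from the stats output (KB).
total_payload_kb = 5662.66
full_snapshots_kb = 2831.19
without_deltas_kb = 16987.12
n_versions = 12
n_full_snapshots = 2

# Assumed formula: fraction of space saved relative to storing only full snapshots.
savings = 1 - total_payload_kb / without_deltas_kb
print(f"Storage savings: {savings:.1%}")  # Storage savings: 66.7%

# Cross-check: average full snapshot size times the number of versions
# should land close to the reported "Without deltas" figure.
estimate_kb = full_snapshots_kb / n_full_snapshots * n_versions
print(f"Estimated size without deltas: {estimate_kb:.2f} KB")
```

The estimate differs from the reported value only in the last digits, which is consistent with rounding in the displayed figures.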
    guix shell python-wrapper python-lmdb python-numpy python-pytest -- python -m pytest tests/ -v

    guix shell python-wrapper python-numpy python-pytest -- python -m pytest tests/test_hashing.py -v
See the repo for dumping genotypes to LMDB for more options: