Implement LMDB-based storage for genotype matrices to support fast, scalable access to large genotype datasets.
The system should support efficient storage and retrieval of genotype data along with its metadata (samples, markers, positions, chromosomes). It should also version datasets and verify them cryptographically to ensure data integrity across updates.
Quick usage setup:

    # Enter environment
    guix shell python-wrapper python-click python-lmdb python-numpy

    # Import dataset
    python lmdb_matrix.py import-genotype file.geno ./lmdb_store

    # Force overwrite
    python lmdb_matrix.py import-genotype file.geno ./lmdb_store --force

    # Update dataset (new version)
    python lmdb_matrix.py update-genotype DATASET file.geno ./lmdb_store

    # List datasets
    python lmdb_matrix.py list-datasets ./lmdb_store

    # List versions
    python lmdb_matrix.py list-versions DATASET ./lmdb_store

    # Verify integrity
    python lmdb_matrix.py verify DATASET ./lmdb_store

    # Reconstruct latest
    python lmdb_matrix.py reconstruct DATASET ./lmdb_store

    # Reconstruct specific version
    python lmdb_matrix.py reconstruct DATASET ./lmdb_store --version 2

For detailed explanations, examples, and outputs, see the sections below.
You need Guix installed.
guix shell python-wrapper python-click python-lmdb python-numpy python-pytest
guix shell --container --manifest=manifest.scm --network --share=$HOME/genotyping -- bash
#### Basic import
guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py import-genotype ~/genotype_files/genotype/BXD.geno ./lmdb_store
#### With custom dataset ID and metadata
guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py import-genotype ~/genotype_files/genotype/BXD.geno ./lmdb_store --dataset-id "BXD_2018" --author "Arthur Centeno" --reason "Initial import from GeneNetwork"
#### Expected output

    Parsing BXD.geno...
    Dataset: BXD
    Type: riset
    Founders: ['B', 'D']
    Matrix: 7343 markers x 198 samples
    ✓ Stored as version 1
    Dataset ID: BXD_2018
    Hash: a3f7b2c8d9e1f456...
    Storage type: full
    Timestamp: 2024-01-15T10:30:00Z

guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py import-genotype ~/genotype_files/genotype/BXD.geno ./lmdb_store
#### Output when dataset exists

    Parsing BXD.geno...
    Dataset: BXD
    Type: riset
    Founders: ['B', 'D']
    Matrix: 7343 markers x 198 samples
    ⚠ Dataset 'BXD' already exists!
    Current version: 3
    Hash: b8e4c1d2a5f3e789...
    Use --force to import anyway, or use update-genotype to create a new version.

If you want to overwrite an existing dataset without creating a new version, use the --force flag.
WARNING: This will replace the existing dataset and may remove previous history.
guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py import-genotype ~/genotype_files/genotype/BXD.geno ./lmdb_store --force
#### When to use --force
#### When NOT to use it
#### Expected behavior

    ⚠ Dataset 'BXD' already exists!
    Forcing overwrite...
    ✓ Dataset replaced successfully
    Dataset ID: BXD
    Version reset to: v1
    Storage type: full

| Command         | Creates     | Version     | Storage              | History      |
|-----------------|-------------|-------------|----------------------|--------------|
| import-genotype | New dataset | v1 (always) | Full matrix          | Starts fresh |
| update-genotype | New version | v2, v3...   | Delta (changes only) | Preserved    |
guix shell python-wrapper python-click python-lmdb python-numpy -- python lmdb_matrix.py update-genotype BXD_2018 ~/genotype_files/genotype/BXD_corrected.geno ./lmdb_store --author "qc_pipeline" --reason "QC correction"
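
To make the "full matrix vs. delta" distinction concrete, here is a minimal sketch of how a delta could be computed and replayed to reconstruct a given version. It is illustrative only and is not taken from `lmdb_matrix.py`: the function names (`compute_delta`, `apply_delta`, `reconstruct`), the cell-level delta format, and the use of plain Python lists are all assumptions, and the sketch assumes that an update never changes the matrix shape.

```python
from typing import Dict, List, Tuple

# Hypothetical types for this sketch (not from lmdb_matrix.py).
Matrix = List[List[str]]            # rows = markers, columns = samples
Delta = Dict[Tuple[int, int], str]  # (marker row, sample column) -> new call


def compute_delta(old: Matrix, new: Matrix) -> Delta:
    """Record only the cells whose value changed between two versions."""
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        raise ValueError("this sketch assumes both versions share the same shape")
    return {
        (i, j): new_val
        for i, (old_row, new_row) in enumerate(zip(old, new))
        for j, (old_val, new_val) in enumerate(zip(old_row, new_row))
        if old_val != new_val
    }


def apply_delta(base: Matrix, delta: Delta) -> Matrix:
    """Return a copy of `base` with the changed cells overwritten."""
    result = [row[:] for row in base]
    for (i, j), value in delta.items():
        result[i][j] = value
    return result


def reconstruct(full_v1: Matrix, deltas: List[Delta], version: int) -> Matrix:
    """Rebuild `version` (1-based) by replaying deltas on top of the full v1."""
    if not 1 <= version <= len(deltas) + 1:
        raise ValueError(f"unknown version {version}")
    matrix = [row[:] for row in full_v1]
    for delta in deltas[: version - 1]:
        matrix = apply_delta(matrix, delta)
    return matrix


# Illustrative data: two markers x three samples.
v1 = [["B", "D", "H"], ["D", "D", "B"]]
v2 = [["B", "D", "B"], ["D", "U", "B"]]

delta_v2 = compute_delta(v1, v2)
print(delta_v2)                                  # {(0, 2): 'B', (1, 1): 'U'}
print(reconstruct(v1, [delta_v2], 2) == v2)      # True
```

Storing only the changed cells keeps each update small when a QC correction touches a few calls, at the cost of replaying every delta when an older or newer version is reconstructed.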
hash_v1 = SHA256("MATRIX_V1" + v1_data + null)
hash_v2 = SHA256("MATRIX_V1" + v2_delta + hash_v1)
hash_v3 = SHA256("MATRIX_V1" + v3_delta + hash_v2)
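
The chain above can be sketched with Python's standard `hashlib`. This is a hedged illustration rather than the actual implementation: it assumes that `+` means byte concatenation, that `null` for the first version is encoded as empty bytes, and that each payload (full matrix or delta) has already been serialized to bytes. The helper names `chain_hash`, `build_chain`, and `verify_chain` are hypothetical.

```python
import hashlib
from typing import List, Optional, Sequence

# Prefix taken verbatim from the formulas above; it is the same for every version.
DOMAIN_TAG = b"MATRIX_V1"


def chain_hash(payload: bytes, prev_hash: Optional[bytes]) -> bytes:
    """Return SHA256(tag + payload + previous hash) as raw bytes."""
    h = hashlib.sha256()
    h.update(DOMAIN_TAG)
    h.update(payload)
    # Assumption: "null" (no previous version) is encoded as empty bytes.
    h.update(prev_hash or b"")
    return h.digest()


def build_chain(payloads: Sequence[bytes]) -> List[bytes]:
    """Hash each version in order, feeding every hash into the next one."""
    hashes: List[bytes] = []
    prev: Optional[bytes] = None
    for payload in payloads:
        prev = chain_hash(payload, prev)
        hashes.append(prev)
    return hashes


def verify_chain(payloads: Sequence[bytes], stored: Sequence[bytes]) -> bool:
    """Recompute the chain and compare it with the stored hashes."""
    return build_chain(payloads) == list(stored)


# Illustrative payloads standing in for serialized v1 data and v2/v3 deltas.
payloads = [b"v1 full matrix", b"v2 delta", b"v3 delta"]
stored_hashes = build_chain(payloads)

print(verify_chain(payloads, stored_hashes))     # True

# Tampering with v2 changes hash_v2 and, through the chain, hash_v3 as well.
tampered = [payloads[0], b"v2 delta (edited)", payloads[2]]
print(verify_chain(tampered, stored_hashes))     # False
```

Because each hash includes its predecessor, checking the latest hash is enough to detect a change in any earlier version.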