Pangenotypes

Here we discuss different storage solutions for pangenotypes.

GRG format

While looking for graph genotyping I ran into Genotype Representation Graphs (GRG):

https://pmc.ncbi.nlm.nih.gov/articles/PMC11071416/

It has a binary storage format. As a rough text rendering (this is an illustration, not the on-disk layout) it represents something like:

# GRG file example: genotype graph
# Nodes section: NODE <id> <label> allele=<genotype>
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG

# Edges section: EDGE <from_id> <to_id>
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
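
To make the shape of that listing concrete, here is a minimal Python sketch that reads it into an in-memory node table and adjacency list. It parses only the illustrative NODE/EDGE text above, not the real binary GRG file, and the function name `parse_listing` is made up for this note.

```python
from collections import defaultdict

# The illustrative listing from above (NOT the real binary GRG format).
EXAMPLE = """\
# Nodes section: NODE <id> <label> allele=<genotype>
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG

# Edges section: EDGE <from_id> <to_id>
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
"""


def parse_listing(text):
    """Return (nodes, children) parsed from the NODE/EDGE text listing."""
    nodes = {}                    # node id -> (label, allele)
    children = defaultdict(list)  # node id -> ids it points to, in file order
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "NODE" and len(fields) == 4:
            key, _, allele = fields[3].partition("=")
            if key != "allele":
                raise ValueError(f"unexpected attribute: {line!r}")
            nodes[int(fields[1])] = (fields[2], allele)
        elif fields[0] == "EDGE" and len(fields) == 3:
            children[int(fields[1])].append(int(fields[2]))
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    return nodes, dict(children)


nodes, children = parse_listing(EXAMPLE)
print(nodes[2], children[1])
```

The adjacency list keeps edges in file order, so walking `children` from any node reproduces the listing.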

The tooling

https://github.com/aprilweilab/grgl.git

builds with

guix shell -C -N coreutils gcc-toolchain make cmake openssl nss-certs git pkg-config zlib

I did some tests and read the source code. The nice thing is that they have very similar ideas to ours. Unfortunately the implementation is not what we want. I wonder why people always reinvent data structures :/. To get an idea:

https://github.com/aprilweilab/grgl/blob/main/src/serialize.cpp

I would like to take similar ideas and bring them to an efficient in-memory graph structure that is easily extensible. RDF is key for extensions (and queries). A fast RDF implementation we are going to try is

https://pyoxigraph.readthedocs.io/en/stable/index.html

Toshiaki pointed out we should look at QLever instead:

https://github.com/ad-freiburg/qlever
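
Whichever store we end up with, the graph has to be expressed as triples first. Below is a minimal Python sketch of that step, using only the standard library. The namespace `http://example.org/pangenotype/` and the predicate names `label`, `allele` and `child` are placeholders invented for this note, not an agreed vocabulary, and the small graph is just a subset of the toy listing in the GRG section.

```python
# Hypothetical namespace and predicate names, for illustration only.
BASE = "http://example.org/pangenotype/"

# A small toy graph: node id -> (label, allele), plus directed edges.
nodes = {1: ("GeneA", "AA"), 2: ("GeneB", "AG"), 4: ("GeneD", "AA")}
edges = [(1, 2), (2, 4)]


def escape_literal(value):
    """Escape the characters that N-Triples string literals require."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(nodes, edges, base=BASE):
    """Return a list of N-Triples lines describing nodes and edges."""
    lines = []
    for node_id, (label, allele) in sorted(nodes.items()):
        subject = f"<{base}node/{node_id}>"
        lines.append(f'{subject} <{base}label> "{escape_literal(label)}" .')
        lines.append(f'{subject} <{base}allele> "{escape_literal(allele)}" .')
    for src, dst in edges:
        lines.append(f"<{base}node/{src}> <{base}child> <{base}node/{dst}> .")
    return lines


triples = to_ntriples(nodes, edges)
print("\n".join(triples))
```

N-Triples is a line-based W3C serialization, so the output can be written to a file and handed to an RDF store's loader. Extra annotations then become extra predicates on the same node URIs, which is the extensibility we are after.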