Here we discuss different storage solutions for pangenotypes.
While looking into graph genotyping I ran into Genotype Representation Graphs (GRG).
It has a binary storage format that represents something like:
# GRG file example: genotype graph
# Nodes section: NODE <id> <label> allele=<genotype>
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG
# Edges section: EDGE <from_id> <to_id>
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
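To make the layout above concrete, here is a minimal Python sketch that reads the NODE/EDGE lines into a dictionary of nodes and an adjacency list. It assumes the illustrative text layout shown in these notes, not the actual GRG binary format, and the names `parse_graph` and `EXAMPLE` are made up for this example.

```python
from collections import defaultdict

# Illustrative text layout from the notes above (not the real binary GRG format).
EXAMPLE = """\
# GRG file example: genotype graph
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
"""


def parse_graph(text):
    """Return (nodes, edges) parsed from NODE/EDGE lines.

    nodes maps node id -> (label, genotype); edges maps node id -> list of target ids.
    """
    nodes = {}
    edges = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "NODE" and len(fields) == 4:
            # "allele=AG" -> "AG"
            genotype = fields[3].split("=", 1)[1]
            nodes[int(fields[1])] = (fields[2], genotype)
        elif fields[0] == "EDGE" and len(fields) == 3:
            edges[int(fields[1])].append(int(fields[2]))
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    return nodes, dict(edges)


if __name__ == "__main__":
    nodes, edges = parse_graph(EXAMPLE)
    print(nodes)
    print(edges)
```

Keeping the edges as an adjacency list makes the cycle in the example (5 back to 1) visible as an ordinary entry rather than a special case.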
The tooling builds with
guix shell -C -N coreutils gcc-toolchain make cmake openssl nss-certs git pkg-config zlib
I did some tests and read the source code. The nice thing is that they have very similar ideas. Unfortunately the implementation is not what we want. I wonder why people always reinvent data structures :/. To get an idea:
I would like to take similar ideas and turn them into an efficient in-memory graph structure that is easily extensible. RDF is key for extensions (and queries). A fast RDF implementation we are going to try is
Toshiaki pointed out we should look at qlever instead:
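Independent of which RDF store ends up being used, the idea of keeping the graph in memory and exposing it as RDF can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the namespace `http://example.org/pangenotype/`, the predicates `allele` and `edgeTo`, and the function `to_ntriples` are hypothetical and not taken from GRG or any existing vocabulary. The output is plain N-Triples, which RDF stores generally accept for loading.

```python
# Hypothetical namespace and predicates, chosen only for this sketch.
BASE = "http://example.org/pangenotype/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Small in-memory graph: node id -> (label, genotype), node id -> target ids.
NODES = {1: ("GeneA", "AA"), 2: ("GeneB", "AG"), 3: ("GeneC", "GG")}
EDGES = {1: [2, 3], 2: [3]}


def literal(value):
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(nodes, edges):
    """Serialise nodes and edges as a list of N-Triples lines."""
    lines = []
    for node_id, (label, genotype) in sorted(nodes.items()):
        subject = f"<{BASE}node/{node_id}>"
        lines.append(f"{subject} <{RDFS_LABEL}> {literal(label)} .")
        lines.append(f"{subject} <{BASE}allele> {literal(genotype)} .")
    for source, targets in sorted(edges.items()):
        for target in targets:
            lines.append(
                f"<{BASE}node/{source}> <{BASE}edgeTo> <{BASE}node/{target}> ."
            )
    return lines


if __name__ == "__main__":
    print("\n".join(to_ntriples(NODES, EDGES)))
```

Extending the graph then amounts to emitting more triples about the same node IRIs, without touching the in-memory structure.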