
Pangenotypes

Here we discuss different storage solutions for pangenotypes.

GRG format

While looking into graph genotyping I ran into Genotype Representation Graphs (GRG).

It has a binary storage format that represents something like:

# GRG file example: genotype graph
# Nodes section: NODE <id> <label> allele=<genotype>
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG

# Edges section: EDGE <from_id> <to_id>
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
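To make the shape of that data concrete, here is a minimal Python sketch that reads the illustrative text listing above into plain dictionaries. Note the assumptions: the NODE/EDGE text layout is only the simplified picture used on this page, not the actual GRG binary format, and `parse_graph` is a hypothetical helper, not part of any GRG tooling.

```python
from collections import defaultdict

# The illustrative listing from this page (not the real GRG binary format).
EXAMPLE = """\
# Nodes section: NODE <id> <label> allele=<genotype>
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG

# Edges section: EDGE <from_id> <to_id>
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
"""


def parse_graph(text):
    """Parse NODE/EDGE lines into (nodes, edges) dictionaries.

    nodes maps id -> {"label": ..., "allele": ...}
    edges maps from_id -> list of to_ids (directed adjacency list)
    """
    nodes = {}
    edges = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "NODE":
            node_id, label, attr = fields[1:4]
            key, _, value = attr.partition("=")
            nodes[int(node_id)] = {"label": label, key: value}
        elif fields[0] == "EDGE":
            edges[int(fields[1])].append(int(fields[2]))
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    return nodes, dict(edges)


nodes, edges = parse_graph(EXAMPLE)
```

With this, `edges[1]` lists the nodes reachable from node 1 and `nodes[2]["allele"]` gives the genotype stored on node 2.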

The tooling builds with:

guix shell -C -N coreutils gcc-toolchain make cmake openssl nss-certs git pkg-config zlib

I did some tests and read the source code. The nice thing is that they have ideas very similar to ours. Unfortunately the implementation is not what we want. I wonder why people always reinvent data structures :/. To get an idea:

I would like to take similar ideas and turn them into an efficient in-memory graph structure that is easily extensible. RDF is key for extensions (and queries). A fast RDF implementation we are going to try is

Toshiaki pointed out we should look at qlever instead:
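Independent of which triple store ends up being used, the RDF idea itself can be sketched in a few lines of Python: every node attribute and every edge becomes one triple, so an extension is just another predicate. Everything here is an assumption for illustration: the `http://example.org/pangenotype/` namespace, the predicate names and the `to_ntriples` helper are hypothetical, and the output is plain N-Triples text rather than anything specific to a particular store.

```python
BASE = "http://example.org/pangenotype/"  # hypothetical namespace


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(nodes, edges):
    """Turn node attributes and directed edges into N-Triples lines."""
    lines = []
    for node_id, attrs in sorted(nodes.items()):
        subject = f"<{BASE}node/{node_id}>"
        for key, value in sorted(attrs.items()):
            # one triple per attribute; new attributes need no schema change
            lines.append(f'{subject} <{BASE}{key}> "{escape_literal(value)}" .')
    for src, targets in sorted(edges.items()):
        for dst in targets:
            lines.append(
                f"<{BASE}node/{src}> <{BASE}edge> <{BASE}node/{dst}> ."
            )
    return lines


# Tiny two-node graph in the same spirit as the example listing.
nodes = {
    1: {"label": "GeneA", "allele": "AA"},
    2: {"label": "GeneB", "allele": "AG"},
}
edges = {1: [2]}

triples = to_ntriples(nodes, edges)
print("\n".join(triples))
```

Adding, say, a frequency or a sample annotation to a node would only add one more key to its dictionary and one more triple to the output, which is the kind of extensibility meant above.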
