Test pangenome derived genotypes

Here we follow up on the work we did on precompute PublishData:

../systems/mariadb/precompute-publishdata

But now run against pangenome derived genotypes. For the BXD we have 23M markers(!) whereof 8M *not* on the reference genome.

Tasks

[ ] Document lmdb geno and marker information
[ ] Extract epoch information
[ ] Add BED file and link SNPS
[ ] Check MAF filter - it may be too stringent
[ ] Use ravanan/CWL to push to Octopus
[ ] Reintroduce nodes that were not annotated for position (Flavia)
[ ] GWA plotter
[ ] Speed up IO for GEMMA by using lmdb for genotypes and marker file
[ ] Use 1.5LOD score to compute QTLs instead of using 50M distance
[ ] Reduce GEMMA GRM RAM requirements (not urgent)
[ ] Fix -lmm 4 ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt

Summary

To get the mapping and generate the assoc output in mdb format we run a variant of gemma-wrapper.

The workflow essentially is:

capture the significant markers from GEMMA's mdb output (as created by gemma-wrapper)
These are transformed into RDF using the 'gemma-mdb-to-rdf.rb' script
Next we upload that RDF into virtuoso
from there download a table of start-stop data using SPARQL
We compute QTL locations using 'sparql-qtl-detect.rb'
Upload that RDF also into virtuoso

For mapping virtuoso contains four important ttl files:

marker positions in pangenome-marker graph
mapped markers in pangenome-mapped graph
computed QTL positions in pangenome-qtl graph
trait values in traits graph (nyi)

Run gemma with new lmdb output

gemma-batch-run.sh

Convert GEMMA output to RDF

Next we convert that output to RDF with

../bin/gemma-mdb-to-rdf.rb --header > output.ttl
time ../bin/gemma-mdb-to-rdf.rb --anno snps-matched.txt.mdb tmp/panlmm/*-gemma-GWA.tar.xz >> output.ttl
# 5.5 hours for 13K traits
time serdi -i turtle -o ntriples output.ttl > output.n3
real    2m22.336s
user    1m49.381s
sys     0m29.366s
SQL> SPARQL CLEAR GRAPH  <http://pangenome-mapped.genenetwork.org>;
10 min
SQL> SPARQL SELECT count(*) FROM <http://pangenome-mapped.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
126536658
SQL> ld_dir('/export/data/virtuoso/ttl','pangenome-mapped-13K.n3','http://pangenome-mapped.genenetwork.org');
SQL> rdf_loader_run (); // about 1 min per GB n3
SQL> checkpoint;
SQL> SPARQL SELECT count(*) FROM <http://pangenome-mapped.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
179955246

About 5.5 hours for 13K traits and (note that n3 files are less error prone and serdi does better than rapper with huge files) and copy the file to the virtuoso instance and load it with isql (note it may be worth search-replacing the gnt:run tag to something descriptive).

cd /export/guix-containers/virtuoso/data/virtuoso/ttl/
guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
SQL> ld_dir('/export/data/virtuoso/ttl','test-run-3000.n3','http://pan-test.genenetwork.org');
Done. -- 3 msec.
# for testing the validity and optional delete problematic ones:
SQL> SELECT * FROM DB.DBA.load_list;
SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
SQL> DELETE from DB.DBA.LOAD_LIST where LL_STATE = 1;
# commit changes
SQL> rdf_loader_run (); // about 1 min per GB n3
SQL> checkpoint;
Done. -- 16 msec.
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
34200686

Note it may be a good idea to drop graphs first. That is why we have separate subgraph spaces for every large TTL file:

log_enable(3,1);
SQL> SPARQL CLEAR GRAPH  <http://pan-test.genenetwork.org>;
SQL> SPARQL CLEAR GRAPH  <http://pan-mapped.genenetwork.org>; // 10 min
SQL> SPARQL CLEAR GRAPH  <http://pangenome-marker.genenetwork.org>;
SQL> ld_dir('/export/data/virtuoso/ttl','pangenome-markers.n3','http://pangenome-marker.genenetwork.org');
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;

For pangenomes we have a marker file, a QTL file

As a test, fetch a table of the traits with their SNPs

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>

SELECT * FROM <http://pangenome-mapped.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
         gnt:run gn:test .
?snp gnt:mappedSnp ?traitid ;
        gnt:locus ?locus ;
        gnt:lodScore ?lod ;
        gnt:af ?af .
?locus rdfs:label ?nodeid ;
         gnt:chr ?chr ;
         gnt:pos ?pos .
FILTER (contains(?nodeid,"Marker") && ?pos < 1000)
} LIMIT 100

OK, we are ready to run a little workflow. First create a sorted list of IDs.

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>

SELECT DISTINCT ?trait FROM <http://pangenome-mapped.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
         gnt:run gn:test ;
         gnt:traitId ?trait.
}

Prior work

For the first traits (presented at CTC'25) gemma was run as

echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
        -p ${PHENO_FILE} \
              -gk \
              -o percentile_result > percentile.kinship.143.txt

echo "[$(date)] Kinship matrix calculation completed for PERCENTILE."
echo "[$(date)] Starting association analysis for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
        -p ${PHENO_FILE} \
              -k ./output/percentile_result.cXX.txt \
              -lmm 4 \
              -maf 0.05 \
              -o percentile_association > percentile.assoc.143.txt

Note no LOCO.

The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything in RAM, it is a bit large for my workstation. I opted to use tux04 since no one is using it. Had to reboot the machine because it is unreliable and had crashed.

There I rebuilt gemma and set up a first run:

tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk

Without LOCO this took about 18 minutes (186% CPU), 110Gb of RAM. We ought to work on this ;) Next

/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05

To run gemma on the current 23M BXD pangenome derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). 13K traits would be 43 months on a single machine. We'll need something better. As Rob writes:

The huge majority of variants will have r2 of 1 with hundreds ir thousands of neighbors. This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions.

A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows:

chr     rs      ps      n_miss  allele1 allele0 af      beta    se      logl_H1 l_remle l_mle   p_wald  p_lrt   p_score
-9      A1-0    -9      0       A       T       0.171   -nan    -nan    -nan    1.000000e+05    1.000000e+05    -nan  -nan     -nan
-9      A2-0    -9      0       A       T       0.170   -nan    -nan    -nan    1.000000e+05    1.000000e+05    -nan  -nan     -nan

Turns out I was using the wrong pheno file. Let's try again.

/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05

As a check I can diff against the original output. So, I replicated the original run! It also ran faster at 400% CPU in 35 minutes.

(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things).

Moving to gemma-wrapper

gemma-wrapper has extra facilities, such as LOCO and caching and lmdb output. Last time we used it in

../genetics/systems/mariadb/precompute-publishdata

in a guix container it looked like

#! /bin/env sh

export TMPDIR=./tmp
curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
jq ".[] | .Id" < bxd-publish.json > ids.txt
./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json

for id in 'cat ids.txt' ; do
  echo Precomputing $id
  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
    ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
    ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
  fi
done

Let's try running the big stuff instead:

./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz  -gk

Individuals

gemma does not really track individuals. The order of genotype columns should just be the same as in the pheno file. In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following

{
  "type": "gn-geno-to-gemma",
  "genofile": "BXD.geno",
  "samples": [
    "BXD1",
    "BXD2",
    "BXD5",
...
  ],
  "numsamples": 237,
  "header": [
    "# File name: BXD_experimental_DGA_7_Dec_2021",
...

To get this

cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e s,$,\"\,,| cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e "s,$,\"\\,," > bxd_inds.list.txt
"BXD100",
"BXD101",
"BXD102",

Next I turned it into a JSON file by hand as 'bxd_inds.list.json'.

Markers

With GEMMA marker names are listed in the geno file. GEMMA also can use a SNP file that gives the chromosome and location. Without the SNP filegemma-wrapper complains it needs the SNP/marker annotation file. This is logical because for LOCO it needs to know what chromosome a marker is on.

The next step is to take the nodes file that and extract all rows from the genotype file that match nodes with chromosomes defined. Andrea is going to deliver all positions for all nodes, but for now we can use what we have. Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt:

mm10#1#chr3     23209565        93886997
mm10#1#chr3     23209564        93886999
mm10#1#chr3     23209563        93887016
...

In the genotype file we find, for example

A23209564-0, A, T, 1.919141867395325,  0.9306930597711228,  1.8201319833577734,  0.7607260422339468,  1.427392726736106,  1.2310230984252724,  1.6633662444541875,  0.6105610229068721, ...

bit funny, but you get the idea. So we can take the mm10 file and write out the genotype file again for all matching nodes with a matching SNP file that should contain for this node:

A23209564-0        93886999        3

To rewrite above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like

#! ruby

ARGF.each_line do |line|
  tag,name,pos = line.strip.split(/\t/)
  tag =~ /chr(.*)$/
  chrom = $1
  print "A#{name}-0\t#{pos}\t#{chrom}\n"
end

Now, another problem is that not all SNPs have a position in the genotype file (yet). As we can't display them I can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. That throwaway script looks like

bimbamfn = ARGV.shift
snpfn = ARGV.shift
snps = {}
open(snpfn).each_line do |snpl|
  name = snpl.split(/\t/)[0]
  snps  [name] = 1
end
open(bimbamfn).each_line do |line|
  marker = line.split(/[,\s]/)[0]
  if snps[marker]
    print line
  end
end

takes a while to run, but as this is a one-off that does not matter. Reducing the file leads to 13667900 markers with genotypes. The original SNP file has 14927024 lines. Hmmm. The overlap is therefor not perfect (we have more annotations than genotypes now). To check this I'll run a diff.

cut -f 1 -d "," 143samples.percentile.bimbam.bimbam-reduced > 143samples.percentile.bimbam.bimbam-reduced-markers
sort 143samples.percentile.bimbam.bimbam-reduced-markers > markers-sorted.txt
diff --speed-large-files  143samples.percentile.bimbam.bimbam-reduced-markers markers-sorted.txt
< A80951-0
< A80952-0
< A80953-0
...
cut -f 1 snps.txt |sort > snps-col1-sorted.txt
diff --speed-large-files snps-col1-sorted.txt markers-sorted.txt
241773d228996
< A10314686-0
241777d228999
< A10314689-0
241781d229002
< A10314692-0
grep A10314686 snps-col1-sorted.txt markers-sorted.txt
snps-col1-sorted.txt:A10314686-0
snps-col1-sorted.txt:A10314686-0
markers-sorted.txt:A10314686-0

Ah, we have duplicate annotation lines in the SNP file.

grep A10314686-0 snps.txt
A10314686-0     20257882        8
A10314686-0     20384895        8
grep A10314692-0 snps.txt
A10314692-0     20257575        8
A10314692-0     20384588        8

so, the same node is considered two snps. This is due to the node covering multiple inds (paths). Turns out a chunk of them map on different chromosomes too. I think we ought to drop them until we have a better understanding of what they represent (they may be mismapping artifacts).

I updated the script. Now I see it skips A280000 because there is no marker annotation for that node. Good. Also the number of genotype markers got further reduced to 13209385. I checked the gemma code and the SNP annotation file should match the genotype file line for line. Usurprising, perhaps, but now I need to rewrite both. After adapting the script we now have to files with the same number of lines.

Rerunning with the new files:

gemma -g new-genotypes.txt -p pheno_filtered_143.txt -gk
gemma -g new-genotypes.txt -p pheno_filtered_143.txt -k output/result.cXX.txt -maf 0.05 -lmm 4 -a snps-matched.txt

And, even though the results differ somewhat in size -- due to the different number of markers -- the results look very similar to what was produced before. Good!

Now we have confirmation and all the pieces we can run the same set with gemma-wrapper and LOCO.

gemma-wrapper

The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix. I replaced that by using our pfff hashing for larger files.

/bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json

For this computation each gemma maxed out at 80Gb RAM (total 640Gb). We are really hitting limits here. In the near future we need to check why so much data is retained. As we only have 150 individuals it is a marker thing.

/bin/time -v ../bin/gemma-wrapper -v --json --lmdb --loco --input K.json -- -g new-genotypes.txt -p pheno_filtered_143.txt -a snps-matched.txt -debug -maf 0.05 -lmm 9 > GWA.json

This time gemma requires only 25Gb per chromosome, so we can run it in one go in RAM on this large server. Much of the time is spent in IO, so I think that when we start using mmap (lmdb) we can speed it up significantly. gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores.

Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one.

After some fixes we now have loco in an lmdb output. The mdb file comes in at 693Mb. That will make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (4.0) made the size 17Mb. That makes it 215Gb total. Which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now down to 7Mb and 3.2Mb compressed. That'll total less than 100Gb for 13K traits. Good.

Final hookup

Now gemma-wrapper works (and test results are confirmed) we have to wire it up to fetch traits from the DB. We also have to make sure the trait values align with the individuals in the genotype file. Earlier I was running the script gemma-batch-run.sh:

https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-batch-run.sh

which looks like:

export TMPDIR=./tmp
curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
jq ".[] | .Id" < bxd-publish.json > ids.txt
# ---- Compute GRM
./bin/gemma-wrapper --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json

# ---- For all entries run LMM
for id in 'cat ids.txt' ; do
  echo Precomputing $id
  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
    ./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
  fi
done

We already have ids.txt and the GRM. What is required is the trait values from the DB. What we need to do is run gn-guile somewhere with access to the DB. Also I need to make sure the current gemma-wrapper tar-balls up the result.

OK, we are running. Looks like the smaller datasets only use 11Gb RES RAM per chromosome. Which means we can run two computes in parallel on this machine.

The first run came through! I forgot the --reduce flag, so it came as 190Mb. I'll fix that. 34 individuals ran in 7 minutes. We are currently runnings at a trait in 6 min. We can double that on this machine.

The following puzzles me a bit

## number of analyzed individuals = 31
## number of covariates = 1
## number of phenotypes = 1
## leave one chromosome out (LOCO) =       14
## number of total SNPs/var        = 13209385
## number of SNPS for K            = 12322657
## number of SNPS for GWAS         =   886728
## number of analyzed SNPs         = 13122153

why is the number of SNPs for GWAS low? Perhaps a threshold of 10% for maf is a bit stringent. See below.

Anyway, we are running traits and the first 500 we'll use for analysis.

Meanwhile I'll look at deploying on octopus and maybe speeding up GEMMA. See

issues/genetics/speeding-up-gemma

MAF

GEMMA has a MAF filter. For every SNP a maf is computed by adding the geno value:

maf += geno

when all genotype values are added up MAF is divided by 2x the number of individuals (minus missing).

maf /= 2.0 * (double)(ni_test - n_miss);

and this is held against the maf passed on the command line. The 2.0 therefore assumes all values are between 0 and 2.

Actually I now realise we are using LOCO. So the number of SNPs are the ones on one chromosome. That makes sense! Still we have to be careful about the MAF range. In our genotype file the values are between 0 and 2. So that is fine in itself.

RDF

Next step is to generate RDF. The SNP annotation was slow, so I moved that to lmdb. Parsing 400 traits now takes 3 minutes. The RDF file is under 1Gb and the SNP annotation RDF is 330Mb. Not too bad!

guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-snps-400.n3','http://pan-test.genenetwork.org');
Done. -- 3 msec.
# for testing the validity and optional delete problematic ones:
SQL> SELECT * FROM DB.DBA.load_list;
SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
# commit changes
SQL> rdf_loader_run ();
SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-400.n3','http://pan-test.genenetwork.org');
SQL> rdf_loader_run ();
SQL> checkpoint;
Done. -- 16 msec.
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
34200686

Or in the web interface:

SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o }

Query

The RDF is formed as:

gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedTrait;
      rdfs:label "GEMMA BXDPublish trait 10383 mapped with LOCO (defaults)";
      gnt:trait gn:publishXRef_10383;
      gnt:loco true;
      gnt:run gn:test;
      gnt:time "2025/11/10 08:12";
      gnt:belongsToGroup gn:setBxd;
      gnt:name "BXDPublish";
      gnt:traitId "10383";
      gnt:nind 14;
      gnt:mean 18.0;
      gnt:std 10.9479;
      gnt:skew 0.3926;
      gnt:kurtosis -1.1801;
      skos:altLabel "BXD_10383";
      gnt:filename "0233fa0cf277ee7d749de08b32f97c8be6478639-BXDPublish-10383-gemma-GWA.tar.xz";
      gnt:hostname "napoli";
      gnt:user "wrk".
gn:A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedLocus;
      gnt:mappedSnp gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639;
      gnt:locus gn:A8828461_0;
      gnt:lodScore 4.8;
      gnt:af 0.536;
      gnt:effect -32.859.

and SNPs are annotated as

gn:A8828461_0 a gnt:marker;
                 rdfs:label "A8828461-0";
                 gnt:chr  "1";
                 gnt:pos  3304440.
gn:A8828464_0 a gnt:marker;
                 rdfs:label "A8828464-0";
                 gnt:chr  "1";
                 gnt:pos  3304500.

To get all tested traits you can list:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>

SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?trait a gnt:mappedTrait;
         gnt:run gn:test ;
         gnt:traitId ?traitid ;
         gnt:kurtosis ?kurtosis .
} limit 100

To get all SNPs for trait "10001"

SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
         gnt:run gn:test ;
         gnt:traitId "10381" .
?snp gnt:mappedSnp ?traitid ;
        gnt:locus ?locus .
?locus rdfs:label ?nodeid ;
         gnt:chr ?chr ;
         gnt:pos ?pos .
}

Lists:

| http://genenetwork.org/id/A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 | "A8828461-0" | "1" | 3304440 |

Scoring/annotating QTL

Next step is annotating the QTL in RDF. Earlier I wrote a script rdf-analyse-gemma-hits. It uses rapper to read two RDF files (two runs) and annotates the QTL and differences between the files. The code is not pretty:

https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/bin/rdf-analyse-gemma-hits.rb#L1

The supporting library is a bit better:

https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/lib/qtlrange.rb#L1

Basically we have a QTL locus (QLocus) that tracks chr,pos,af and lod for each hit. QRange is a set of QLocus which also tracks some stats chr,min,max,snps,max_af,lod. It can compute whether two QTL (QRange) overlap. Next we have a container that tracks the QTL (QRanges) on a chromosome.

Finally there is a diff function that can show the differences on a chromosome (QRanges) for two mapped traits.

Maybe the naming could be a bit better, but the code is clear as it stands. On thing to note is that we use a fixed distance MAX_SNP_DISTANCE_BPS of 50M that decides whether a SNP falls in the same QTL. It would be worth trying to base it on dropping LOD scores (1.5 from the top). Rob and Flavia pointed out.

So, the library is fine, but the calling program is not great. The reason is that I parse RDF directly, teasing apart the logic we do in above SPARQL. I track state in dictionaries (hashes of hashes) and the result ends up convoluted. Also a lot of state in RAM. I chose RDF direct parsing because it makes for easier development. The downside is that I need to parse the whole file to make sure I have everything related to a trait. To fetch SNP results from SPARQL directly is slow too. I am in a bind.

Using curl:

time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId ?trait ; gnt:kurtosis ?k . }

time curl -G http:///sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId \"10001\" ; gnt:kurtosis ?k . ?snp gnt:mappedSnp ?traitid ; gnt:locus ?locus . }
"  > test.out
real    0m1.612s
user    0m0.020s
sys     0m0.000s

To get the trait info for 400 traits takes a second. So, that is no big deal. To get the 6K SNPs for one trait also takes a second. Hmmm. That takes hours, compared to the minutes for direct RDF parsing. Before lmdb comes to the rescue we should try running in on the virtuoso server itself. For curl we get 0.5s. Which makes it two hours for 13K traits. But when we run the query using isql it runs in 70ms which totals 15 minutes. That is perfectly fine for running the whole set!

One way is to simply script isql from the command line. Meanwhile, it also turns out the ODBC interface can be used from python or ruby. Here an example in R:

https://cran.stat.auckland.ac.nz/web/packages/virtuoso/index.html

Not sure if that is fast enough, but perhaps worth trying.

So, now we have a way to query the data around a trait in seconds. This means I can rewrite the QTL generator to go by trait. This also allows for a quick turnaround during development (good!). Also I want two scripts: one for computing the QTL and one for annotating the differences.

Alright. The first script should simply to fetch a trait with its markers from SPARQL and score the QTL (as RDF output). The new script is at

https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/sparql-qtl-detect.rb

First, the query for one trait looks like:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>

SELECT ?lod ?af ?nodeid ?chr ?pos FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
         gnt:run gn:test ;
         gnt:traitId "10002" .
?snp gnt:mappedSnp ?traitid ;
        gnt:locus ?locus ;
        gnt:lodScore ?lod ;
        gnt:af ?af .
?locus rdfs:label ?nodeid ;
         gnt:chr ?chr ;
         gnt:pos ?pos .
} ORDER BY DESC(?lod)

rendering some 22K markers for trait 10002 as a TSV:

"lod"   "af"    "nodeid"        "chr"   "pos"
7.5     0.547   "A13459298-0"   "8"     98658490
7.1     0.154   "A13402313-0"   "8"     96798487
7       0.432   "A13446492-0"   "8"     97355019
7       0.263   "A13387873-0"   "8"     94934820
7       0.585   "A4794343-0"    "1"     172265488
...

Earlier with precompute for trait 10002 we got:

[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}

so the hits are in range, but the LOD may be inflated because of the number of markers. Anyway, this point we are merely concerned with scoring QTL. The first script is simply:

qtls = QTL::QRanges.new("10002","test")
CSV.foreach(fn,headers: true, col_sep: "\t") do |hit|
    qlocus = QTL::QLocus.new(hit["nodeid"],hit["chr"],hit["pos"].to_i,hit["af"].to_f,hit["lod"].to_f)
    qtls.add_locus(qlocus)
end
print qtls

and prints a long list of QTL containing a single hit.

[10002,test] =>{"1"=>[#<QRange Chr1 𝚺1 3099543..3099543 AF=0.583 LOD=5.8..5.8>, #<QRange Chr1 𝚺1 65908328..65908328 AF=0.627 LOD=5.7..5.7>, #<QRange Chr1 𝚺1 81604902..81604902 AF=0.451 LOD=5.5..5.5>, #<QRange Chr1 𝚺2 85087169..85087177 AF=0.781 LOD=5.5..5.6>, #<QRange Chr1 𝚺1 93740525..93740525 AF=0.762 LOD=6.5..6.5>, #<QRange Chr1 𝚺1 114086053..114086053 AF=0.568 LOD=5.7..5.7>,...

For trait 10002 tweaking thresholds and rebinning we get

#<QRange Chr8 𝚺2 34.303454..35.675301 AF=0.571 LOD=5.7..5.8>
#<QRange Chr8 𝚺621 91.752748..102.722635 AF=0.663 LOD=5.6..7.5>
#<QRange Chr1 𝚺16 65.908328..175.232335 AF=0.781 LOD=5.6..7.0>
#<QRange Chr4 𝚺5 56.498971..126.135422 AF=0.657 LOD=5.6..6.4>
#<QRange Chr12 𝚺3 23.037869..58.306731 AF=0.643 LOD=5.8..6.2>
#<QRange Chr10 𝚺2 13.442071..13.442088 AF=0.641 LOD=5.8..6.0>
#<QRange Chr10 𝚺3 94.246536..103.438796 AF=0.608 LOD=5.9..6.2>
#<QRange Chr3 𝚺2 47.644513..82.451061 AF=0.548 LOD=5.7..6.2>
#<QRange Chr9 𝚺2 97.445077..120.263403 AF=0.717 LOD=5.8..5.8>
#<QRange Chr11 𝚺2 27.4058..56.30011 AF=0.559 LOD=5.7..5.7>

with a LOD>5.5 cut-off. That seems justified because LOD scores are inflated. Compare this with the earlier mapping using 'traditional' genotypes:

[10002,LOCO] =>{
"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>,
      #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>],
"8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}

we can see the significance of chr8 has gone up with pangenome mapping (relative to chr1) and we find 2 QTL now on chr8, a new one to the left. Chr1 looks similar. We have some other candidates that may or may not be relevant (all narrow!).

Note this *is* a random trait(!) and suggests the landscape of QTLs will change pretty dramatically. Note also that Andrea will give new genotypes and smoothing to follow. But it is encouraging.

I played a bit with the QTL output, and for now settled on tracking nodes that have a LOD>5.0. We drop QTL based on the following:

qtl.lod.max < 6.0 or (qtl.lod.max < 7.5 - qtl.snps.size/2)

I.e. a single SNP QTL has to have a LOD of 7.0. A 2-SNP QTL has to have a LOD of 6.5. This begets

[10002,test] =>{
"1"=>[#<QRange Chr1 𝚺69 3.099543..192.718161 AF=0.781 LOD=5.1..7.0>],
"4"=>[#<QRange Chr4 𝚺12 56.498971..147.86044 AF=0.676 LOD=5.1..6.4>],
"8"=>[#<QRange Chr8 𝚺2774 34.303454..116.023702 AF=0.899 LOD=5.1..7.5>],
"10"=>[#<QRange Chr10 𝚺7 82.334108..105.062097 AF=0.623 LOD=5.1..6.2>],
"12"=>[#<QRange Chr12 𝚺9 21.707644..72.57041 AF=0.77 LOD=5.1..6.2>]}

which are all worth considering (I think). Obviously we could annotate all QTL in RDF triples and filter on that using SPARQL. But this makes processing a bit faster without having to deal with too much noise. We can fine tune later.

Now two more steps to go:

[X] Fetch all mapped traits using SPARQL and write RDF
[X] Compare QTL between datasets and annotate new hits

Fetch all mapped traits

SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
         gnt:run gn:test ;
         gnt:traitId "10002" .
?snp gnt:mappedSnp ?traitid ;
        gnt:locus ?locus ;
        gnt:lodScore ?lod ;
        gnt:af ?af .
?locus rdfs:label ?nodeid ;
         gnt:chr ?chr ;
         gnt:pos ?pos .
} ORDER BY DESC(?lod)

The first step is to fetch this data. Let's try SPARQL over the web first.

Compare QTL sets

The previous code I wrote to compare QTLs essentially walks the QTLs and annotates a new QTL if there is no overlap between the two sets. Again, this code is too convoluted:

https://github.com/genetics-statistics/gemma-wrapper/blob/18e7a3ac8a11becba84325499116621ad095f28e/lib/qtlrange.rb#L190

The principle is straightforward, however. The code for reading the SPARQL output for a trait is

  CSV.foreach(fn,headers: true, col_sep: "\t") do |hit|
    trait_id = hit["traitid"] if not trait_id
    lod = hit["lod"].to_f
    if lod > 5.0 # set for pangenome input
      qlocus = QTL::QLocus.new(hit["snp"],hit["chr"],hit["pos"].to_f/10**6,hit["af"].to_f,lod)
      qtls.add_locus(qlocus)
    end
  end

So we can use SPARQL to build two sets on the fly and then run the diff.

Actually, when thinking about this I realised it should not be too hard to do in SPARQL to find the 'new' QTL.

SELECT * WHERE {
?traitid a gnt:mappedTrait ;
            gnt:traitId "10002" .
}
http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d
http://genenetwork.org/id/HK_trait_BXDPublish_10002_gemma_GWA_hk_assoc_txt
http://genenetwork.org/id/GEMMAMapped_test_LOCO_BXDPublish_10002_gemma_GWA_82087f23

lists the three versions of compute for traits. To fetch all QTL for first mapping:

SELECT ?qtl ?lod ?chr ?start ?stop (count(?snp) as ?snps) WHERE {
?traitid a gnt:mappedTrait ;
  gnt:traitId "10002" .
?qtl gnt:mappedQTL ?traitid ;
  gnt:qtlChr ?chr ;
  gnt:qtlStart ?start ;
  gnt:qtlStop ?stop ;
  gnt:qtlLOD ?lod .
?qtl gnt:mappedSnp ?snp .
}

gets 3 QTL. Now I did not store HK in RDF, but to show the filtering principle we can fetch two traits and compare QTL. The following gets two QTL from trait "10002" on CHR1 and holds that against that of trait "10079":

SELECT ?t ?s1 ?e1 ?t2 ?s2 ?e2 WHERE {
  ?traitid a gnt:mappedTrait ;
  gnt:traitId ?t .
  ?qtl gnt:mappedQTL ?traitid ;
  gnt:qtlChr ?chr ;
  gnt:qtlStart ?s1 ;
  gnt:qtlStop ?e1 .
  {
    SELECT * WHERE {
      ?tid a gnt:mappedTrait ;
      gnt:traitId "10079" ;
      gnt:traitId ?t2 .
      ?qtl2 gnt:mappedQTL ?tid ;
      gnt:qtlChr ?chr ;
      gnt:qtlStart ?s2 ;
      gnt:qtlStop ?e2 .
    }
  }
  FILTER (?t = "10002") .
} LIMIT 10

"10002",171.172,183.154,"10079",172.235,172.235
"10002",72.2551,73.3771,"10079",172.235,172.235

Note we pivot on two traits and one chromosome, so we find all pairs. To say if a QTL is *new* or different we can add another FILTER

FILTER ((?s2 > ?s1 && ?e2 > ?e1) || (?s2 < ?s1 && ?e2 < ?e1)) .
"t","s1","e1","t2","s2","e2"
"10002",72.2551,73.3771,"10079",172.235,172.235

that says that this ?qtl2 does not overlap with ?qtl. I.e. here it is a new QTL!

This new insight means we should should store *all* QTL in RDF, including the single SNP ones, because it is easy to filter on them. Note that there may be a more elegant way to query traits pairwise. This is just the first thing that worked. It may need more tuning if there are more than two QTL on a chromosome. E.g. the comparison between 10002 and 10413 finds:

"t","s1","e1","t2","s2","e2"
"10002",72.2551,73.3771,"10413",32.3113,42.4624
"10002",171.172,183.154,"10413",171.04,171.041
"10002",171.172,183.154,"10413",32.3113,42.4624
"10002",72.2551,73.3771,"10413",171.04,171.041

I.e. it does find new QTL here and you still need to do a little set analysis. In words you should be able to "remove all overlapping QTL from a chromosome". Maybe we can filter the other way - select overlapping QTL and remove those from the result set.

BIND ((?s2 >= ?s1 && ?e2 <= ?e1) || (?s1 >= ?s2 && ?e1 <= ?e2) as ?overlap) .
"10002",171.172,183.154,"10079",172.235,172.235,1
"10002",72.2551,73.3771,"10079",172.235,172.235,0

now drop all ?t's that are overlapping. It appears to work with:

https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/show-qtls-two-traits.sparql

I'll need to test it on the pangenome set.

Listing QTL

To get all QTL from a run you can use something like

SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
  ?traitid a gnt:mappedTrait ;
  gnt:traitId ?t .
  MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
  # ?traitid gnt:run gn:test . # use if you want the new QTL
  ?qtl gnt:mappedQTL ?traitid ;
  gnt:qtlChr ?chr ;
  gnt:qtlLOD ?lod ;
  gnt:qtlStart ?s ;
  gnt:qtlStop ?e .
  ?qtl gnt:mappedSnp ?snp .
  FILTER (?t = "10002" && ?lod >= 5.0 ) .
} LIMIT 100

Note we filter on a trait name and LOD score.

For panQTL (gnt:run == gn:test) this results in

"t"	"lod"	"snps"	"chr"	"start"	"end"
"10002"	6.4	3	"15"	87.671663	98.028911
"10002"	6.4	12	"4"	56.498971	147.86044
"10002"	7	69	"1"	3.099543	192.718161
"10002"	7.5	2774	"8"	34.303454	116.023702
"10002"	6.2	7	"10"	82.334108	105.062097
"10002"	6.2	2	"3"	47.644513	82.451061
"10002"	6.2	1	"3"	130.145235	130.145235
"10002"	6	2	"10"	13.442071	13.442088
"10002"	6.2	9	"12"	21.707644	72.57041

For the traditional genotypes (gnt:run != gn:test)

"t"	"lod"	"snps"	"chr"	"start"	"end"
"10002"	5.3	91	"1"	171.172	183.154
"10002"	5.1	15	"1"	72.2551	73.3771

Listing SNPs

Now we have all QTLs in the DB, as well as underlying SNPs, one interesting question to ask is what SNPs are repeated across our traits. This, if you remember, is the key idea of reversed genetics. Of course, with our pangenome-derived genotypes, we now have thousands of SNPs per trait. Let's see if we can rank them by number of traits.

For our 1000 traits we map about 7.7M snps with a LOD>5

Using sparql from emacs

Note: if you are doing SPARQL quite a bit, I recommend using sparql-mode in emacs! It is easy, faster and you can use git :)

https://github.com/ljos/sparql-mode

M-x sparql-query-region [ENTER] http://sparql-test.genenetwork.org/sparql/ [ENTER]