Here we follow up on the work we did on precomputing PublishData:
But now we run against pangenome-derived genotypes. For the BXD we have 23M markers(!), of which 8M are *not* on the reference genome.
To get the mapping and generate the assoc output in mdb format we run a variant of gemma-wrapper.
The workflow essentially is:

gemma-batch-run.sh

For mapping, Virtuoso contains four important TTL files:
Next we convert that output to RDF with
../bin/gemma-mdb-to-rdf.rb --header > output.ttl
time ../bin/gemma-mdb-to-rdf.rb --anno snps-matched.txt.mdb tmp/panlmm/*-gemma-GWA.tar.xz >> output.ttl # two hours for 7000 traits
time serdi -i turtle -o ntriples output.ttl > output.n3

Note that n3 files are less error prone and serdi handles huge files better than rapper. Copy the file to the Virtuoso instance and load it with isql (it may be worth search-replacing the gnt:run tag to something more descriptive first).
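A throwaway Ruby filter for that search-replace could look like this (the target name gn:pan1 is made up for illustration):

#!/usr/bin/env ruby
# Rewrite the generic gnt:run tag to a descriptive run name before loading.
# The target gn:pan1 is a hypothetical example.
ARGF.each_line do |line|
  print line.gsub("gnt:run gn:test", "gnt:run gn:pan1")
end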
cd /export/guix-containers/virtuoso/data/virtuoso/ttl/
guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
SQL> ld_dir('/export/data/virtuoso/ttl','test-run-3000.n3','http://pan-test.genenetwork.org');
Done. -- 3 msec.
# to test validity and optionally delete problematic entries:
SQL> SELECT * FROM DB.DBA.load_list;
SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
SQL> DELETE from DB.DBA.LOAD_LIST where LL_STATE = 1;
# commit changes
SQL> rdf_loader_run (); // about 1 min per GB n3
SQL> checkpoint;
Done. -- 16 msec.
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
34200686
Note it may be a good idea to drop graphs first. That is why we have separate subgraph spaces for every large TTL file:
log_enable(3,1);
SQL> SPARQL CLEAR GRAPH <http://pan-test.genenetwork.org>;
SQL> SPARQL CLEAR GRAPH <http://pan-mapped.genenetwork.org>; // 10 min
SQL> SPARQL CLEAR GRAPH <http://pangenome-marker.genenetwork.org>;
SQL> ld_dir('/export/data/virtuoso/ttl','pangenome-markers.n3','http://pangenome-marker.genenetwork.org');
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
For pangenomes we have a marker file and a QTL file.
As a test, fetch a table of the traits with their SNPs
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pangenome-mapped.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus ;
gnt:lodScore ?lod ;
gnt:af ?af .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
FILTER (contains(?nodeid,"Marker") && ?pos < 1000)
} LIMIT 100
OK, we are ready to run a little workflow. First create a sorted list of IDs.
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT DISTINCT ?trait FROM <http://pangenome-mapped.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId ?trait.
}
See also
Sort that list and save as 'pan-ids-sorted.txt'. Next run
../../bin/workflow/qtl-detect-batch-run.sh
and load those into Virtuoso. To list the new QTL:
SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId ?t .
MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
# ?traitid gnt:run gn:test . # use if you want the new QTL
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlLOD ?lod ;
gnt:qtlStart ?s ;
gnt:qtlStop ?e .
?qtl gnt:mappedSnp ?snp .
FILTER (?t = "10002" && ?lod >= 5.0 ) .
} LIMIT 100
For the first traits (presented at CTC'25) gemma was run as:
echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
-p ${PHENO_FILE} \
-gk \
-o percentile_result > percentile.kinship.143.txt
echo "[$(date)] Kinship matrix calculation completed for PERCENTILE."
echo "[$(date)] Starting association analysis for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
-p ${PHENO_FILE} \
-k ./output/percentile_result.cXX.txt \
-lmm 4 \
-maf 0.05 \
-o percentile_association > percentile.assoc.143.txt
Note no LOCO.
The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything in RAM, it is a bit large for my workstation. I opted to use tux04 since no one is using it. I had to reboot the machine first because it is unreliable and had crashed.
There I rebuilt gemma and set up a first run:
tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$ /bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk
Without LOCO this took about 18 minutes (186% CPU) and 110Gb of RAM. We ought to work on this ;) Next:
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05
Running gemma on the current 23M BXD pangenome-derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). 13K traits would take 43 months on a single machine. We'll need something better. As Rob writes:
The huge majority of variants will have r2 of 1 with hundreds or thousands of neighbors. This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions.
A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows:
chr rs   ps n_miss allele1 allele0 af    beta se   logl_H1 l_remle      l_mle        p_wald p_lrt p_score
-9  A1-0 -9 0      A       T       0.171 -nan -nan -nan    1.000000e+05 1.000000e+05 -nan   -nan  -nan
-9  A2-0 -9 0      A       T       0.170 -nan -nan -nan    1.000000e+05 1.000000e+05 -nan   -nan  -nan
Turns out I was using the wrong pheno file. Let's try again.
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05
As a check I can diff against the original output. So, I replicated the original run! It also ran faster at 400% CPU in 35 minutes.
(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things).
gemma-wrapper has extra facilities, such as LOCO, caching and lmdb output. Last time we used it in a Guix container it looked like:
#!/usr/bin/env sh
export TMPDIR=./tmp
curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
jq ".[] | .Id" < bxd-publish.json > ids.txt
./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
for id in $(cat ids.txt) ; do
echo Precomputing $id
if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
fi
done
Let's try running the big stuff instead:
./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz -gk
gemma does not really track individuals. The order of the genotype columns should simply match the pheno file. In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following:
{
"type": "gn-geno-to-gemma",
"genofile": "BXD.geno",
"samples": [
"BXD1",
"BXD2",
"BXD5",
...
],
"numsamples": 237,
"header": [
"# File name: BXD_experimental_DGA_7_Dec_2021",
...
To get this:

cut -f 1 143samples.pc-list.tsv | sed -e s,_.\*,, | sed -e s,^,\", | sed -e "s,$,\"\\,," > bxd_inds.list.txt
"BXD100",
"BXD101",
"BXD102",
Next I turned it into a JSON file by hand as 'bxd_inds.list.json'.
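Instead of hand editing, a small sketch like the following could generate that JSON (the field names mirror the gn-geno-to-gemma example above; whatever gemma-wrapper actually requires may differ):

#!/usr/bin/env ruby
# Generate bxd_inds.list.json from the sample TSV.
require 'json'

samples = File.readlines("143samples.pc-list.tsv").map do |l|
  l.split("\t")[0].sub(/_.*/, "")   # strip everything after the underscore
end.uniq
puts JSON.pretty_generate({ "type" => "gn-geno-to-gemma",
                            "samples" => samples,
                            "numsamples" => samples.size })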
With GEMMA, marker names are listed in the geno file. GEMMA can also use a SNP file that gives the chromosome and location. Without the SNP file, gemma-wrapper complains it needs the SNP/marker annotation file. This is logical, because for LOCO it needs to know what chromosome a marker is on.
The next step is to take the nodes file and extract all rows from the genotype file that match nodes with chromosomes defined. Andrea is going to deliver all positions for all nodes, but for now we can use what we have. Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt:
mm10#1#chr3 23209565 93886997
mm10#1#chr3 23209564 93886999
mm10#1#chr3 23209563 93887016
...

In the genotype file we find, for example:
A23209564-0, A, T, 1.919141867395325, 0.9306930597711228, 1.8201319833577734, 0.7607260422339468, 1.427392726736106, 1.2310230984252724, 1.6633662444541875, 0.6105610229068721, ...
A bit funny, but you get the idea. So we can take the mm10 file and write out the genotype file again for all matching nodes, together with a matching SNP file that, for this node, should contain:
A23209564-0 93886999 3
To rewrite the above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like:
#!/usr/bin/env ruby
# Turn the matrix-pos file into a GEMMA SNP annotation file
# (marker name, position, chromosome).
ARGF.each_line do |line|
  tag,name,pos = line.strip.split(/\t/)
  tag =~ /chr(.*)$/          # pull the chromosome out of e.g. mm10#1#chr3
  chrom = $1
  print "A#{name}-0\t#{pos}\t#{chrom}\n"
end
Now, another problem is that not all markers in the genotype file have a position (yet). As we can't display them, I can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. The throwaway script looks like:
#!/usr/bin/env ruby
# Keep only BIMBAM genotype rows whose marker appears in the SNP file.
bimbamfn = ARGV.shift
snpfn = ARGV.shift
snps = {}
open(snpfn).each_line do |snpl|
  name = snpl.split(/\t/)[0]
  snps[name] = 1
end
open(bimbamfn).each_line do |line|
  marker = line.split(/[,\s]/)[0]
  if snps[marker]
    print line
  end
end
It takes a while to run, but as this is a one-off that does not matter. Reducing the file leads to 13667900 markers with genotypes. The original SNP file has 14927024 lines. Hmmm. The overlap is therefore not perfect (we have more annotations than genotypes now). To check this I'll run a diff:

cut -f 1 -d "," 143samples.percentile.bimbam.bimbam-reduced > 143samples.percentile.bimbam.bimbam-reduced-markers
sort 143samples.percentile.bimbam.bimbam-reduced-markers > markers-sorted.txt
diff --speed-large-files 143samples.percentile.bimbam.bimbam-reduced-markers markers-sorted.txt
< A80951-0
< A80952-0
< A80953-0
...
cut -f 1 snps.txt | sort > snps-col1-sorted.txt
diff --speed-large-files snps-col1-sorted.txt markers-sorted.txt
241773d228996
< A10314686-0
241777d228999
< A10314689-0
241781d229002
< A10314692-0
grep A10314686 snps-col1-sorted.txt markers-sorted.txt
snps-col1-sorted.txt:A10314686-0
snps-col1-sorted.txt:A10314686-0
markers-sorted.txt:A10314686-0
Ah, we have duplicate annotation lines in the SNP file.
grep A10314686-0 snps.txt
A10314686-0 20257882 8
A10314686-0 20384895 8
grep A10314692-0 snps.txt
A10314692-0 20257575 8
A10314692-0 20384588 8
So, the same node is considered two SNPs. This is due to the node covering multiple individuals (paths). It turns out a chunk of them map to different chromosomes too. I think we ought to drop them until we have a better understanding of what they represent (they may be mismapping artifacts).
I updated the script. Now I see it skips A280000 because there is no marker annotation for that node. Good. Also, the number of genotype markers got further reduced to 13209385. I checked the gemma code and the SNP annotation file should match the genotype file line for line. Unsurprising, perhaps, but now I need to rewrite both files. After adapting the script we have two files with the same number of lines.
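A sketch of what such a lock-step rewrite can look like (not the actual script; file names are assumed from the surrounding commands):

#!/usr/bin/env ruby
# Write genotype and SNP annotation files that match line for line,
# dropping unannotated nodes and nodes with duplicate annotations.
snps = {}
File.foreach("snps.txt") do |l|
  name = l.split("\t")[0]
  snps[name] = snps.key?(name) ? :dup : l
end
File.open("new-genotypes.txt", "w") do |geno_out|
  File.open("snps-matched.txt", "w") do |snp_out|
    File.foreach("143samples.percentile.bimbam.bimbam-reduced") do |line|
      anno = snps[line.split(/[,\s]/)[0]]
      next if anno.nil? || anno == :dup   # unannotated or duplicated node
      geno_out.print line
      snp_out.print anno
    end
  end
end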
Rerunning with the new files:
gemma -g new-genotypes.txt -p pheno_filtered_143.txt -gk
gemma -g new-genotypes.txt -p pheno_filtered_143.txt -k output/result.cXX.txt -maf 0.05 -lmm 4 -a snps-matched.txt
And, even though the output differs somewhat in size -- due to the different number of markers -- the results look very similar to what was produced before. Good!
Now that we have confirmation and all the pieces, we can run the same set with gemma-wrapper and LOCO.
The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix. I replaced that by using our pfff hashing for larger files.
/bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json
For this computation each gemma maxed out at 80Gb RAM (total 640Gb). We are really hitting limits here. In the near future we need to check why so much data is retained. As we only have 150 individuals, it must be a marker thing.
/bin/time -v ../bin/gemma-wrapper -v --json --lmdb --loco --input K.json -- -g new-genotypes.txt -p pheno_filtered_143.txt -a snps-matched.txt -debug -maf 0.05 -lmm 9 > GWA.json
This time gemma requires only 25Gb per chromosome, so we can run it in one go in RAM on this large server. Much of the time is spent in IO, so I think that when we start using mmap (lmdb) we can speed it up significantly. gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores.
Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one.
After some fixes we now have LOCO in lmdb output. The mdb file comes in at 693Mb. That would make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (LOD > 4.0) brought the size down to 17Mb. That makes 215Gb total, which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now down to 7Mb, and 3.2Mb compressed. That'll total less than 100Gb for 13K traits. Good.
Now that gemma-wrapper works (and the test results are confirmed), we have to wire it up to fetch traits from the DB. We also have to make sure the trait values align with the individuals in the genotype file. Earlier I was running the script gemma-batch-run.sh,
which looks like:
export TMPDIR=./tmp
curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
jq ".[] | .Id" < bxd-publish.json > ids.txt
# ---- Compute GRM
./bin/gemma-wrapper --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
# ---- For all entries run LMM
for id in $(cat ids.txt) ; do
echo Precomputing $id
if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
fi
done
We already have ids.txt and the GRM. What is still required are the trait values from the DB. We need to run gn-guile somewhere with access to the DB. Also, I need to make sure the current gemma-wrapper tar-balls up the result.
OK, we are running. Looks like the smaller datasets only use 11Gb RES RAM per chromosome. Which means we can run two computes in parallel on this machine.
The first run came through! I forgot the --reduce flag, so it came out at 190Mb. I'll fix that. 34 individuals ran in 7 minutes. We are currently running at a trait per 6 minutes. We can double that on this machine.
The following puzzles me a bit
## number of analyzed individuals = 31
## number of covariates = 1
## number of phenotypes = 1
## leave one chromosome out (LOCO) = 14
## number of total SNPs/var = 13209385
## number of SNPS for K = 12322657
## number of SNPS for GWAS = 886728
## number of analyzed SNPs = 13122153

Why is the number of SNPs for GWAS so low? Perhaps a 10% threshold for maf is a bit stringent. See below.
Anyway, we are running traits and we'll use the first 500 for analysis.
Meanwhile I'll look at deploying on octopus and maybe speeding up GEMMA. See
GEMMA has a MAF filter. For every SNP a maf is computed by adding the geno value:
maf += geno
When all genotype values are added up, the sum is divided by 2x the number of individuals (minus missing):
maf /= 2.0 * (double)(ni_test - n_miss);
and this is held against the maf passed on the command line. The 2.0 therefore assumes all values are between 0 and 2.
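As a toy illustration in Ruby (the keep-condition on both tails is my reading of the GEMMA source, so treat it as an assumption):

# One marker, five individuals, genotypes coded 0..2, none missing.
genos = [1.92, 0.93, 1.82, 0.76, 1.43]
maf = genos.sum / (2.0 * genos.size)               # 6.86 / 10.0 = 0.686
maf_level = 0.05                                   # the -maf option
keep = maf >= maf_level && maf <= 1.0 - maf_level  # => true, marker is kept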
Actually, I now realise we are using LOCO. So the number of SNPs for GWAS is just those on one chromosome. That makes sense! Still, we have to be careful about the MAF range. In our genotype file the values are between 0 and 2, so that is fine in itself.
Next step is to generate RDF. The SNP annotation was slow, so I moved that to lmdb. Parsing 400 traits now takes 3 minutes. The RDF file is under 1Gb and the SNP annotation RDF is 330Mb. Not too bad!
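For flavour, a lookup against the lmdb-backed annotation in Ruby might look like this (using the lmdb gem; the database file name and value layout here are assumptions):

require 'lmdb'

env = LMDB.new("snp-annotation.mdb")   # hypothetical path
db  = env.database
puts db["A8828461-0"]                  # e.g. "1\t3304440" (assumed layout)
env.close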
guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-snps-400.n3','http://pan-test.genenetwork.org');
Done. -- 3 msec.
# to test validity and optionally delete problematic entries:
SQL> SELECT * FROM DB.DBA.load_list;
SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
# commit changes
SQL> rdf_loader_run ();
SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-400.n3','http://pan-test.genenetwork.org');
SQL> rdf_loader_run ();
SQL> checkpoint;
Done. -- 16 msec.
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
34200686
Or in the web interface:
SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o }
The RDF is formed as:
gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedTrait;
rdfs:label "GEMMA BXDPublish trait 10383 mapped with LOCO (defaults)";
gnt:trait gn:publishXRef_10383;
gnt:loco true;
gnt:run gn:test;
gnt:time "2025/11/10 08:12";
gnt:belongsToGroup gn:setBxd;
gnt:name "BXDPublish";
gnt:traitId "10383";
gnt:nind 14;
gnt:mean 18.0;
gnt:std 10.9479;
gnt:skew 0.3926;
gnt:kurtosis -1.1801;
skos:altLabel "BXD_10383";
gnt:filename "0233fa0cf277ee7d749de08b32f97c8be6478639-BXDPublish-10383-gemma-GWA.tar.xz";
gnt:hostname "napoli";
gnt:user "wrk".
gn:A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedLocus;
gnt:mappedSnp gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639;
gnt:locus gn:A8828461_0;
gnt:lodScore 4.8;
gnt:af 0.536;
gnt:effect -32.859.
and SNPs are annotated as
gn:A8828461_0 a gnt:marker;
rdfs:label "A8828461-0";
gnt:chr "1";
gnt:pos 3304440.
gn:A8828464_0 a gnt:marker;
rdfs:label "A8828464-0";
gnt:chr "1";
gnt:pos 3304500.
To get all tested traits you can list:
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?trait a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId ?traitid ;
gnt:kurtosis ?kurtosis .
} limit 100
To get all SNPs for a trait, e.g. "10381":
SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId "10381" .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
}
This lists:
| http://genenetwork.org/id/A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 | "A8828461-0" | "1" | 3304440 |
Next step is annotating the QTL in RDF. Earlier I wrote a script rdf-analyse-gemma-hits. It uses rapper to read two RDF files (two runs) and annotates the QTL and differences between the files. The code is not pretty:
The supporting library is a bit better:
Basically we have a QTL locus (QLocus) that tracks chr, pos, af and lod for each hit. A QRange is a set of QLocus objects which also tracks some stats: chr, min, max, snps, max_af, lod. It can compute whether two QTL (QRanges) overlap. Next we have a container that tracks the QTL (QRanges) on a chromosome.
Finally there is a diff function that can show the differences on a chromosome (QRanges) for two mapped traits.
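A minimal sketch of those shapes in Ruby (not the actual library code):

QLocus = Struct.new(:name, :chr, :pos, :af, :lod)

class QRange
  attr_reader :chr, :loci
  def initialize(chr)
    @chr = chr
    @loci = []          # the QLocus hits in this range
  end
  def add(locus)
    @loci << locus
  end
  def min; @loci.map(&:pos).min; end
  def max; @loci.map(&:pos).max; end
  def overlap?(other)   # two QTL overlap when their ranges intersect
    chr == other.chr && min <= other.max && other.min <= max
  end
end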
Maybe the naming could be a bit better, but the code is clear as it stands. One thing to note is that we use a fixed distance MAX_SNP_DISTANCE_BPS of 50M to decide whether a SNP falls in the same QTL. It would be worth trying to base this on dropping LOD scores (1.5 from the top) instead, as Rob and Flavia pointed out.
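A sketch of that alternative, bounding a QTL by a 1.5 LOD drop from the peak instead of a fixed window (hits is assumed to be an array of [pos, lod] pairs for one chromosome, sorted by position):

def lod_drop_interval(hits, drop = 1.5)
  peak = hits.each_index.max_by { |i| hits[i][1] }   # index of the top LOD
  thresh = hits[peak][1] - drop
  left = peak
  left -= 1 while left > 0 && hits[left - 1][1] >= thresh
  right = peak
  right += 1 while right < hits.size - 1 && hits[right + 1][1] >= thresh
  [hits[left][0], hits[right][0]]                    # QTL boundaries
end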
So, the library is fine, but the calling program is not great. The reason is that I parse RDF directly, teasing apart the logic we do in the SPARQL above. I track state in dictionaries (hashes of hashes) and the result ends up convoluted. It also keeps a lot of state in RAM. I chose direct RDF parsing because it makes for easier development. The downside is that I need to parse the whole file to make sure I have everything related to a trait. But fetching SNP results from SPARQL directly is slow too. I am in a bind.
Using curl:
time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId ?trait ; gnt:kurtosis ?k . }
" > test.out
time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId \"10001\" ; gnt:kurtosis ?k . ?snp gnt:mappedSnp ?traitid ; gnt:locus ?locus . }
" > test.out
real 0m1.612s
user 0m0.020s
sys 0m0.000s
Getting the trait info for 400 traits takes a second, so that is no big deal. Getting the 6K SNPs for one trait also takes a second. Hmmm. That takes hours, compared to minutes for direct RDF parsing. Before lmdb comes to the rescue we should try running it on the Virtuoso server itself. With curl there we get 0.5s, which makes for two hours for 13K traits. But when we run the query using isql it runs in 70ms, which totals 15 minutes. That is perfectly fine for running the whole set!
One way is to simply script isql from the command line. Meanwhile, it also turns out the ODBC interface can be used from Python or Ruby. Here is an example in R:
Not sure if that is fast enough, but perhaps worth trying.
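Scripting isql from Ruby is also straightforward, e.g. by piping queries to it (a sketch; the port comes from the session above):

# Run a SPARQL query through isql and capture the output.
query = 'SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o };'
out = IO.popen(["isql", "-S", "8891"], "r+") do |io|
  io.puts query
  io.puts "exit;"
  io.close_write
  io.read
end
puts out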
So, now we have a way to query the data around a trait in seconds. This means I can rewrite the QTL generator to go by trait. This also allows for a quick turnaround during development (good!). Also I want two scripts: one for computing the QTL and one for annotating the differences.
Alright. The first script should simply fetch a trait with its markers from SPARQL and score the QTL (as RDF output). The new script is at
First, the query for one trait looks like:
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT ?lod ?af ?nodeid ?chr ?pos FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId "10002" .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus ;
gnt:lodScore ?lod ;
gnt:af ?af .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
} ORDER BY DESC(?lod)
rendering some 22K markers for trait 10002 as a TSV:
"lod" "af" "nodeid" "chr" "pos" 7.5 0.547 "A13459298-0" "8" 98658490 7.1 0.154 "A13402313-0" "8" 96798487 7 0.432 "A13446492-0" "8" 97355019 7 0.263 "A13387873-0" "8" 94934820 7 0.585 "A4794343-0" "1" 172265488 ...
Earlier with precompute for trait 10002 we got:
[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
So the hits are in range, but the LOD may be inflated because of the number of markers. Anyway, at this point we are merely concerned with scoring QTL. The first script is simply:
require 'csv'

fn = ARGV.shift   # the TSV rendered from the SPARQL query above
qtls = QTL::QRanges.new("10002","test")
CSV.foreach(fn, headers: true, col_sep: "\t") do |hit|
  qlocus = QTL::QLocus.new(hit["nodeid"], hit["chr"], hit["pos"].to_i,
                           hit["af"].to_f, hit["lod"].to_f)
  qtls.add_locus(qlocus)
end
print qtls
and prints a long list of QTL containing a single hit:
[10002,test] =>{"1"=>[#<QRange Chr1 𝚺1 3099543..3099543 AF=0.583 LOD=5.8..5.8>, #<QRange Chr1 𝚺1 65908328..65908328 AF=0.627 LOD=5.7..5.7>, #<QRange Chr1 𝚺1 81604902..81604902 AF=0.451 LOD=5.5..5.5>, #<QRange Chr1 𝚺2 85087169..85087177 AF=0.781 LOD=5.5..5.6>, #<QRange Chr1 𝚺1 93740525..93740525 AF=0.762 LOD=6.5..6.5>, #<QRange Chr1 𝚺1 114086053..114086053 AF=0.568 LOD=5.7..5.7>,...
For trait 10002, after tweaking thresholds and rebinning, we get:

#<QRange Chr8 𝚺2 34.303454..35.675301 AF=0.571 LOD=5.7..5.8>
#<QRange Chr8 𝚺621 91.752748..102.722635 AF=0.663 LOD=5.6..7.5>
#<QRange Chr1 𝚺16 65.908328..175.232335 AF=0.781 LOD=5.6..7.0>
#<QRange Chr4 𝚺5 56.498971..126.135422 AF=0.657 LOD=5.6..6.4>
#<QRange Chr12 𝚺3 23.037869..58.306731 AF=0.643 LOD=5.8..6.2>
#<QRange Chr10 𝚺2 13.442071..13.442088 AF=0.641 LOD=5.8..6.0>
#<QRange Chr10 𝚺3 94.246536..103.438796 AF=0.608 LOD=5.9..6.2>
#<QRange Chr3 𝚺2 47.644513..82.451061 AF=0.548 LOD=5.7..6.2>
#<QRange Chr9 𝚺2 97.445077..120.263403 AF=0.717 LOD=5.8..5.8>
#<QRange Chr11 𝚺2 27.4058..56.30011 AF=0.559 LOD=5.7..5.7>
with a LOD>5.5 cut-off. That seems justified because LOD scores are inflated. Compare this with the earlier mapping using 'traditional' genotypes:
[10002,LOCO] =>{
"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>,
#<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>],
"8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
we can see the significance of chr8 has gone up with pangenome mapping (relative to chr1) and we find 2 QTL now on chr8, a new one to the left. Chr1 looks similar. We have some other candidates that may or may not be relevant (all narrow!).
Note this *is* a random trait(!) and it suggests the landscape of QTL will change pretty dramatically. Note also that Andrea will deliver new genotypes, with smoothing to follow. But it is encouraging.
I played a bit with the QTL output, and for now settled on tracking nodes that have a LOD>5.0. We drop QTL based on the following:
qtl.lod.max < 6.0 or (qtl.lod.max < 7.5 - qtl.snps.size/2)
I.e. a single-SNP QTL has to have a LOD of 7.0 and a 2-SNP QTL a LOD of 6.5. This yields:
[10002,test] =>{
"1"=>[#<QRange Chr1 𝚺69 3.099543..192.718161 AF=0.781 LOD=5.1..7.0>],
"4"=>[#<QRange Chr4 𝚺12 56.498971..147.86044 AF=0.676 LOD=5.1..6.4>],
"8"=>[#<QRange Chr8 𝚺2774 34.303454..116.023702 AF=0.899 LOD=5.1..7.5>],
"10"=>[#<QRange Chr10 𝚺7 82.334108..105.062097 AF=0.623 LOD=5.1..6.2>],
"12"=>[#<QRange Chr12 𝚺9 21.707644..72.57041 AF=0.77 LOD=5.1..6.2>]}
which are all worth considering (I think). Obviously we could annotate all QTL in RDF triples and filter on that using SPARQL. But this makes processing a bit faster without having to deal with too much noise. We can fine tune later.
Now two more steps to go:
SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId "10002" .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus ;
gnt:lodScore ?lod ;
gnt:af ?af .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
} ORDER BY DESC(?lod)
The first step is to fetch this data. Let's try SPARQL over the web first.
The previous code I wrote to compare QTLs essentially walks the QTLs and annotates a new QTL if there is no overlap between the two sets. Again, this code is too convoluted:
The principle is straightforward, however. The code for reading the SPARQL output for a trait is:
require 'csv'

trait_id = nil
CSV.foreach(fn, headers: true, col_sep: "\t") do |hit|
  trait_id = hit["traitid"] if not trait_id
  lod = hit["lod"].to_f
  if lod > 5.0 # set for pangenome input
    qlocus = QTL::QLocus.new(hit["snp"], hit["chr"], hit["pos"].to_f/10**6,
                             hit["af"].to_f, lod)
    qtls.add_locus(qlocus)
  end
end
So we can use SPARQL to build two sets on the fly and then run the diff.
Actually, thinking about this, I realised it should not be too hard to find the 'new' QTL in SPARQL directly.
SELECT * WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId "10002" .
}
http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d
http://genenetwork.org/id/HK_trait_BXDPublish_10002_gemma_GWA_hk_assoc_txt
http://genenetwork.org/id/GEMMAMapped_test_LOCO_BXDPublish_10002_gemma_GWA_82087f23
lists the three compute versions for this trait. To fetch all QTL for the first mapping:
SELECT ?qtl ?lod ?chr ?start ?stop (count(?snp) as ?snps) WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId "10002" .
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlStart ?start ;
gnt:qtlStop ?stop ;
gnt:qtlLOD ?lod .
?qtl gnt:mappedSnp ?snp .
}
gets 3 QTL. Now I did not store HK in RDF, but to show the filtering principle we can fetch two traits and compare QTL. The following gets two QTL from trait "10002" on CHR1 and holds that against that of trait "10079":
SELECT ?t ?s1 ?e1 ?t2 ?s2 ?e2 WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId ?t .
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlStart ?s1 ;
gnt:qtlStop ?e1 .
{
SELECT * WHERE {
?tid a gnt:mappedTrait ;
gnt:traitId "10079" ;
gnt:traitId ?t2 .
?qtl2 gnt:mappedQTL ?tid ;
gnt:qtlChr ?chr ;
gnt:qtlStart ?s2 ;
gnt:qtlStop ?e2 .
}
}
FILTER (?t = "10002") .
} LIMIT 10
"10002",171.172,183.154,"10079",172.235,172.235
"10002",72.2551,73.3771,"10079",172.235,172.235
Note we pivot on two traits and one chromosome, so we find all pairs. To say if a QTL is *new* or different we can add another FILTER
FILTER ((?s2 > ?s1 && ?e2 > ?e1) || (?s2 < ?s1 && ?e2 < ?e1)) .

"t","s1","e1","t2","s2","e2"
"10002",72.2551,73.3771,"10079",172.235,172.235
that says that this ?qtl2 does not overlap with ?qtl. I.e. here it is a new QTL!
This new insight means we should store *all* QTL in RDF, including the single-SNP ones, because it is easy to filter on them. Note that there may be a more elegant way to query traits pairwise; this is just the first thing that worked. It may need more tuning if there are more than two QTL on a chromosome. E.g. the comparison between 10002 and 10413 finds:

"t","s1","e1","t2","s2","e2"
"10002",72.2551,73.3771,"10413",32.3113,42.4624
"10002",171.172,183.154,"10413",171.04,171.041
"10002",171.172,183.154,"10413",32.3113,42.4624
"10002",72.2551,73.3771,"10413",171.04,171.041
I.e. it does find new QTL here, but you still need to do a little set analysis. In words: you should be able to "remove all overlapping QTL from a chromosome". Maybe we can filter the other way: select overlapping QTL and remove those from the result set.

BIND ((?s2 >= ?s1 && ?e2 <= ?e1) || (?s1 >= ?s2 && ?e1 <= ?e2) as ?overlap) .

"10002",171.172,183.154,"10079",172.235,172.235,1
"10002",72.2551,73.3771,"10079",172.235,172.235,0

Now drop all ?t's that are overlapping. It appears to work with:
I'll need to test it on the pangenome set.
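For reference, the same "remove all overlapping QTL" set analysis in plain Ruby (a sketch; rows are [trait, start, stop] tuples for one chromosome, using the 10002/10079 numbers from above):

def overlap?(s1, e1, s2, e2)
  s1 <= e2 && s2 <= e1   # intervals intersect
end

a = [["10002", 72.2551, 73.3771], ["10002", 171.172, 183.154]]
b = [["10079", 172.235, 172.235]]
new_qtl = b.reject do |_t2, s2, e2|
  a.any? { |_t1, s1, e1| overlap?(s1, e1, s2, e2) }
end
# => [] : the single 10079 QTL overlaps a 10002 QTL, so nothing is new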
To get all QTL from a run you can use something like:
SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId ?t .
MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
# ?traitid gnt:run gn:test . # use if you want the new QTL
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlLOD ?lod ;
gnt:qtlStart ?s ;
gnt:qtlStop ?e .
?qtl gnt:mappedSnp ?snp .
FILTER (?t = "10002" && ?lod >= 5.0 ) .
} LIMIT 100
Note we filter on a trait name and LOD score.
For panQTL (gnt:run == gn:test) this results in:

"t" "lod" "snps" "chr" "start" "end"
"10002" 6.4 3 "15" 87.671663 98.028911
"10002" 6.4 12 "4" 56.498971 147.86044
"10002" 7 69 "1" 3.099543 192.718161
"10002" 7.5 2774 "8" 34.303454 116.023702
"10002" 6.2 7 "10" 82.334108 105.062097
"10002" 6.2 2 "3" 47.644513 82.451061
"10002" 6.2 1 "3" 130.145235 130.145235
"10002" 6 2 "10" 13.442071 13.442088
"10002" 6.2 9 "12" 21.707644 72.57041
For the traditional genotypes (gnt:run != gn:test):

"t" "lod" "snps" "chr" "start" "end"
"10002" 5.3 91 "1" 171.172 183.154
"10002" 5.1 15 "1" 72.2551 73.3771
Now that we have all QTL in the DB, as well as the underlying SNPs, one interesting question to ask is which SNPs are repeated across our traits. This, if you remember, is the key idea of reverse genetics. Of course, with our pangenome-derived genotypes we now have thousands of SNPs per trait. Let's see if we can rank them by the number of traits.
For our 1000 traits we map about 7.7M SNPs with a LOD>5.
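A sketch for that ranking, aggregating per-trait TSV dumps like the one shown earlier (the tmp/trait-*.tsv glob is an assumption):

counts = Hash.new(0)
Dir.glob("tmp/trait-*.tsv").each do |fn|
  File.foreach(fn).drop(1).each do |line|   # skip the header line
    counts[line.split("\t")[2]] += 1        # column 3 holds the nodeid
  end
end
counts.sort_by { |_, n| -n }.first(20).each { |snp, n| puts "#{snp}\t#{n}" }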
Note: if you are doing SPARQL quite a bit, I recommend using sparql-mode in emacs! It is easy, faster and you can use git :)
M-x sparql-query-region [ENTER] http://sparql-test.genenetwork.org/sparql/ [ENTER]