Here we follow up on the work we did on precomputing PublishData, but now running against pangenome-derived genotypes. For the BXD we have 23M markers(!), of which 8M are *not* on the reference genome.
For the first traits (presented at CTC'25) gemma was run as:
echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
-p ${PHENO_FILE} \
-gk \
-o percentile_result > percentile.kinship.143.txt
echo "[$(date)] Kinship matrix calculation completed for PERCENTILE."
echo "[$(date)] Starting association analysis for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
-p ${PHENO_FILE} \
-k ./output/percentile_result.cXX.txt \
-lmm 4 \
-maf 0.05 \
-o percentile_association > percentile.assoc.143.txt
Note that LOCO (leave one chromosome out) was not used.
The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything into RAM, that is a bit large for my workstation, so I opted to use tux04 since no one is using it. I had to reboot the machine first because it is unreliable and had crashed.
There I rebuilt gemma and set up a first run:
tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$ /bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk
Without LOCO this took about 18 minutes (186% CPU) and 110Gb of RAM. We ought to work on that ;) Next:
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05
Running gemma on the current 23M BXD pangenome-derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). At that rate 13K traits would take some 43 months on a single machine. We'll need something better. As Rob writes:
The huge majority of variants will have r2 of 1 with hundreds or thousands of neighbors. This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions.
A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows:
chr  rs    ps  n_miss  allele1  allele0  af     beta  se    logl_H1  l_remle       l_mle         p_wald  p_lrt  p_score
-9   A1-0  -9  0       A        T        0.171  -nan  -nan  -nan     1.000000e+05  1.000000e+05  -nan    -nan   -nan
-9   A2-0  -9  0       A        T        0.170  -nan  -nan  -nan     1.000000e+05  1.000000e+05  -nan    -nan   -nan
Turns out I was using the wrong pheno file. Let's try again.
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05
As a check I can diff against the original output, and indeed I replicated the original run! It also ran faster, at 400% CPU in 35 minutes.
(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things).
gemma-wrapper has extra facilities, such as LOCO, caching and lmdb output. Last time we used it in a Guix container it looked like this:
#! /bin/env sh
export TMPDIR=./tmp
curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
jq ".[] | .Id" < bxd-publish.json > ids.txt
./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
for id in $(cat ids.txt) ; do
echo Precomputing $id
if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
fi
done
Let's try running the big stuff instead:
./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz -gk
gemma does not really track individuals: the order of the genotype columns simply has to match the order of the pheno file. In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following:
{
"type": "gn-geno-to-gemma",
"genofile": "BXD.geno",
"samples": [
"BXD1",
"BXD2",
"BXD5",
...
],
"numsamples": 237,
"header": [
"# File name: BXD_experimental_DGA_7_Dec_2021",
...
To get this:

cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e "s,$,\"\\,," > bxd_inds.list.txt

which fills bxd_inds.list.txt with

"BXD100",
"BXD101",
"BXD102",
...
Next I turned it into a JSON file by hand as 'bxd_inds.list.json'.
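That can also be scripted. Below is a minimal Ruby sketch that builds the JSON from the first column of 143samples.pc-list.tsv; the field names mirror the gn-geno-to-gemma example above, and whether gemma-wrapper needs all of them for this pangenome set is an assumption.

#!/usr/bin/env ruby
# Sketch (not the file I made by hand): build bxd_inds.list.json from
# the sample list, mirroring the gn-geno-to-gemma JSON shown above.
require 'json'

samples = File.readlines("143samples.pc-list.tsv").map do |line|
  line.chomp.split("\t").first.sub(/_.*/, "")   # strip everything after the first underscore
end

doc = {
  "type"       => "gn-geno-to-gemma",            # assumed; copied from the example above
  "genofile"   => "143samples.percentile.bimbam.bimbam.gz",
  "samples"    => samples,
  "numsamples" => samples.size
}

File.write("bxd_inds.list.json", JSON.pretty_generate(doc))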
With GEMMA the marker names are listed in the geno file. GEMMA can also use a SNP annotation file that gives the chromosome and location for each marker. Without that SNP file gemma-wrapper complains it needs the SNP/marker annotation file. This is logical, because for LOCO it needs to know what chromosome a marker is on.
The next step is to take the nodes file and extract all rows from the genotype file that match nodes with a chromosome defined. Andrea is going to deliver positions for all nodes, but for now we can use what we have. Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt:
mm10#1#chr3	23209565	93886997
mm10#1#chr3	23209564	93886999
mm10#1#chr3	23209563	93887016
...
In the genotype file we find, for example
A23209564-0, A, T, 1.919141867395325, 0.9306930597711228, 1.8201319833577734, 0.7607260422339468, 1.427392726736106, 1.2310230984252724, 1.6633662444541875, 0.6105610229068721, ...
A bit funny, but you get the idea. So we can take the mm10 file and write out the genotype file again for all matching nodes, together with a matching SNP file that for this node should contain:
A23209564-0 93886999 3
To rewrite the above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like:
#!/usr/bin/env ruby
# Turn matrix-pos lines (path tag, node, position) into the
# marker/position/chromosome lines that GEMMA's -a annotation expects
ARGF.each_line do |line|
  tag,name,pos = line.strip.split(/\t/)
  tag =~ /chr(.*)$/      # pull the chromosome out of e.g. mm10#1#chr3
  chrom = $1
  print "A#{name}-0\t#{pos}\t#{chrom}\n"
end
Now, another problem is that not all markers in the genotype file have a position (yet). As we can't display those we can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. That throwaway script looks like:
# Keep only those BIMBAM genotype lines whose marker has an entry in
# the SNP annotation file
bimbamfn = ARGV.shift
snpfn = ARGV.shift

snps = {}
open(snpfn).each_line do |snpl|
  name = snpl.split(/\t/)[0]
  snps[name] = 1
end

open(bimbamfn).each_line do |line|
  marker = line.split(/[,\s]/)[0]
  if snps[marker]
    print line
  end
end
It takes a while to run, but as this is a one-off that does not matter. Reducing the file leads to 13667900 markers with genotypes. The original SNP file has 14927024 lines. Hmmm. The overlap is therefore not perfect (we have more annotations than genotypes now). To check this I ran a diff:
cut -f 1 -d "," 143samples.percentile.bimbam.bimbam-reduced > 143samples.percentile.bimbam.bimbam-reduced-markers
sort 143samples.percentile.bimbam.bimbam-reduced-markers > markers-sorted.txt
diff --speed-large-files 143samples.percentile.bimbam.bimbam-reduced-markers markers-sorted.txt
< A80951-0
< A80952-0
< A80953-0
...
cut -f 1 snps.txt |sort > snps-col1-sorted.txt
diff --speed-large-files snps-col1-sorted.txt markers-sorted.txt
241773d228996
< A10314686-0
241777d228999
< A10314689-0
241781d229002
< A10314692-0
grep A10314686 snps-col1-sorted.txt markers-sorted.txt
snps-col1-sorted.txt:A10314686-0
snps-col1-sorted.txt:A10314686-0
markers-sorted.txt:A10314686-0
Ah, we have duplicate annotation lines in the SNP file.
grep A10314686-0 snps.txt
A10314686-0	20257882	8
A10314686-0	20384895	8
grep A10314692-0 snps.txt
A10314692-0	20257575	8
A10314692-0	20384588	8
So the same node is considered to be two SNPs. This is due to the node covering multiple individuals (paths). It turns out a chunk of them map to different chromosomes too. I think we ought to drop them until we have a better understanding of what they represent (they may be mismapping artifacts).
I updated the script. Now I see it skips A280000 because there is no marker annotation for that node. Good. The number of genotype markers also got further reduced, to 13209385. I checked the gemma code: the SNP annotation file should match the genotype file line for line. Unsurprising, perhaps, but it means I have to rewrite both files. After adapting the script we now have two files with the same number of lines.
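To make that concrete, here is a sketch of what the updated matching step can look like (an illustration using the file names that appear below, not the exact script I ran): it drops the duplicated nodes and writes the genotype and SNP annotation files in the same order, line for line.

#!/usr/bin/env ruby
# Sketch: write new-genotypes.txt and snps-matched.txt so they match
# line for line, skipping nodes that have no annotation or more than
# one annotation line (the duplicates discussed above).
# Usage (hypothetical script name):
#   ruby match-snps.rb 143samples.percentile.bimbam.bimbam-reduced snps.txt
bimbamfn, snpfn = ARGV

pos   = {}
count = Hash.new(0)
File.foreach(snpfn) do |line|
  name, location, chrom = line.chomp.split(/\t/)
  count[name] += 1
  pos[name] = [location, chrom]
end

geno = File.open("new-genotypes.txt", "w")
snps = File.open("snps-matched.txt", "w")
File.foreach(bimbamfn) do |line|
  marker = line.split(/[,\s]/)[0]
  next unless count[marker] == 1      # unannotated or duplicated node: drop
  location, chrom = pos[marker]
  geno.print line
  snps.print "#{marker}\t#{location}\t#{chrom}\n"
end
geno.close
snps.close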
Rerunning with the new files:
gemma -g new-genotypes.txt -p pheno_filtered_143.txt -gk
gemma -g new-genotypes.txt -p pheno_filtered_143.txt -k output/result.cXX.txt -maf 0.05 -lmm 4 -a snps-matched.txt
And, even though the output differs somewhat in size -- due to the different number of markers -- the results look very similar to what was produced before. Good!
Now that we have confirmation and all the pieces in place, we can run the same set with gemma-wrapper and LOCO.
The first 'challenge' is that gemma-wrapper computes hash values using a Ruby library, which is rather slow. This is something we also encounter in Guix.
/bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json
For this computation each gemma process maxed out at 80Gb RAM (640Gb total). We are really hitting limits here, and in the near future we need to check why so much data is retained. As we only have ~150 individuals it has to be a marker thing: 13M markers x 143 individuals x 8 bytes already comes to roughly 15Gb for the genotype matrix alone.
/bin/time -v ../bin/gemma-wrapper -v --json --lmdb --loco --input K.json -- -g new-genotypes.txt -p pheno_filtered_143.txt -a snps-matched.txt -debug -maf 0.05 -lmm 9 > GWA.json
This time gemma requires only 25Gb per chromosome, so we can run it in one go in RAM on this large server. Much of the time is spent in IO, so I think that when we start using mmap (lmdb) we can speed it up significantly. gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores.
Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one.
After some fixes we now have LOCO in lmdb output. The mdb file comes in at 693Mb, which would make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (above a 4.0 -log10(p) threshold) brought the size down to 17Mb, or some 215Gb total. That is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now we are down to 7Mb, which will total some 100Gb for 13K traits.
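For illustration, here is a minimal Ruby sketch of that kind of filtering, assuming the 4.0 cut-off refers to -log10(p) and filtering on p_wald from a plain GEMMA association output file (the real pipeline filters while writing lmdb; this just shows the idea).

#!/usr/bin/env ruby
# Sketch: keep only markers with -log10(p_wald) >= 4.0 from a GEMMA
# association output, e.g. output/result.assoc.txt, passed on stdin
# or as an argument.
THRESHOLD = 4.0
pcol = nil

ARGF.each_line do |line|
  fields = line.split
  if pcol.nil?                        # header line
    pcol = fields.index("p_wald")
    puts line
    next
  end
  pval = fields[pcol].to_f
  next if pval <= 0.0                 # skip nan / failed markers
  puts line if -Math.log10(pval) >= THRESHOLD
end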