Edit this page | Blame

Large datasets crash gemma

Running GEMMA on the NSNIH dataset in Genenetwork sends the server in a tail spin and logs `BUG: soft lockup CPU stuck` messages. This obviously is not great and appears to be a side effect of running openblas aggressively in parallel (I remember seeing some evidence of that, but I can no longer find that message). Or it may be GEMMA simply runs out of RAM and the kernel is busy cleaning up using the OOM reaper. See

https://lkml.iu.edu/hypermail/linux/kernel/2003.2/01012.html

Tasks

[ ] tux02: test out-of-band-access
[ ] tux02: test GEMMA
[ ] tux02: set overcommit memory on tux02 to 2 (see below)
[ ] tux02: reboot and reinstate services on tux02
[ ] tux02: test GEMMA
[ ] tux02: try and optimize versions of openblas using -O2
[ ] tux02: deploy GEMMA latest

And do the same on tux01 (production)

Notes

A 'soft lockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds without giving other tasks a chance to run. The watchdog daemon will send an non-maskable interrupt (NMI) to all CPUs in the system who, in turn, print the stack traces of their currently running tasks.

After a gemma lockup we see

[2512382.403215] watchdog: BUG: soft lockup - CPU#118 stuck for 22s! [migration/118:609]
[2512404.477219] Out of memory: Kill process 1723 (gemma) score 87 or sacrifice child
[2512404.569158] Killed process 1723 (gemma) total-vm:44620288kB, anon-rss:25261688kB, file-rss:0kB, shmem-rss:0kB
[2512405.788221] oom_reaper: reaped process 1723 (gemma), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

It is clear parallel GEMMA is running out of RAM. We can make softlocks messages relax by setting `/proc/sys/kernel/watchdog_thresh` higher. Consider the message harmless.

Overcommit is set to 0 on Tux01. We may want to change that to

vm.overcommit_memory=2
vm.overcommit_ratio=90

That will make out-of-RAM problems less impactful. We have been running penguin2 like this for over a year with no more OOM problems. I have not set that before on tux01 because it requires rebooting the production server.

From Zach I got the K and GWA commands:

/usr/local/guix-profiles/gn-latest-20220122/bin/gemma-wrapper --json --loco 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,X -- -debug -g /home/zas1024/gn2-zach/genotype_files/genotype/bimbam/HSNIH-Palmer_true_geno.txt -p /home/zas1024/gn2-zach/tmp/gn2/gn2/PHENO_2+FcfQiTVSC7FmmbsatUPg.txt -a /home/zas1024/gn2-zach/genotype_files/genotype/bimbam/HSNIH-Palmer_true_snps.txt -gk > /home/zas1024/gn2-zach/tmp/gn2/gn2/HSNIH-Palmer_K_TPTFHJ.json

/usr/local/guix-profiles/gn-latest-20220122/bin/gemma-wrapper --json --loco --input /home/zas1024/gn2-zach/tmp/gn2/gn2/HSNIH-Palmer_K_TPTFHJ.json -- -debug -g /home/zas1024/gn2-zach/genotype_files/genotype/bimbam/HSNIH-Palmer_true_geno.txt -p /home/zas1024/gn2-zach/tmp/gn2/gn2/PHENO_2+FcfQiTVSC7FmmbsatUPg.txt -a /home/zas1024/gn2-zach/genotype_files/genotype/bimbam/HSNIH-Palmer_true_snps.txt -lmm 9 -maf 0.05 > /home/zas1024/gn2-zach/tmp/gn2/gn2/HSNIH-Palmer_GWA_MWKKYW.json

The geno file is massive:

3.7G Mar 12 11:56 HSNIH-Palmer_true_geno.txt
 24K Mar 12 11:56 PHENO_2+FcfQiTVSC7FmmbsatUPg.txt
3.4M Mar 12 11:56 HSNIH-Palmer_true_snps.txt

Probably best to test on a different machine! Let's move to tux02. Running luna (a half year old version of GN2) gives `**** FAILED: number of columns in the kinship file does not match the number of individuals for row = 79`. So, that does not help! I think this is a known issue that got fixed later. Next up, try and run gemma by hand for (largest) chromosome 1 after installing gemma tools with a recent GNU Guix:

tux02:~/tmp/gemma-cpu-lockup$ /usr/bin/time -v gemma -loco 1 -gk -g HSNIH-Palmer_true_geno.txt -p PHENO_2+FcfQiTVSC7FmmbsatUPg.txt -a HSNIH-Palmer_true_snps.txt

GEMMA 0.98.5-pre1 (2021-08-14) by Xiang Zhou, Pjotr Prins and team (C) 2012-2021
Reading Files ...
## number of total individuals = 6147
## number of analyzed individuals = 1306
## number of covariates = 1
## number of phenotypes = 1
## leave one chromosome out (LOCO) =        1
## number of total SNPs/var        =   134918
## number of SNPS for K            =   121094
## number of SNPS for GWAS         =    13824
## number of analyzed SNPs         =   132628
Calculating Relatedness Matrix ...
**** INFO: Done.
        Command being timed: "gemma -loco 1 -gk -g HSNIH-Palmer_true_geno.txt -p PHENO_2+FcfQiTVSC7FmmbsatUPg.txt -a HSNIH-Palmer_true_snps.txt"
        User time (seconds): 3096.83
        System time (seconds): 1435.51
        Percent of CPU this job got: 1781%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:14.36
        Maximum resident set size (kbytes): 26851444

That is 4 minutes wall clock time using about 26GB of RAM. The machine has 256 GB RAM, so you can see where the problem is. Gemma-wrapper should handle this gracefully, but because of overcommit the machine goes in a tail-spin.

Next run lmm with

/usr/bin/time -v gemma -loco 1 -lmm 9 -g HSNIH-Palmer_true_geno.txt -p PHENO_2+FcfQiTVSC7FmmbsatUPg.txt -a HSNIH-Palmer_true_snps.txt -k output/result.cXX.txt

GEMMA 0.98.5-pre1 (2021-08-14) by Xiang Zhou, Pjotr Prins and team (C) 2012-2021
Reading Files ...
## number of total individuals = 6147
## number of analyzed individuals = 1306
## number of covariates = 1
## number of phenotypes = 1
## leave one chromosome out (LOCO) =        1
## number of total SNPs/var        =   134918
## number of SNPS for K            =   121094
## number of SNPS for GWAS         =    13824
## number of analyzed SNPs         =   132628
Start Eigen-Decomposition...
pve estimate =0.161934
se(pve) =0.0355033
================================================== 100%
**** INFO: Done.
        Command being timed: "gemma -loco 1 -lmm 9 -g HSNIH-Palmer_true_geno.txt -p PHENO_2+FcfQiTVSC7FmmbsatUPg.txt -a HSNIH-Palmer_true_snps.txt -k output/result.cXX.txt"
        User time (seconds): 225.04
        System time (seconds): 110.63
        Percent of CPU this job got: 214%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:36.55
        Maximum resident set size (kbytes): 523876

Running the LMM takes only 2.5 minutes and uses less than 1GB RAM.

Now the goal is to try and crash the server before setting overcommit.

Large datasets crash gemma

Tags

Tasks

Notes