Based on the QTL_Reaper_cal_lrs.py aka QTL_Reaper_v8_PublishXRef.py. This script simply updates PublishXRef table with a highest hit as computed by qtlreaper.
In a first attempt to update the database we are going to do just that using GEMMA.
For the new script we will pass in the genotype file as well as the phenotype file, so gemma-wrapper can process it. I wrote quite a few scripts already
So we can convert a .geno file to BIMBAM. I need to extract GN traits to a R/qtl2 or lmdb trait format file and use that as input.
For the last we should probably add a few columns. Initially we'll only store the maximum hit.
After
In GN2 this table is used in search, auth, and router. For search it is to look for trait hits (logically). For the router it is to fetch train info as well as dataset info.
In GN3 this table is used for partial correlations. Also to fetch API trait info and to build the search index.
In GN1 usage is similar.
We can use the script in gemma-wrapper
there is probably something similar in GN2. And I have another version somewhere.
To identify the geno file the reaper script uses
cursor.execute('select Id, Name from InbredSet') results = cursor.fetchall() InbredSets = {} for item in results: InbredSets[item[0]] = genotypeDir+str(item[1])+'.geno'
which assumes one single geno file for the BXD that is indexed by the InbredSetID (a number). Note it ignores the many genotype files we have per inbredset (today). Also there is a funny hardcoded
if InbredSetId==3: InbredSetId=1
(no comment).
Later we'll output to lmdb when GEMMA supports it.
There are about 100 InbredSets. Genotype files can be found on production in /export/guix-containers/genenetwork/var/genenetwork/genotype-files/genotype. For the BXD alone there are
BXD.2.geno BXD-Heart-Metals_old.geno BXD-Micturition.6.geno BXD.4.geno BXD-JAX-AD.4.geno BXD-Micturition.8.geno BXD.5.geno BXD-JAX-AD.8.geno BXD-Micturition.geno BXD.6.geno BXD-JAX-AD.geno BXD-Micturition_old.4.geno BXD.7.geno BXD-JAX-AD_old.geno BXD-Micturition_old.6.geno BXD.8.geno BXD-JAX-OFS.geno BXD-Micturition_old.geno BXD-AE.4.geno BXD-Longevity.4.geno BXD_mm8.geno BXD-AE.8.geno BXD-Longevity.8.geno BXD-NIA-AD.4.geno BXD-AE.geno BXD-Longevity.9.geno BXD-NIA-AD.8.geno BXD-AE_old.geno BXD-Longevity.array.geno BXD-NIA-AD.geno BXD-Bone.geno BXD-Longevity.classic.geno BXD-NIA-AD_old2.geno BXD-Bone_orig.geno BXD-Longevity.geno BXD-NIA-AD_old.geno BXD.geno BXD-Longevity_old.4.geno BXD_Nov_23_2010_before_polish_101_102_103.geno BXD-Harvested.geno BXD-Longevity_old.8.geno BXD_Nov_24_2010_before_polish_55_81.geno BXD-Heart-Metals.4.geno BXD-Longevity_old.geno BXD_old.geno BXD-Heart-Metals.8.geno BXD-MBD-UTHSC.geno BXD_unsure.geno BXD-Heart-Metals.geno BXD-Micturition.4.geno BXD_UT-SJ.geno
Not really reflected in the DB:
MariaDB [db_webqtl]> select Id, Name from InbredSet where name like '%BXD%'; +----+------------------+ | Id | Name | +----+------------------+ | 1 | BXD | | 58 | BXD-Bone | | 64 | BXD-Longevity | | 68 | BXD_Dev | | 76 | DOD-BXD-GWI | | 84 | BXD-Heart-Metals | | 86 | BXD-AE | | 91 | BXD-Micturition | | 92 | BXD-JAX-AD | | 93 | BXD-NIA-AD | | 94 | CCBXD-TM | | 96 | BXD-JAX-OFS | | 97 | BXD-MBD-UTHSC | +----+------------------+
Bit of a mess. Looks like some files are discarded. Let's see what the reaper script does.
We should also look into distributed storage. One option is webdav.
Let's see how the scripts do it. Note that we already did that for the probeset script in
The code is reflected in
Now I need to do the exact same thing, but for PublishData.
Let's connect to a remote GN DB:
ssh -L 3306:127.0.0.1:3306 -f -N tux02.genenetwork.org
and follow
the script takes a number of values 'PublishFreezeIds'. Alternatively it picks it up by SpeciesId (hard effing coded, of course).
Next it picks the geno file from the InbredSetID with
select InbredSetId from PublishFreeze where PublishFreeze.Id = 1;
Here we are initially going to focus on BXD=1 datasets only.
MariaDB [db_webqtl]> select Id,InbredSetId from PublishFreeze where InbredSetId = 1; +----+-------------+ | Id | InbredSetId | +----+-------------+ | 1 | 1 | +----+-------------+
(we are half way the script now). Next we capture some metadata
MariaDB [db_webqtl]> select PhenotypeId, Locus, DataId, Phenotype.Post_publication_description from PublishXRef, Phenotype where PublishXRef.PhenotypeId = Phenotype.Id and InbredSetId=1 limit 5; +-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ | PhenotypeId | Locus | DataId | Post_publication_description | +-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ | 4 | rs48756159 | 8967043 | Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg] | | 10 | rsm10000005699 | 8967044 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg] | | 15 | rsm10000013713 | 8967045 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] | | 20 | rs48756159 | 8967046 | Central nervous system, morphology: Cerebellum volume [mm3] | | 25 | rsm10000005699 | 8967047 | Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3] | +-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
it captures LRS
MariaDB [db_webqtl]> select LRS from PublishXRef where PhenotypeId=4 and InbredSetId=1; +--------------------+ | LRS | +--------------------+ | 13.497491147108706 | +--------------------+
and finally the trait values that are used for mapping
select Strain.Name, PublishData.value from Strain, PublishData where Strain.Id = PublishData.StrainId and PublishData.Id = 8967043; +-------+-----------+ | Name | value | +-------+-----------+ | BXD1 | 61.400002 | | BXD2 | 49.000000 | | BXD5 | 62.500000 | | BXD6 | 53.099998 | | BXD8 | 59.099998 | | BXD9 | 53.900002 | | BXD11 | 53.099998 | | BXD12 | 45.900002 | | BXD13 | 48.400002 | | BXD14 | 49.400002 | | BXD15 | 47.400002 | | BXD16 | 56.299999 | | BXD18 | 53.599998 | | BXD19 | 50.099998 | | BXD20 | 48.200001 | | BXD21 | 50.599998 | | BXD22 | 53.799999 | | BXD23 | 48.599998 | | BXD24 | 54.900002 | | BXD25 | 49.599998 | | BXD27 | 47.400002 | | BXD28 | 51.500000 | | BXD29 | 50.200001 | | BXD30 | 53.599998 | | BXD31 | 49.700001 | | BXD32 | 56.000000 | | BXD33 | 52.099998 | | BXD34 | 53.700001 | | BXD35 | 49.700001 | | BXD36 | 44.500000 | | BXD38 | 51.099998 | | BXD39 | 54.900002 | | BXD40 | 49.900002 | | BXD42 | 59.400002 | +-------+-----------+
Note that we need to filter out the parents - the original reaper script does not do that! My gn-guile code does handle that:
SELECT StrainId,Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId =1 AND Used_for_mapping<>'Y' limit 5; +----------+----------+ | StrainId | Name | +----------+----------+ | 1 | B6D2F1 | | 2 | C57BL/6J | | 3 | DBA/2J | | 150 | A/J | | 151 | AXB1 | +----------+----------+ etc.
Also Bonz' script
has an interesting query:
MariaDB [db_webqtl]> SELECT DISTINCT PublishFreeze.Name, PublishXRef.Id FROM PublishData INNER JOIN Strain ON PublishData.StrainId = Strain.Id INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId WHERE PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1 ORDER BY PublishFreeze.Id, PublishXRef.Id limit 5; +------------+-------+ | Name | Id | +------------+-------+ | BXDPublish | 10001 | | BXDPublish | 10002 | | BXDPublish | 10003 | | BXDPublish | 10004 | | BXDPublish | 10005 | +------------+-------+ 5 rows in set (0.239 sec)
that shows we have 13689 BXDPublish datasets. It also has
SELECT JSON_ARRAYAGG(JSON_ARRAY(Strain.Name, PublishData.Value)) AS data, MD5(JSON_ARRAY(Strain.Name, PublishData.Value)) as md5hash FROM PublishData INNER JOIN Strain ON PublishData.StrainId = Strain.Id INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId WHERE PublishFreeze.Name = "BXDPublish" AND PublishFreeze.public > 0 AND PublishData.value IS NOT NULL AND PublishFreeze.confidentiality < 1 ORDER BY LENGTH(Strain.Name), Strain.Name LIMIT 5;
best to pipe that to a file. It outputs JSON and an MD5SUM straight from mariadb. Interesting.
Finally, let's have a look at the existing GN API
SELECT Strain.Name, Strain.Name2, PublishData.value, PublishData.Id, PublishSE.error, NStrain.count FROM (PublishData, Strain, PublishXRef, PublishFreeze) LEFT JOIN PublishSE ON (PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId) LEFT JOIN NStrain ON (NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId) WHERE PublishXRef.InbredSetId = 1 AND PublishXRef.PhenotypeId = 4 AND PublishData.Id = PublishXRef.DataId AND PublishData.StrainId = Strain.Id AND PublishXRef.InbredSetId = PublishFreeze.InbredSetId AND PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1 ORDER BY Strain.Name; +-------+-------+-----------+---------+-------+-------+ | Name | Name2 | value | Id | error | count | +-------+-------+-----------+---------+-------+-------+ | BXD1 | BXD1 | 61.400002 | 8967043 | 2.38 | NULL | | BXD11 | BXD11 | 53.099998 | 8967043 | 1.1 | NULL | | BXD12 | BXD12 | 45.900002 | 8967043 | 1.09 | NULL | | BXD13 | BXD13 | 48.400002 | 8967043 | 1.63 | NULL | ...
which actually blocks non-public sets and shows std err, as well as counts when available(?) It does not exclude the parents for mapping (btw). That probably happens on the mapping page itself.
Probably the most elegant query is in GN3 API:
SELECT st.Name, ifnull(pd.value, 'x'), ifnull(ps.error, 'x'), ifnull(ns.count, 'x') FROM PublishFreeze pf JOIN PublishXRef px ON px.InbredSetId = pf.InbredSetId JOIN PublishData pd ON pd.Id = px.DataId JOIN Strain st ON pd.StrainId = st.Id LEFT JOIN PublishSE ps ON ps.DataId = pd.Id AND ps.StrainId = pd.StrainId LEFT JOIN NStrain ns ON ns.DataId = pd.Id AND ns.StrainId = pd.StrainId WHERE px.PhenotypeId = 4 limit 5; +------+-----------------------+-----------------------+-----------------------+ | Name | ifnull(pd.value, 'x') | ifnull(ps.error, 'x') | ifnull(ns.count, 'x') | +------+-----------------------+-----------------------+-----------------------+ | BXD1 | 61.400002 | 2.38 | x | | BXD2 | 49.000000 | 1.25 | x | | BXD5 | 62.500000 | 2.32 | x | | BXD6 | 53.099998 | 1.22 | x | | BXD8 | 59.099998 | 2.07 | x | +------+-----------------------+-----------------------+-----------------------+
written by Zach and Bonface. See