Dump sample data to LMDB

assigned: bonfacem
priority: high
status: stalled
keywords: lmdb, rdf

Tasks

Dump data and add relevant RDF Metadata for LMDB for:

[ ] phenotypes
[ ] probesets
[ ] genotypes
[ ] GN2/3 Integration

For RDF: https://issues.genenetwork.org/search?query=assigned%3ABonfaceKilz+AND+tag%3Ain-progress+AND+priority%3Ahigh+AND+is%3Aopen&type=open-issue

[ ] Have files and named files available through RDF

Description

For GeneNetwork2, a dataset is made up of multiple traits, each with its own sample data. The trait's name is a combination of the species name and the trait's ID (for genotypes/probesets this may not be the case), which is obtained from a SQL table. The objective of this task is to store each trait's sample data in LMDB, allowing it to be accessed quickly in GN2/3 via RDF, which will decouple the data from the python-base classes/objects it is associated with, significantly improving sample data access speed.

To fetch all data, including case-attributes data, for published phenotypes in SQL (using BXD_10007 as an example), you would use the following:

http://genenetwork.org/show_trait?trait_id=10007&dataset=BXDPublish

SELECT A.Name, A.Value, A.SE, A.Count, B.CaseAttributes FROM
(SELECT DISTINCT st.Name as 'Name', ifnull(pd.value, 'x') as 'Value',
ifnull(ps.error, 'x') as 'SE', ifnull(ns.count, 'x') as 'Count', ps.StrainId as 'StrainId'
FROM PublishFreeze pf JOIN PublishXRef px ON px.InbredSetId = pf.InbredSetId
JOIN PublishData pd ON pd.Id = px.DataId JOIN Strain st ON pd.StrainId = st.Id
LEFT JOIN PublishSE ps ON ps.DataId = pd.Id AND ps.StrainId = pd.StrainId
LEFT JOIN NStrain ns ON ns.DataId = pd.Id AND ns.StrainId = pd.StrainId
WHERE px.PhenotypeId = 35 ORDER BY st.Name) A
JOIN
(SELECT cxref.StrainId as StrainId, group_concat(ca.Name, '=', cxref.Value) as "CaseAttributes"
FROM CaseAttributeXRefNew cxref LEFT JOIN CaseAttribute ca
ON ca.Id = cxref.CaseAttributeId 
GROUP BY InbredSetId, cxref.StrainId) B ON A.StrainId = B.StrainId;

See this answer for how a join was performed on 2 different queries:

Joining results of two mysql queries

Correlations

Correlations are slow. As of Tuesday April 4, 2023 at 1:37pm:

GN1 took *29 sec* (completed) vs GN2 *38 sec* (completed) for 1457545_at in the Hippocampus Consortium M430v2 (Jun06) PDNN

Hippocampus Consortium M430v2 (Jun06) PDNN

GN1 took *1.56 mins* (completed) vs GN2 *5.24 mins* (completed) for 10528873 in the UTHSC BXD Aged Hippocampus Affy Mouse Gene 1.0 ST (Sep12) RMA Exon Level

UTHSC BXD Aged Hippocampus Affy Mouse Gene 1.0 ST (Sep12) RMA Exon Level

Research Question: What effect would using LMDB have on correlations over text-file caching and sql fetches?