We have:
We want to make that entire site is easily navigatable; and make all terms self-explanatory. Beyond that get rid of all the blank nodes. Think of navigation akin to menu navigation from a website.
Useful reference for the right queries from GN1:
From this root node, one should be able to navigate to species, inbred-set groups, and datasets (TODO).
Terms:
Note: In the main GN page, we don't list IndbredSet groups that don't have a family:
mysql> select count(*) FROM InbredSet where Family IS NULL; +----------+ | count(*) | +----------+ | 33 | +----------+ 1 row in set (0.01 sec)
Checking for trait and co-factors:
SELECT
s.Name AS species_name,
i.Id AS inbredset_id,
i.Name AS inbredset_name,
'Traits and Cofactors' AS dataset_type
FROM Species s
JOIN InbredSet i
ON i.SpeciesId = s.Id
JOIN PublishFreeze p
ON p.InbredSetId = i.Id
WHERE p.Name = CONCAT(i.Name, 'Publish');
Checking for DNA Markers and SNPs:
SELECT
s.Name AS species_name,
i.Id AS inbredset_id,
i.Name AS inbredset_name,
'DNA Markers and SNPs' AS dataset_type
FROM Species s
JOIN InbredSet i
ON i.SpeciesId = s.Id
JOIN GenoFreeze g
ON g.InbredSetId = i.Id
WHERE g.Name = CONCAT(i.Name, 'Geno');
Checking for Molecular Traits:
SELECT DISTINCT
s.Name AS species_name,
i.Name AS inbredset_name,
t.Name AS dataset_type,
psf.FullName as dataset_full_name
FROM Species s
JOIN InbredSet i
ON i.SpeciesId = s.Id
JOIN ProbeFreeze pf
ON pf.InbredSetId = i.Id
JOIN ProbeSetFreeze psf
ON psf.ProbeFreezeId = pf.Id
JOIN Tissue t
ON pf.TissueId = t.Id
WHERE s.Name = 'rat' AND i.Name = 'HXBBXH' AND
psf.public > 0 AND t.Name = 'Adipose mRNA' GROUP BY s.Name, i.Name
ORDER BY
s.Name, i.Name, t.Name, psf.FullName;
Markers that belong to more than one species:
SELECT
Geno.*,
COUNT(*) AS cnt
FROM Geno
GROUP BY
Name,
Marker_Name
HAVING COUNT(*) > 1;
From above results we can confirm:
SELECT Geno.* FROM Geno WHERE Marker_Name IN ("D11Mit2", "D11Mit2", "D12Mit1", "D3Mit17") ORDER BY Marker_Name\G
We see that the markers we have markers that can belong to more than one species.
Counting markers per public GenoFreeze:
SELECT
gf.Name AS GenoFreezeName,
COUNT(DISTINCT g.Marker_Name) AS MarkerCount
FROM GenoFreeze gf
INNER JOIN InbredSet i
ON i.Id = gf.InbredSetId
INNER JOIN Species s
ON s.Id = i.SpeciesId
INNER JOIN Geno g
ON g.SpeciesId = s.Id
WHERE
gf.public > 0
AND g.Marker_Name IS NOT NULL
GROUP BY
gf.Name
ORDER BY
MarkerCount DESC;
Results:
+-----------------------------+-------------+ | GenoFreezeName | MarkerCount | +-----------------------------+-------------+ | AD-cases-controls-MyersGeno | 367403 | | BDF2-2005Geno | 120531 | | BXD-MicturitionGeno | 120531 | | CTB6F2Geno | 120531 | | Linsenbardt-BoehmGeno | 120531 | | AXBXAGeno | 120531 | | B6MRLF2-D2MRLF2Geno | 120531 | | BXD-JAX-ADGeno | 120531 | | CCGeno | 120531 | | SOTNOT-OHSUGeno | 120531 | | B6D2F2-PSUGeno | 120531 | | BHHBF2Geno | 120531 | | BXDGeno | 120531 | | DOD-BXD-GWIGeno | 120531 | | BDF2-1999Geno | 120531 | | BXD-MBD-UTHSCGeno | 120531 | | UTHSC-CannabinoidGeno | 120531 | … | HET3-ITPGeno | 120531 | | MDPGeno | 120531 | | HSNIH-PalmerGeno | 29518 | | HRDP_HXB-BXH-BPGeno | 29518 | | NWU_WKYxF344_F2Geno | 29518 | | HXBBXHGeno | 29518 | | MAGIC_LinesGeno | 8933 | | J12XJ58F11Geno | 4938 | | SXMGeno | 792 | | ColXCviGeno | 133 | | BayXShaGeno | 133 | | ColXBurGeno | 133 | +-----------------------------+-------------+
Same markers. Not feasible to have a fan out from genotypes -> markers; too much repetition.
We only have 532,248 markers:
> SELECT COUNT(*) FROM Geno; +----------+ | count(*) | +----------+ | 532248 | +----------+
Instead, present the number of markers. Link the snps/dna-markers to species. Show how to access them.
Genotypes and markers are different but related. Different Species can have different markers
All probesets should have a name:
SELECT * FROM ProbeSet WHERE ProbeSet.Name IS NULL OR TRIM(ProbeSet.Name) = ''\G
Number of probesets we have:
MariaDB [db_webqtl]> select count(*) from ProbeSet; +----------+ | count(*) | +----------+ | 6436251 | +----------+
Number of experiment that use probesets:
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef; +----------+ | count(*) | +----------+ | 49131499 | +----------+
We can get away with tx'ing ProbeSet in one go. However, file size gets too big and rapper complains about it. Instead, figure out a way to tx ProbeSetXRef in chunks. Note: total transform times averages at about ~21 mins. With probesets/probesetxref, that will balloon upto >1hr. Not worried about optimising things now. That can be worked out for later. Building the probeset table only takes 105m12.350s. With the short form syntax, the size goes from 5G to 4G.
MariaDB [db_webqtl]> SELECT COUNT(*) AS column_count
-> FROM INFORMATION_SCHEMA.COLUMNS
-> WHERE TABLE_SCHEMA = DATABASE()
-> AND TABLE_NAME = 'ProbeSet';
+--------------+
| column_count |
+--------------+
| 72 |
+--------------+
1 row in set (0.00 sec)
-- Phenotypes: SELECT * FROM PublishFreeze WHERE CreateTime > '2020-01-01' AND (public < 1 OR confidentiality > 0); -- Genotypes: SELECT * FROM GenoFreeze WHERE CreateTime > '2020-01-01' AND (public < 1 OR confidentiality > 0); -- Probesets: SELECT * FROM ProbeSetFreeze WHERE CreateTime > '2020-01-01' AND (public < 1 OR confidentiality > 0)\G