
RDF Refinement

Description

We have:

We want the entire site to be easily navigable and all terms to be self-explanatory. Beyond that, get rid of all the blank nodes. Think of navigation as akin to menu navigation on a website.

Useful reference for the right queries from GN1:

Tasks

Species/InbredSet/MappingMethod/AvgMethod

From this root node, one should be able to navigate to species, inbred-set groups, and datasets (TODO).

  • [X] Use resolvable mapping method.
  • [X] Create ttl file for more metadata around mapping/avg methods used in GN.
  • [X] Made gnc:probeset/gnc:{g/ph}enotype skos:Concept
  • [X] Add description for mapping/averaging nodes.
  • [X] Replace skos:prefLabel/skos:label with skos:definition for all "gnc:"
  • [X] Species page - fan out to all inbredsets that belong to that species.
  • [X] Get rid of blanknodes in gnc:resource_classification_scheme
  • [X] Add schema:domainIncludes for dcat:Dataset and gnc:species
  • [X] Add special nodes for family.
  • [X] Fan out from family page.
  • [X] gnt:belongs_to_species -> gnt:has_species
  • [X] Generate wikidata fan-out links for FamilyOrder.
  • [X] Remodel the gn hierarchy: inbredset group -> population category -> species -> taxonomic family.

Terms:

  • [X] xkos:specializes/generalizes -> xkos:previousLevel/nextLevel
  • [X] gnt:family -> gnt:has_family
  • [X] gnt:genetic-type -> gnt:genetic_type
  • [X] skos:notation -> gnt:has_uniprot_taxon_id
  • [X] <species> rdf:isDefinedBy -> <species> gnt:has_wikidata_link
  • [X] gnt:is_species_of -> gnt:has_strain
  • [X] gnt:belongs_to_group -> gnt:has_strain
  • [X] Use latest inbredset table from tux02 (local db had missing "public" and "description" tables)
  • [X] Update term metadata for "gnt:has_family."
  • [X] Create gnc:family
  • [X] Link family to species/inbredsets.
  • [X] Wrap inbredset metadata - HTML rich text format - in ^^rdf:HTML
  • [X] Mark monkey data as cruft with gn:cruft and add comments in rdfs:comment
  • [X] Transform renames.
  • [X] Move all ontology to one file.

Note: On the main GN page, we don't list InbredSet groups that don't have a family:

mysql> select count(*) FROM InbredSet where Family IS NULL;
+----------+
| count(*) |
+----------+
|       33 |
+----------+
1 row in set (0.01 sec)

Datasets

  • [X] gn:datasets->metadata
  • [ ] Share datasets older than 2025 with Rob & Pjotr.

Checking for trait and co-factors:

SELECT
    s.Name AS species_name,
    i.Id   AS inbredset_id,
    i.Name AS inbredset_name,
    'Traits and Cofactors' AS dataset_type
FROM Species s
JOIN InbredSet i
  ON i.SpeciesId = s.Id
JOIN PublishFreeze p
  ON p.InbredSetId = i.Id
WHERE p.Name = CONCAT(i.Name, 'Publish');

Checking for DNA Markers and SNPs:

SELECT
    s.Name AS species_name,
    i.Id   AS inbredset_id,
    i.Name AS inbredset_name,
    'DNA Markers and SNPs' AS dataset_type
FROM Species s
JOIN InbredSet i
  ON i.SpeciesId = s.Id
JOIN GenoFreeze g
  ON g.InbredSetId = i.Id
WHERE g.Name = CONCAT(i.Name, 'Geno');

Checking for Molecular Traits:

SELECT DISTINCT
    s.Name AS species_name,
    i.Name AS inbredset_name,
    t.Name AS dataset_type,
    psf.FullName as dataset_full_name
FROM Species s
JOIN InbredSet i
  ON i.SpeciesId = s.Id
JOIN ProbeFreeze pf
  ON pf.InbredSetId = i.Id
JOIN ProbeSetFreeze psf
  ON psf.ProbeFreezeId = pf.Id
JOIN Tissue t
  ON pf.TissueId = t.Id
WHERE s.Name = 'rat'
  AND i.Name = 'HXBBXH'
  AND psf.public > 0
  AND t.Name = 'Adipose mRNA'
ORDER BY
    s.Name, i.Name, t.Name, psf.FullName;

  • [ ] (w/ Alex) Link dataset view from RDF -> LMDB
  • [ ] Figure out table fan-out with existing data.

Molecular Traits Dataset

  • [X] Ontology for describing tissue.
  • [X] Move investigators to own file.
  • [X] Delete old probeset definitions.
  • [X] Link all datasets to type and family.
  • [X] Remodel gene-chip metadata.
  • [X] Refactor molecular-traits.scm to fetch metadata from Datasets table.
  • [X] Add missing definitions for gnc:has_probeset_data.
  • [X] Refactor gn:dataset->metadata.
  • [X] Remove duplicate queries.
  • [X] gn:dataset->metadata (only molecular traits have normalization, avg)
  • [X] gn:molecular-trait->gn:dataset
  • [X] gn:set->gn:dataset
  • [X] gnc:molecular_trait->gn:molecular_trait
  • [ ] LMDB Data ∀ traits.

Genotype Dataset

  • [X] gn:set->gn:dataset (gnt:has_genotype_data)
  • [X] gn:dataset->set (gnt:has_strain)
  • [ ] (?) LMDB Data ∀ traits.
  • [-] (cancelled) gn:set->markers

Markers that belong to more than one species:

SELECT
    Geno.*,
    COUNT(*) AS cnt
FROM Geno
GROUP BY
    Name,
    Marker_Name
HAVING COUNT(*) > 1;

From the above results we can confirm:

SELECT Geno.* FROM Geno WHERE Marker_Name IN ("D11Mit2", "D12Mit1", "D3Mit17") ORDER BY Marker_Name\G

We see that a marker can belong to more than one species.
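
A more direct check is to count distinct species per marker name. The following is a minimal sketch, not the query used in GN: it runs against an in-memory SQLite stand-in for the Geno table, keeps only the columns named in the queries above (SpeciesId, Name, Marker_Name), and uses invented rows. The helper name shared_markers is hypothetical.

```python
import sqlite3

# Toy stand-in for the Geno table. The rows and species ids are invented
# for illustration; only the column names come from the queries above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geno ("
    "Id INTEGER PRIMARY KEY, SpeciesId INTEGER, Name TEXT, Marker_Name TEXT)"
)
conn.executemany(
    "INSERT INTO Geno (SpeciesId, Name, Marker_Name) VALUES (?, ?, ?)",
    [
        (1, "D11Mit2", "D11Mit2"),
        (2, "D11Mit2", "D11Mit2"),  # same marker name, second species
        (1, "D12Mit1", "D12Mit1"),
        (1, "rsExample", "rsExample"),
    ],
)


def shared_markers(conn):
    """Return (marker_name, species_count) for markers seen in >1 species."""
    return conn.execute(
        """
        SELECT Marker_Name, COUNT(DISTINCT SpeciesId) AS species_count
        FROM Geno
        GROUP BY Marker_Name
        HAVING COUNT(DISTINCT SpeciesId) > 1
        ORDER BY Marker_Name
        """
    ).fetchall()


print(shared_markers(conn))  # [('D11Mit2', 2)] for the toy rows above
```

The SELECT itself uses only standard SQL, so it should carry over to the real database largely unchanged, but that has not been verified here.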

Counting markers per public GenoFreeze:

SELECT
    gf.Name AS GenoFreezeName,
    COUNT(DISTINCT g.Marker_Name) AS MarkerCount
FROM GenoFreeze gf
INNER JOIN InbredSet i
    ON i.Id = gf.InbredSetId
INNER JOIN Species s
    ON s.Id = i.SpeciesId
INNER JOIN Geno g
    ON g.SpeciesId = s.Id
WHERE
    gf.public > 0
    AND g.Marker_Name IS NOT NULL
GROUP BY
    gf.Name
ORDER BY
    MarkerCount DESC;

Results:

+-----------------------------+-------------+
| GenoFreezeName              | MarkerCount |
+-----------------------------+-------------+
| AD-cases-controls-MyersGeno |      367403 |
| BDF2-2005Geno               |      120531 |
| BXD-MicturitionGeno         |      120531 |
| CTB6F2Geno                  |      120531 |
| Linsenbardt-BoehmGeno       |      120531 |
| AXBXAGeno                   |      120531 |
| B6MRLF2-D2MRLF2Geno         |      120531 |
| BXD-JAX-ADGeno              |      120531 |
| CCGeno                      |      120531 |
| SOTNOT-OHSUGeno             |      120531 |
| B6D2F2-PSUGeno              |      120531 |
| BHHBF2Geno                  |      120531 |
| BXDGeno                     |      120531 |
| DOD-BXD-GWIGeno             |      120531 |
| BDF2-1999Geno               |      120531 |
| BXD-MBD-UTHSCGeno           |      120531 |
| UTHSC-CannabinoidGeno       |      120531 |
| …                           |           … |
| HET3-ITPGeno                |      120531 |
| MDPGeno                     |      120531 |
| HSNIH-PalmerGeno            |       29518 |
| HRDP_HXB-BXH-BPGeno         |       29518 |
| NWU_WKYxF344_F2Geno         |       29518 |
| HXBBXHGeno                  |       29518 |
| MAGIC_LinesGeno             |        8933 |
| J12XJ58F11Geno              |        4938 |
| SXMGeno                     |         792 |
| ColXCviGeno                 |         133 |
| BayXShaGeno                 |         133 |
| ColXBurGeno                 |         133 |
+-----------------------------+-------------+

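Every GenoFreeze of a given species reports the same count, because the query joins markers on species rather than on the dataset. A fan-out from genotypes -> markers is therefore not feasible; too much repetition.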

We only have 532,248 markers:

> SELECT COUNT(*) FROM Geno;

+----------+
| count(*) |
+----------+
|   532248 |
+----------+

Instead, present the number of markers. Link the snps/dna-markers to species. Show how to access them.
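
One way to present the number of markers is a single count per species. This is a minimal sketch under stated assumptions: an in-memory SQLite stand-in for the Species and Geno tables, invented rows, and a hypothetical helper name marker_counts_by_species. It is not the final example query for the dataset pages.

```python
import sqlite3

# Toy stand-ins for Species and Geno; the data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Species (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Geno (
        Id INTEGER PRIMARY KEY, SpeciesId INTEGER, Marker_Name TEXT
    );
    INSERT INTO Species (Id, Name) VALUES (1, 'mouse'), (2, 'rat');
    INSERT INTO Geno (SpeciesId, Marker_Name) VALUES
        (1, 'm1'), (1, 'm2'), (1, 'm3'), (2, 'm1'), (2, NULL);
    """
)


def marker_counts_by_species(conn):
    """Return (species_name, marker_count), largest count first."""
    return conn.execute(
        """
        SELECT s.Name, COUNT(DISTINCT g.Marker_Name) AS marker_count
        FROM Species s
        JOIN Geno g ON g.SpeciesId = s.Id
        WHERE g.Marker_Name IS NOT NULL
        GROUP BY s.Name
        ORDER BY marker_count DESC
        """
    ).fetchall()


print(marker_counts_by_species(conn))  # [('mouse', 3), ('rat', 1)]
```

Counting once per species avoids repeating the same number for every GenoFreeze, which is the repetition seen in the table above.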

  • [ ] gn-dataset -> marker_count/example-query
  • [ ] markers -> metadata

Phenotype Dataset

  • [X] gn:phenotype->metadata
  • [X] Figure out linking phenotypes LRS and other stats metadata without using a blank node (and dup)
  • [X] LMDB Data ∀ traits.
  • [X] gn:set->gn:dataset (gnt:has_phenotype_data)
  • [X] gn:dataset->set (gnt:has_strain)

Data entry error in:

Phenotypes / RIF / Case Attributes / Individual Strains / Publications

Genotypes and markers are different but related. Different species can have different markers.

  • [X] Phenotypes.
  • [X] Publications.
  • [X] gn:set->gn:dataset.
  • [X] gn:dataset->gn:trait
  • [-] probesets
  • [ ] genotypes

Link experiment data from LMDB / Compute Data / RIF / Case Attributes

  • [ ] (GEMMA/ Rqtl) Compute data.
  • [ ] RIF.
  • [ ] Case Attributes.
  • [ ] Individual Strains.

GN Ontology

  • [ ] Create endpoints that list all "gnt:" and "gnc:" terms.
  • [ ] Create aliases for ontology.

Private data / Extras

  • [ ] Add sparql queries as an example.
  • [ ] ! Generate a list of data older than 2020 and ping Rob/Pjotr.

Post Mark-up

  • [ ] Re-visit how we store all HTML metadata. Clean this up.
  • [ ] Sync mariadb tux01 with tux02; have rdf.genenetwork.org be the latest.
  • [ ] Make sure that the rdf.genenetwork.org named graph is available on public end-point (mention to Fred about the nuance of moving to a new graph without breaking CD/Prod from old code that used the old genenetwork.org graph).