Add GeneList to Metadata


  • assigned: bonfacem
  • priority: high
  • type: RDF
  • keywords: virtuoso


GN uses the GeneList table to fetch metadata about given genes; this metadata appears under Resource Links.


When transforming the data, it is unclear how resources (GeneMANIA, STRING, PANTHER, etc.) become links, because the links are constructed manually in GN's source code. This step is crucial when converting the data to RDF.

GeneList Metadata

Consider GN's approach for fetching GeneList entries for a specific trait.

Neither GeneSymbol nor GeneId is unique in the GeneList table. The following queries are examples that return duplicate entries:

SELECT * FROM GeneList WHERE SpeciesId = 1 AND GeneSymbol = "Sp3" AND GeneId = 20687 AND Chromosome = "2"\G

SELECT * FROM GeneList WHERE GeneSymbol = "AB102723" AND GeneId = 3070 AND SpeciesId = 4\G

Identifying duplicates:

SELECT GeneSymbol, GeneId, SpeciesId, COUNT(CONCAT(GeneSymbol, "_", GeneId, "_", SpeciesId)) AS `count` FROM GeneList GROUP BY BINARY GeneSymbol, GeneId, chromosome, txStart, txEnd HAVING COUNT(CONCAT(GeneSymbol, "_", GeneId, "_", SpeciesId)) > 1;
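
The grouping idea can be tried without access to the GN database. The following is a minimal sketch using Python's built-in sqlite3 module and a toy table. The rows are invented for illustration and are not GeneList data, and the helper name find_duplicates is hypothetical. SQLite stands in for MySQL here, so BINARY is omitted (SQLite's default text comparison is already case-sensitive) and COUNT(*) replaces the COUNT(CONCAT(...)) expression.

```python
import sqlite3

# Toy rows (invented, NOT real GeneList data):
# (GeneSymbol, GeneId, SpeciesId, Chromosome, TxStart, TxEnd)
ROWS = [
    ("GeneA", 101, 1, "2", 10.0, 10.5),
    ("GeneA", 101, 1, "2", 10.0, 10.5),  # exact repeat of the row above
    ("GeneB", 202, 1, "7", 3.0, 3.2),
    ("geneb", 202, 1, "7", 3.0, 3.2),    # differs only by letter case
]


def find_duplicates(rows):
    """Return (GeneSymbol, GeneId, count) for every group with more than one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE GeneList (GeneSymbol TEXT, GeneId INTEGER, SpeciesId INTEGER,"
        " Chromosome TEXT, TxStart REAL, TxEnd REAL)"
    )
    conn.executemany("INSERT INTO GeneList VALUES (?, ?, ?, ?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT GeneSymbol, GeneId, COUNT(*) AS n FROM GeneList"
        " GROUP BY GeneSymbol, GeneId, Chromosome, TxStart, TxEnd"
        " HAVING COUNT(*) > 1"
    )
    result = cur.fetchall()
    conn.close()
    return result


duplicates = find_duplicates(ROWS)
print(duplicates)
```

Because the comparison is case-sensitive, "GeneB" and "geneb" fall into separate groups and only the exact repeat is reported.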

Unique Gene Identifiers

In the GeneList table, some genes share GeneIds and GeneSymbols. GeneIds are unique within a species, while GeneSymbols are unique across species. Where two rows have the same GeneSymbol and GeneId, their AlignIDs differ. To create unique identifiers for genes in the GeneList table, we use a query like:

SELECT CONCAT_WS("_", GeneSymbol, GeneID, AlignID) FROM GeneList;
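
Whether such a concatenated key is actually unique can be checked client-side. Below is a minimal sketch, assuming rows arrive as (GeneSymbol, GeneID, AlignID) tuples; the helper names and the sample values are hypothetical. It mirrors MySQL's CONCAT_WS behaviour of skipping NULL arguments, which means rows with a missing AlignID could still collide.

```python
from collections import Counter


def concat_ws(sep, *parts):
    """Join parts with sep, skipping None the way MySQL CONCAT_WS skips NULL."""
    return sep.join(str(p) for p in parts if p is not None)


def colliding_ids(rows):
    """Return identifiers produced by more than one (symbol, gene_id, align_id) row."""
    counts = Counter(concat_ws("_", *row) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Invented sample rows, NOT real GeneList data.
rows = [
    ("GeneA", 101, "align1"),
    ("GeneA", 101, "align2"),  # same symbol and id, different AlignID: distinct keys
    ("GeneB", 202, None),
    ("GeneB", 202, None),      # AlignID missing on both rows: keys collide
]

print(colliding_ids(rows))
```

An empty result from colliding_ids on the real rows would suggest the three-column key is safe to use as an identifier; any returned key points at rows that need another distinguishing column.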

For the GeneList_rn33 table, due to ambiguous cases, we rely on the table's id as a unique identifier. Here's an example of duplicate entries for a gene, differing only in txStart/txEnd/cdsStart/cdsEnd/exonStarts/exonEnd values:

SELECT * FROM GeneList_rn33 WHERE geneSymbol="Cbara1" AND NM_ID="NM_199412"\G
