GeneNetwork Uploader Requirements

*OR*: "Reattaching the Head on Frederick's Headless-Chicken Development :-)"

Introduction

I (Frederick M. Muriithi) have been building the GeneNetwork Data Uploader project.

As part of that work, we have come across a number of requirements, both implicit and explicit, that must be met before users can upload new data to the system. This document discusses these requirements and offers possible solutions for some of them.

NOTE: This is an evolving document, and will change as the requirements, technology and understanding change.

Direction

The purpose of the system is to allow users to upload their data and be able to analyse it with the system.

There are two major schools of thought regarding this:

Basic upload of data for testing and analysis, with (a) curation step(s) later to make it conform to standards
Upload with strict checks to ensure data conforms to standards before upload is successful

The first of the schools of thought will allow the users to upload data and play around with it, correcting any errors before it is finally acceptable enough to upload to the main database. This implies use of a data staging area, or even a separate testing database to hold the data. There might need to be a GeneNetwork system with access to the staging area or testing database to allow the user(s) to do analysis of these "incomplete" data.

The second school of thought requires that the data the user uploads be complete, i.e. the numerical data is correct and there is complete accompanying metadata for the data, including descriptions that fulfil all requirements. We might need a curation step here too.
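The difference between the two schools can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`value`, `description`), the two checks, and the destination labels are all hypothetical and are not taken from the uploader's code.

```python
from dataclasses import dataclass, field


@dataclass
class UploadResult:
    destination: str                      # "staging", "main" or "rejected"
    errors: list = field(default_factory=list)


def check_rows(rows):
    """Return a list of human-readable problems found in `rows`."""
    errors = []
    for lineno, row in enumerate(rows, start=1):
        if not row.get("description"):
            errors.append(f"row {lineno}: missing description")
        try:
            float(row.get("value", ""))
        except (TypeError, ValueError):
            errors.append(f"row {lineno}: value is not numeric")
    return errors


def upload(rows, strict):
    """Decide where an upload ends up under each school of thought."""
    errors = check_rows(rows)
    if not strict:
        # School 1: always accept into the staging area; errors are
        # reported so they can be fixed during later curation.
        return UploadResult("staging", errors)
    if errors:
        # School 2: any failed check blocks the upload.
        return UploadResult("rejected", errors)
    return UploadResult("main", [])
```

With `strict=False` a file with problems still lands in the staging area along with its error list, while `strict=True` refuses it outright.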

General Notes

We should probably start modifying our tables to use FULL IDENTIFIERS for the data. The full identifier includes ALL identifiers for any parent tables/data in addition to the specific record identifier.

e.g. For data, say in a ProbeSet table, instead of simply ProbeSet(`Id`) as an identifier, the complete identifier would be something like:

ProbeSet(`SpeciesId`, `PlatformId`, `Id`)

while for the ProbeFreeze table it would be something like:

ProbeFreeze(`SpeciesId`, `PlatformId`, `PopulationId`, `ProbeSetId`, `Id`)

and for data in ProbeSetData, it would be something like:

ProbeSetData(`SpeciesId`, `PlatformId`, `PopulationId`, `ProbeSetId`, `ProbeFreezeId`, `ProbeSetFreezeId`)

We can then have table indexes composed of one or more of the elements of the *FULL IDENTIFIER* for faster queries.

**NOTE 01**: The FULL IDENTIFIERS above should be hierarchical, beginning with the "oldest" ancestor and ending with the current record's ID.

**NOTE 02**: The examples of the FULL IDENTIFIERS above might not be complete. I'll update them as I tease more information from the database.
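As a rough sketch of the idea, the ProbeSet example could be expressed as a composite primary key plus a secondary index on one element of the identifier. The sketch uses Python's built-in `sqlite3` as a stand-in for MariaDB, and the `Name` column, index name and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProbeSet (
    SpeciesId  INTEGER NOT NULL,
    PlatformId INTEGER NOT NULL,
    Id         INTEGER NOT NULL,
    Name       TEXT,
    -- The FULL IDENTIFIER: oldest ancestor first, own Id last.
    PRIMARY KEY (SpeciesId, PlatformId, Id)
);
-- An index built from a single element of the full identifier.
CREATE INDEX ProbeSet_by_platform ON ProbeSet (PlatformId);
""")
conn.executemany(
    "INSERT INTO ProbeSet VALUES (?, ?, ?, ?)",
    [(1, 10, 1, "probe-a"), (1, 10, 2, "probe-b"), (2, 20, 3, "probe-c")],
)


def probesets_for_platform(conn, species_id, platform_id):
    """List probeset names under one species/platform pair."""
    rows = conn.execute(
        "SELECT Name FROM ProbeSet "
        "WHERE SpeciesId = ? AND PlatformId = ? ORDER BY Id",
        (species_id, platform_id),
    )
    return [name for (name,) in rows]
```

Queries that filter on the leading parts of the identifier (species, then platform) can use the primary key directly; the extra index helps queries that start from the platform alone.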

Data Categories

There are different data "categories" that could be uploaded into the system, some of which depend on others already existing in the system before they can be uploaded. The "categories" are:

Species information

All the various data of interest to the system are grouped under one species or another. This means that there is a possibility that the user could want to upload data that belongs to a species that does not already exist on GeneNetwork. We might therefore need a way to allow the upload of new Species information, maybe with a verification step before the data hits the database.

Species --> {{{ data of various sorts }}}

The important species information we need to collect is:

Species name e.g. mouse, rat, blue gum
Scientific name e.g. Mus musculus, Rattus norvegicus, Eucalyptus globulus, etc
Family e.g. Vertebrates, Plants, etc.

Maybe we should do the whole "Classification" thing (Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species) e.g. for mouse: Eukaryota, Animalia, Chordata, Mammalia, Rodentia, Muridae, Mus, musculus. I do note, however, that this might be too much, what with the subphyla, subgenera, etc.

There is a benefit, however, of having such information: we can index the data by some of these fields, enabling us to query by say, 'Class' to quickly get all species of mammals in the system.
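A minimal sketch of such a query, again using `sqlite3` as a stand-in for MariaDB. The column names `TaxonClass`, `TaxonOrder` and `TaxonFamily` are hypothetical; they are not fields of the existing Species table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Species (
    Id             INTEGER PRIMARY KEY,
    Name           TEXT NOT NULL,
    ScientificName TEXT NOT NULL,
    TaxonClass     TEXT,   -- hypothetical classification columns
    TaxonOrder     TEXT,
    TaxonFamily    TEXT
);
CREATE INDEX Species_by_class ON Species (TaxonClass);
""")
conn.executemany(
    "INSERT INTO Species "
    "(Name, ScientificName, TaxonClass, TaxonOrder, TaxonFamily) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("mouse", "Mus musculus", "Mammalia", "Rodentia", "Muridae"),
        ("rat", "Rattus norvegicus", "Mammalia", "Rodentia", "Muridae"),
        ("blue gum", "Eucalyptus globulus", "Magnoliopsida", "Myrtales", "Myrtaceae"),
    ],
)


def species_in_class(conn, taxon_class):
    """Return the common names of all species in one taxonomic class."""
    rows = conn.execute(
        "SELECT Name FROM Species WHERE TaxonClass = ? ORDER BY Name",
        (taxon_class,),
    )
    return [name for (name,) in rows]
```

Calling `species_in_class(conn, "Mammalia")` picks out the mouse and rat rows and leaves the plant out.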

Platform Information

Hierarchy

Species --> Platform --> {{{ data of various sorts }}}

These are (sequencing?) platforms used for the generation of the data that is then to be uploaded. Some of the platforms are registered with NCBI's Gene Expression Omnibus system and some are not.

Each platform also seems to be tied to a specific organism/species (please confirm).

The information regarding the platforms that we need is:

SpeciesId (see 'Species information' above)
Platform Name
Name
Title
GEOPlatform (Optional): NCBI's GEO identifier for the platform e.g. GPL34216

rules for platform names.

The Platform details on NCBI also include sample data. If the user specifies a GEO platform ID, we could perhaps fetch such details and auto-populate the relevant tables. We'd (well, mostly I, Fred) need to figure out whether NCBI provides an API for such.
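One possible starting point, offered as an unverified sketch: GEO's accession viewer (`acc.cgi`) appears to serve records as SOFT-format text when given `form=text`, so a platform record could be requested by URL and its `!Key = value` lines parsed. The URL parameters below should be checked against NCBI's documentation before anything is built on them. The function names and the sample record are made up for illustration, and no network call is made here.

```python
import re
from urllib.parse import urlencode

# Assumption: GEO's accession viewer endpoint; verify against NCBI's docs.
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_platform_url(accession):
    """Build the URL for a platform's brief SOFT-format text record."""
    if not re.fullmatch(r"GPL\d+", accession):
        raise ValueError(f"not a GEO platform accession: {accession!r}")
    query = urlencode(
        {"acc": accession, "targ": "self", "form": "text", "view": "brief"}
    )
    return f"{GEO_ACC_URL}?{query}"


def parse_soft_attributes(text):
    """Collect '!Key = value' lines into a dict mapping key -> list of values."""
    attrs = {}
    for line in text.splitlines():
        if not line.startswith("!") or " = " not in line:
            continue
        key, value = line[1:].split(" = ", 1)
        attrs.setdefault(key.strip(), []).append(value.strip())
    return attrs


# A made-up record in the general shape of SOFT text, not real GEO output.
SAMPLE = """^PLATFORM = GPL0000
!Platform_title = Example array (made up)
!Platform_organism = Mus musculus
"""
```

The parsed `Platform_organism` value could then be matched against the Species information to confirm that the platform is tied to the expected species.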

Genotype Information

Hierarchy

Species --> Genotype --> {{{ data of various sorts }}}

SpeciesId (see 'Species information' above): This is an internal identifier for the Species information we have collected before.
Name: Name of genotype
Marker name:
Chromosome:
Megabases: This is location information in megabase pairs
Sequence:
Source: Provider of the information, e.g. an institute, person, etc. Is this

We could index the genotype information by the following fields:

SpeciesId: For faster queries for a particular species' genotypes
...

Assembly Information

mm8
mm10
mm11
...

etc.

I still do not wholly comprehend this. This might be related to the platform information.

From the 'Geno' table, I see the fields 'Mb_mm8' and 'Chr_mm8' that indicate that this information can affect data possibly downstream to the Geno data.

We probably need a way to separate these from the Geno table, while maintaining the link to the downstream data.

Tables affected by this information:

Geno
Chr_Length
...
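One way the separation could look, sketched with `sqlite3` standing in for MariaDB: marker positions move out of per-assembly columns such as 'Mb_mm8' and 'Chr_mm8' into a table keyed by marker and assembly. The `GenoPosition` table, the trimmed-down `Geno` table, the marker name and the position values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for the real Geno table.
CREATE TABLE Geno (
    Id        INTEGER PRIMARY KEY,
    SpeciesId INTEGER NOT NULL,
    Name      TEXT NOT NULL
);
-- Hypothetical table: one row per marker per assembly.
CREATE TABLE GenoPosition (
    GenoId   INTEGER NOT NULL REFERENCES Geno (Id),
    Assembly TEXT NOT NULL,      -- e.g. 'mm8', 'mm10'
    Chr      TEXT NOT NULL,
    Mb       REAL NOT NULL,
    PRIMARY KEY (GenoId, Assembly)
);
""")
conn.execute("INSERT INTO Geno VALUES (1, 1, 'marker-1')")
conn.executemany(
    "INSERT INTO GenoPosition VALUES (?, ?, ?, ?)",
    [(1, "mm8", "1", 3.01), (1, "mm10", "1", 3.2)],  # made-up positions
)


def marker_position(conn, marker_name, assembly):
    """Return (Chr, Mb) for a marker in one assembly, or None if unknown."""
    return conn.execute(
        "SELECT p.Chr, p.Mb FROM Geno AS g "
        "JOIN GenoPosition AS p ON p.GenoId = g.Id "
        "WHERE g.Name = ? AND p.Assembly = ?",
        (marker_name, assembly),
    ).fetchone()
```

Adding a new assembly then means adding rows rather than columns, and downstream data keeps pointing at the same Geno record.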

Population Information

This is the second major organisational grouping of the data under the Species, i.e. data is organised hierarchically under Species, then Population:

Species --> Population --> {{{ data of various sorts }}}

SpeciesId (see 'Species information' above)
Population name: InbredSetName, Name, and FullName. What are the differences?
GeneticType: e.g. riset, intercross, etc.
Family:
MappingMethodId:

Samples/Cases/Individuals Information

Hierarchy

Species --> Population --> Samples --> {{{ data of various sorts }}}

The data we need to collect/have for the samples are:

SpeciesId (see 'Species Information' above)
PopulationId (see 'Population Information' above)
Name: Official sample name/symbol
Alias: An alias for the sample (Optional)
Symbol: short strain symbol used in graphs and tables - looks like a display thing; look into this.

** Samples might also be related to the platform: see 'Platform Information' above.

From the existing `Strain` table, it seems you can only have one-and-only-one sample for a particular species with a specific name.

MariaDB [db_webqtl]> SHOW CREATE TABLE Strain;
...
| Strain | CREATE TABLE `Strain` (
  `Id` int(20) NOT NULL AUTO_INCREMENT,
  `Name` varchar(100) DEFAULT NULL,
  `Name2` varchar(100) DEFAULT NULL,
  `SpeciesId` smallint(5) unsigned NOT NULL DEFAULT 0,
  `Symbol` varchar(20) DEFAULT NULL,
  `Alias` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`Id`),
  UNIQUE KEY `Name` (`Name`,`SpeciesId`),
  KEY `Symbol` (`Symbol`)
) ENGINE=InnoDB AUTO_INCREMENT=180927 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci |
...
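The effect of the UNIQUE KEY on (`Name`, `SpeciesId`) can be reproduced in a small sketch. It uses `sqlite3` in place of MariaDB and a trimmed-down copy of the table. The `add_strain` helper and the sample names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Strain (
    Id        INTEGER PRIMARY KEY AUTOINCREMENT,
    Name      TEXT,
    SpeciesId INTEGER NOT NULL DEFAULT 0,
    UNIQUE (Name, SpeciesId)
)""")


def add_strain(conn, name, species_id):
    """Insert a sample; return False when (Name, SpeciesId) already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Strain (Name, SpeciesId) VALUES (?, ?)",
                (name, species_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Inserting the same name twice for one species fails on the second attempt, while the same name under a different species is accepted. The uploader would need to report such a clash to the user rather than silently skip the sample.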

We could index this information by any one, or combinations of the following fields:

SpeciesId
PopulationId

and maybe even drop the need for the 'StrainXRef' table. (*To be considered*)

Tissue Information

Hierarchy

Species --> ?? ... ?? --> Tissue --> {{{ data of various sorts }}}

Felix discovered the need for this when uploading the Arabidopsis thaliana data into the test database with the uploader. Expression data to be uploaded has to be linked to a tissue, and the existing tissue information (as of before 2024-02-22T09:45+03:00UTC) seems to only belong to vertebrates, not plants.

**Find out more about tissue and linkages to other data**

Tables:

Tissue
TissueProbeFreeze
TissueProbeSetData
TissueProbeSetFreeze
TissueProbeSetXRef

...

Expression Data Information

Hierarchy

Species --> ?? ... ?? --> Expression Data --> {{{ data of various sorts }}}

The ' --> ?? ... ?? --> ' section winds through Platform, Population, Genotype, Tissue, Samples etc. before making its way to the expression data information. I still need to unwind the hierarchy and list the paths here.

Affects the following database tables:

ProbeSet
ProbeFreeze
ProbeSetXRef
ProbeSetData: Data matrix - numerical values for use in analyses
ProbeSetSE: Standard error values for use in analyses

Some mandatory data we need:

SpeciesId (see 'Species Information' above)
PlatformId (see 'Platform Information' above)
Name: Phenotype identifier for the platform above
Gene Symbol: ...
Chromosome:
Megabases:
Description: A description for the phenotype
GeneId: Entrez gene ID from NCBI
Strand_Gene/Strand_Probe: The DNA strand (+ or -) of the gene assigned to the phenotype. Leading or lagging strand.

Maybe the *Chromosome* and *Megabases* values could be replaced by a single link to a ChromosomeId or such... maybe a table linking the chromosome to its specific assembly e.g.

ProbeSet(ChromosomeAssemblyId) --> (Id)ChromosomeAssembly(ChromosomeId) --> Chromosome(Id)

...

Publish Phenotype Data

We need a way for the uploader to distinguish between "Expression Data Phenotypes" and these "Publish Phenotypes".

I have not previously dealt with uploading "Publish Phenotype" (or "Classic Phenotype") data. This section begins an exploration of how that would come about.

Database tables affected:

Phenotype
Publication
PublishData
PublishFreeze: Links to population
PublishSE
PublishXRef: Links to population, phenotype, publication, and PublishData

These have a form very similar to the expression data.

Some important data required:

Units: Units of measurement for the phenotype

Description for "Publish Phenotypes"

Others? ...

...

Descriptions

For "Publish Phenotypes", the descriptions have strict requirements, listed at the link below:

https://info.genenetwork.org/faq.php#q-22

The list of "General Category and Ontology Terms" as of 2024-02-22T11:58+03:00UTC is only usable for vertebrates. We will need an extended list for other species, e.g. plant species, invertebrates, etc.

Display

Data should be saved to the database with as accurate and precise information as possible. This means the data in the database could have a large-ish number of decimal places.

The UI (User interface) should then truncate or round off those decimal places as needed to give the user a nice display of the data, and maybe a table or key with the non-modified values as necessary.
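A minimal sketch of that split between stored and displayed values. The helper names and the default of three decimal places are arbitrary choices for illustration, not anything the UI does today.

```python
def display_value(value, places=3):
    """Format a stored value for display; the stored value is untouched."""
    return f"{value:.{places}f}"


def display_key(values, places=3):
    """Pair each rounded display string with its unmodified stored value."""
    return [(display_value(v, places), v) for v in values]


stored = [12.3456789012, 0.000123456]
key = display_key(stored)
# Each entry of `key` holds the rounded text for the main display and the
# original number for a "raw values" table or key.
```

Rounding only at display time keeps the full precision available for analyses, while `display_key` gives the UI what it needs to show the unmodified values alongside.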
