With the QC/Data Upload project nearing completion and about to be placed in front of the initial user-testing cohort, we need a way to export all uploaded data into the RDF store, either at upload time or shortly after.
Users will use the QC/Data upload project[1] to upload their data to GeneNetwork. This will mostly be numerical data in Tab-Separated-Values (.tsv) files.
Once this is done, we want the data to be available to the user on GeneNetwork as soon as possible, so that they can start analysing it.
Following @Munyoki's work[2] on getting the data endpoints on GN3, it should, hypothetically, be possible for the user to simply upload the data and, using the GN3 API, immediately begin their analyses. In practice, however, this requires that we export the uploaded data into LMDB, and possibly any related metadata into Virtuoso.
This document explores what is needed to get that to work.
We can export the sample (numeric) data to LMDB with the "dataset->lmdb" project[3].
The project (as of 2023-11-14T10:12+03:00UTC) does not define an installable binary/script, and therefore cannot simply be added to the data upload project[1] as a dependency and invoked in the background.
The first line of the uploaded .tsv file is a header line indicating what each field is. In every subsequent line/record, the first field is a trait's name/identifier, and all other fields are the numerical strain/sample values for that trait.
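The layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name `read_traits` is hypothetical, and it assumes every value field parses as a float (no missing-value markers), which may not hold for real uploads.

```python
import csv
from typing import Dict, Iterator, TextIO, Tuple


def read_traits(tsv_file: TextIO) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (trait_name, {sample_name: value}) for each record.

    Assumption: the first line is the header, the first field of each
    record is the trait name/identifier, and all other fields are numeric.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # header[0] labels the trait column
    for row in reader:
        if not row:  # skip blank lines
            continue
        trait, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(f"record {trait!r} does not match the header")
        yield trait, {s: float(v) for s, v in zip(samples, values)}
```

Each yielded pair corresponds to one line/record of the file, so it could be handed to whatever export step is eventually chosen.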
A sample of a .tsv for upload
From
it looks like each record/line/trait from the .tsv file will correspond to a "db-path" in the LMDB data store. This path could be of the form:
/path/to/lmdb/storage/directory/<group-or-inbredset>/<trait-name-or-identifier>/
where
**NB**: Verify this with @Munyoki
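Assuming the path form above holds (it is still unverified, per the note), building a per-trait "db-path" could look like the sketch below. The function name `trait_db_path` and its parameters are hypothetical and are not part of the "dataset->lmdb" project.

```python
from pathlib import PurePosixPath


def trait_db_path(lmdb_root: str, group: str, trait: str) -> PurePosixPath:
    """Build <lmdb_root>/<group>/<trait> for a single trait.

    Assumption: group and trait names are used verbatim as directory
    names, so names that would escape the storage root are rejected.
    """
    for part in (group, trait):
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    return PurePosixPath(lmdb_root) / group / trait
```

Rejecting empty names, path separators and `.`/`..` is a precaution, since trait identifiers come straight from user-uploaded files.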
Immediately after upload of the data from the .tsv files, the data will most likely have very little metadata attached. Some of the metadata that is assured to be present is:
The metadata is useful for searching for the data. The "metadata->rdf" project[4] is used for exporting the metadata to RDF and will need to be used to initialise the metadata for newly uploaded data.