Edit this page | Blame

Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.

At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that tracks the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See

/topics/data/precompute/steps

The mariadb database gets (re)updated from the computed data, parsing metadata and forcing an update. We only need to handle state to track creating batches for (offline) compute. As this can happen on a single machine we don't need to do anything real fancy. Just write out the last updated traitid, i.e., DataId from ProbeSetXRef which is indexed. Before doing anything we use Locus_old to identify updates by setting it to NULL. See the header of list-traits-to-compute.scm.

Tasks

See

topics/data/precompute/steps

Next, for running the full batch:

[X] Store all GEMMA values efficiently
[ ] Include metadata record in lmdb and as JSON file
[ ] Include metadata record on compute status
[ ] Remove junk from tarball - use lmdb?
[ ] List significant markers as metadata in lmdb
[ ] Reread below info
[ ] Submit jobs to PBS using CCWL
[ ] Report results to mariadb

And after:

[ ] Track metadata of computed datasets (in RDF?)
[ ] Compute significance with GEMMA or other LMM (bulkLMM?)
[ ] Store significance and significant values for processing
[ ] Update search & correlations to use these
[ ] Further optimize computations so they can run continuously in the background

Info

Original qtlreaper version for PublishData

precompute-publishdata

Original qtlreaper version for ProbeSetData

The original reaper precompute lives in

https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

More recent incarnations are at v8, including a PublishData version that can be found in

https://github.com/genenetwork/genenetwork2/tree/testing/scripts/maintenance

Note that the locations are on space:

cd /mount/space2/lily-clone/acenteno/GN-Data
ls -l
python QTL_Reaper_v8_space_good.py 116
--
python UPDATE_Mean_MySQL_tab.py
cd /mount/space2/lily-clone/gnshare/gn/web/webqtl/maintainance
ls -l
python QTL_Reaper_cal_lrs.py 7

The first task is to prepare an update script that can run a set at a time and compute GEMMA output (instead of reaper).

The script first fetches inbredsets

select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet LIMIT 5;
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName     | Name     | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family                                           | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
|  1 |           1 | BXD               | BXD      |         1 | BXD Family        |      2 | 1               | riset       | Reference Populations (replicate average, SE, N) |           1 |           0 | BXD           |
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2   |         1 | B6D2F2 OHSU Brain |      2 | 1               | intercross  | Crosses, AIL, HS                                 |           3 |           0 | NULL          |
|  4 |           4 | AXB/BXA           | AXBXA    |         1 | AXB/BXA Family    |      2 | 1               | NULL        | Reference Populations (replicate average, SE, N) |           1 |           0 | AXB           |
|  5 |           5 | AKXD              | AKXD     |         1 | AKXD Family       |      2 | 1               | NULL        | Reference Populations (replicate average, SE, N) |           1 |           0 | AKD           |
|  6 |           6 | B6BTBRF2          | B6BTBRF2 |         1 | B6BTBRF2          |      2 | 1               | intercross  | Crosses, AIL, HS                                 |           3 |           0 | BBT           |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+

MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today (let alone the latest?). Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == IndbredSet.Id.

There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06"

select max(distinct(ProbeSetFreezeId)) from ProbeSetXRef limit 5;
|                            1054 |

Next for this Id we fetch, known as `ProbeSetXRefInfos`:

MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+

Then we fetch the trait values:

MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.Id = 1 limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+

with names:

MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+

35 strains were used for this dataset:

select count(ProbeSetData.value) from ProbeSetData where ProbeSetData.Id = 1;
+---------------------------+
| count(ProbeSetData.value) |
+---------------------------+
|                        35 |
+---------------------------+

with genotypes (from files) and these phenotypes (from MySQL) qtlreaper is started and next we update the values for

select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old          | pValue_old | mean             |
se                  | Locus      | LRS             | pValue | additive    | h2   |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
|                1 |          1 |      1 | 10.095.400 | 13.3971627898894 |      0.163 | 5.48794285714286 |
0.08525787814808819 | rs13480619 | 12.590069931048 |  0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+

the actual update is

update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s

so Locus, LRS and additive fields are updated.

The old reaper scores are in

MariaDB [db_webqtl]> select ProbeSetId,ProbeSetFreezeId,Locus,LRS,additive from ProbeSetXRef limit 10;
+------------+------------------+----------------+------------------+--------------------+
| ProbeSetId | ProbeSetFreezeId | Locus          | LRS              | additive           |
+------------+------------------+----------------+------------------+--------------------+
|          1 |                1 | rs13480619     |  12.590069931048 |        -0.28515625 |
|          2 |                1 | rs29535974     | 10.5970737900941 | -0.116783333333333 |
|          3 |                1 | rs49742109     |  6.0970532702754 |  0.112957489878542 |
|          4 |                1 | rsm10000002321 | 11.7748675511731 | -0.157113725490196 |
|          5 |                1 | rsm10000019445 | 10.9232633740162 |  0.114764705882353 |
|          6 |                1 | rsm10000017255 | 8.45741703245224 | -0.200034412955466 |
|          7 |                1 | rs4178505      | 7.46477918183565 |  0.104331983805668 |
|          8 |                1 | rsm10000144086 | 12.1201771258006 | -0.134278431372548 |
|          9 |                1 | rsm10000014965 | 11.8837168740735 |  0.341458333333334 |
|         10 |                1 | rsm10000020208 | 10.2809848009836 | -0.173866666666667 |
+------------+------------------+----------------+------------------+--------------------+
10 rows in set (0.000 sec)

This means for every dataset one single maximum score gets stored(!)

From this exercise we can conclude:

Existing precomputed values are by linear regression (old QTLreaper)
Only one highest score is stored
Every time the script is run *all* datasets are recomputed
The '_old' named values in the table are not touched by the script
h2 and mean are not updated
No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in mariadb, but we can retain GEMMA output if stored efficiently.

Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently @5.6Gb and 48,307,650 rows, so we could decide to add columns to track different mappers. Something funny

select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+

half the fields are used. What to think of

MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean | se   | Locus          | LRS  | pValue | additive | h2   |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
|              589 |    8599010 | 70606737 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8593710 | 70606738 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8607637 | 70606739 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8594490 | 70606740 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8591994 | 70606741 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+

MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+

There is obviously more. I think this table can use some cleaning up?

Looking at the GN1 code base ProbeSetXRef is used in a great number of files. In GN2 it is:

wqflask/maintenance/quantile_normalize.py
wqflask/maintenance/generate_probesetfreeze_file.py
wqflask/base/mrna_assay_tissue_data.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/base/trait.py
wqflask/wqflask/do_search.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/base/data_set/dataset.py
wqflask/wqflask/correlation/pre_computes.py
wqflask/wqflask/api/router.py
wqflask/wqflask/show_trait/show_trait.py
scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

Let's visit these one by one to make sure there is no side-effect.

wqflask/maintenance/quantile_normalize.py

Appears to be a one-off to normalize a certain dataset in `/home/zas1024/cfw_data/`. Last time it was used is probably 2018.

wqflask/maintenance/generate_probesetfreeze_file.py

This one is even older and probably from 2013.

wqflask/base/mrna_assay_tissue_data.py

Another dinosaur, actually uses TissueProbeSetXRef, so not relevant.

wqflask/wqflask/update_search_results.py

This is for global 'gene' search making sure ProbeSet.Id = ProbeSetXRef.ProbeSetId AND ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id. LRS is passed on.

wqflask/wqflask/correlation/rust_correlation.py

Again making sure IDs match. LRS is passed on. This requires scrutiny FIXME

wqflask/base/trait.py

Trait class defines a trait in webqtl, can be either Microarray, Published phenotype, genotype, or user input trait. Fetches LRS.

wqflask/wqflask/do_search.py

Searches for genes with a QTL within the given LRS values

LRS searches can take 3 different forms: - LRS > (or <) min/max_LRS - LRS=(min_LRS max_LRS) - LRS=(min_LRS max_LRS chromosome start_Mb end_Mb) where min/max_LRS represent the range of LRS scores and start/end_Mb represent the range in megabases on the given chromosome

wqflask/base/data_set/mrnaassaydataset.py

Also search results.

wqflask/base/data_set/dataset.py

ID sync

wqflask/wqflask/correlation/pre_computes.py

ID sync

wqflask/wqflask/api/router.py

Fetch traits and sample data API (we may use this!)

wqflask/wqflask/show_trait/show_trait.py

Same.

The following are a punch of scripts:

scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

Using LRS

wqflask/wqflask/static/new/javascript/lod_chart.js

LRS is used to display a LOD chart

and in a bunch of python files that match above list for ProbeSetXRef:

scripts/maintenance/QTL_Reaper_v6.py
wqflask/base/webqtlConfig.py
wqflask/base/trait.py
wqflask/tests/unit/wqflask/marker_regression/test_display_mapping_results.py
wqflask/tests/unit/base/test_trait.py
wqflask/wqflask/parser.py
wqflask/tests/unit/wqflask/correlation/test_show_corr_results.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/show_corr_results.py
wqflask/wqflask/marker_regression/run_mapping.py
wqflask/wqflask/export_traits.py
wqflask/wqflask/gsearch.py
wqflask/base/data_set/phenotypedataset.py
wqflask/base/data_set/markers.py
wqflask/wqflask/api/router.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/wqflask/marker_regression/display_mapping_results.py
wqflask/wqflask/collect.py
wqflask/wqflask/do_search.py
wqflask/wqflask/views.py
wqflask/utility/Plot.py
test/requests/main_web_functionality.py
test/requests/correlation_tests.py

From above it can be concluded these precomputed values are used for display and for getting search results (filtering on datasets that have a QTL). It looks safe to start replacing qtlreaper results with GEMMA results. Only thing I am not sure about is correlations. A cursory inspection suggests LRS is only used for final output and that makes sense if correlations are done on (expression) phenotypes. Anyway, if things go haywire we'll find out soon (enough).

At the next stage we ignore all this and start precompute with GEMMA on the BXD.

Precompute DB

We will use a database to track precompute updates (see below).

On the computing host (or client) we should track the following:

Dataset (phenotypes)
time: MySQL time ProbeSetData table was last updated (if identical we can skip)
Hash on DB inputs (for DB table updates)
Genotypes
Algorithm
Hash on run inputs (phenotypes, genotypes, algorithm, invocation)
time: of initiate run
time: of completion
Hostname of run (this)
File path (this)
Hash on output data (for validation)
array of rec: DB hostnames + time stamps: successfully updated DB table for these servers

The logic is that if the DB table was changed we should recompute the hash on inputs. Note the ProbeSetData table is the largest at 200G including indices. If that hash changed the mapping algorithm should rerun. A new record will be created on the new inputs. A flag is set for the updated DB for precomputed values. We can garbage collect when multiple entries for `Dataset' exist.

What database should we use? Ultimately precompute is part of the main GN setup. So, one could argue MariaDB is a good choice. We would like to share precompute with other setups, however. Also we typically do not want to run the precompute on the production host. That means we should be able to rebuild the database from a precompute output directory and feed the update to the running (production) server. We want to track compute so we can distribute running the algorithms across servers and/or PBS. This implies the compute machines have to be able to query the DB in some way. Basically a machine has a 'runner' that checks the DB for updates and fetches phenotypes and genotypes. A run is started and on completion the DB is notified and updated. Note that a runner can not be parallel on a single results directory, so one runner per target output directory.

We can have different runners, one for local machine, one for PBS and one for remotes.

This also implies that 'state' is handled with the output dir (on a machine). We will create a JSON record for every compute. This state can be pushed into 'any' GN instance at any point. For running jobs the DB can be queried through a (same) REST API. The REST API has to handle the DB, so it has to be able to access DB and files, i.e. it should be on the same (production) machine.

On the DB we'll create a Hash table on inputs of ProbeSetData. This way we don't have to walk the giant table for every query. This can be handled through a CRON job. So there will be a table:

Dataset
Hash on relevant phenotypes (ProbeSetData)
time: Job started
host: host running job
status: (final) job status
time: DB updated

Tracking updates to DB

This brings us to CRON jobs. There are several things that ought to be updated when the DB changes. Xapian being one example and now this table. These should run on a regular basis and only update what really changed. We need to support that type of work out of the box with an hourly runner (for precompute) and a daily runner (for xapian).

One possibility is to use MySQL triggers to track changes to the database table. Even a hash can be computed as suggested here:

https://stackoverflow.com/questions/38732185/hash-of-two-columns-in-mysql

The problem is that this relative expensive computation done on every row update. I don't think it is feasible for updating phenotypes. Arthur may find the updates go very slow. It is possible, however, to just update a boolean value, or even a time stamp, that is cheap. That will probably increase the update time by 50-100% and that may be acceptable. We can only find out by trying.

But first I am going to simply generate the hashes from the running DB by scanning the full table. If that runs in minutes we should be good with a CRON job for now. Also, in the medium term, we may replace these tables with lmdb files and it is no good to depend on MysQL/MariaDB for this type of service.

Note that ProbeSetXRef is indexed on DataId, ProbeSetFreezeId, ProbeSetId and Locus. The following should be unique and ProbeSetFreezeId+ProbeSetId, indexed by unique DataId, is the actual dataset, so that should be reasonable fast.

`ProbeSetFreezeID` is a unique identifier to a dataset, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06". `ProbeSetId` is the trait in the collection, e.g. '100001_at'. `DataId` is the counter in ProbeSetXRef table.

So now this should be clear on update:

update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s

Every trait gets an LRS and effect size this way.

Writing script to Hash ProbeSetData

tux02 has a running database, so it is good for trials. The first step is to add hashes to ProbeSetData and compute them. Two questions we can ask: (1) should be host that data in the same table and (2) should we use another DB?

For (1) we should *not* use the same table. The current table is reasonably small and if we expand 28,335,955 rows on ProbeSetXRef it is going to be huge. This table is used a lot, so we should create a new table for this specific purpose. Now that means, even though this table is obviously linked to the running DB, we can actually use a different storage. And for this particular purpose I think sqlite will do great. Our preferred 'transient' databases, these days, are lmdb for disk b-trees, redis for in-RAM b-trees, sqlite for tables and virtuoso for RDF. This may change again, but for now these are great default choices. 'transient' means that the DB can be regenerated from some source of truth. This is not necessarily the case today - particularly authorization and session management are now using sqlite and redis as primary data sources. That is fine, as long as we are fully aware of the choices and what that implies.

We already use SQLite for authorization. Fred has been adding that to GN3 and now to its own authorization service:

https://github.com/genenetwork/gn-auth

In the README you can see the DB is located with a parameter 'AUTH_DB'. There is no example, so I check our production setup. Zach has it setup with

LOGLEVEL = "DEBUG"
SECRET_KEY = '9a4a2a...'
SQL_URI = "mysql://user:pwd@127.0.0.1:3306/db_webqtl"
AUTH_DB = "/home/gn2/auth_copy.db"
AUTH_MIGRATIONS = "/home/zas1024/gn-auth/migrations"

The DB file is world writeable - and we should probably change that. Also it needs a better storage location. Essentially GN should have a limited number of directories that handle state:

SQL server (mysql) - read-write
Other databases - read-write
Upload files - read-write
Reference files - read-only
Secret files - read-only
Cache files - read-write

Within these directories there should be clear subdirectories for each use case. So, for authorization we could have `/data/gn/databases/auth/production/auth.db'. Likewise, the precompute hash table will be in `/data/gn/databases/precompute/production/hashes.db' and we'll introduce an environment variable "GN_HASH_DB". We'll take care of that later.

For our hashing purposes we'll first create a DB named 'hashes.db'. This table can also be used for correlations etc. down the line by using the 'algorithm' field and its combined input hash. It is generic, in other words.

Getting precompute running

Now we have pretty much figured out what to do it is time to set up precompute. We can start with one single server that has access to mariadb. The new genoa Tux04 will be our replacement of Tux01. It has some larger Crucial P3 NVME SSDs rated at 3.5Gb/s. The Dell drive is 1.4 TB and is rated to be twice as fast. So we'll use that for the MariaDB now.

The first step is to make space for all the GN2+GN3 stuff. Next we test if we can query tux01 ports. tux04 can see tux01 because they are both inside the UTHSC firewall. I just need to allow the mysql port on tux01 and we keep using that DB for now.

The first phase is to simply start running gemma on tux05 and tux04 using the BXD. I mean, before computing 28,335,955 phenotypes we'll have to have a more serious compute setup!

MariaDB [db_webqtl]> select * from Strain where SpeciesId=1 limit 200;
+-----+------------+------------+-----------+--------+------------+
| Id  | Name       | Name2      | SpeciesId | Symbol | Alias      |
+-----+------------+------------+-----------+--------+------------+
|   1 | B6D2F1     | B6D2F1     |         1 | NULL   | NULL       |
|   2 | C57BL/6J   | C57BL/6J   |         1 | B6J    | P          |
|   3 | DBA/2J     | DBA/2J     |         1 | D2J    | P          |
|   4 | BXD1       | BXD1       |         1 | NULL   | NULL       |
|   5 | BXD2       | BXD2       |         1 | NULL   | NULL       |
|   6 | BXD5       | BXD5       |         1 | NULL   | NULL       |
|   7 | BXD6       | BXD6       |         1 | NULL   | NULL       |

select * from Strain where SpeciesId=1 and Name like '%BXD%';
returns 1371 rows

The originally named StrainXRef table links StrainID with SpeciesID:

MariaDB [db_webqtl]> select * from StrainXRef limit 3;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
+-------------+----------+---------+------------------+----------------+

Getting all BXD datasets

Together with the InbredSet table we have the information we need to fetch BXD datasets.

select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType | Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet limit 3;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName     | Name   | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
|  1 |           1 | BXD               | BXD    |         1 | BXD Family        |      2 | 1               |                    0 |           1 |           0 | BXD           |
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2 |         1 | B6D2F2 OHSU Brain |      2 | 1               |                    0 |           3 |           0 | NULL          |
|  4 |           4 | AXB/BXA           | AXBXA  |         1 | AXB/BXA Family    |      2 | 1               |                 NULL |           1 |           0 | AXB           |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+

MariaDB [db_webqtl]> select * from StrainXRef where InbredSetID=1 limit 5;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
|           1 |        4 |      40 | Y                | RI_progeny     |
|           1 |        5 |      50 | Y                | RI_progeny     |
+-------------+----------+---------+------------------+----------------+

select * from ProbeSetData where StrainId = 4 limit 5;

Fetches all data where strain `BXD1' was used. That is 14,829,435 of data points! Now it should be clear the database is designed to go from dataset -> strain and not the other way round. We can, however, walk ProbeSetXRef and get the ProbeSetId. Use that to query ProbeSetData and use the StrainID to make sure it belongs to the BXD for compute.

Next, the fields `Locus_old | LRS_old | pValue_old' are not used that I can tell from the source code. So, we could use these to track updates. I am going to set these _old fields in ProbeSetXRef. After checking the backups we will first set all Locus_old to NULL. If a value is NULL we can precompute. Afther precompute the current Locus_old value is taken from the reaper compute, as well as LRS_old and p_Value_old. We will add a column for additive_old.

Let's fetch one where Locus_old is now NULL:

select ProbeSetFreezeId,ProbeSetId,DataId,Locus_old,Locus from ProbeSetXRef where Locus_old is NULL limit 1;
+------------------+------------+--------+-----------+-----------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | Locus     |
+------------------+------------+--------+-----------+-----------+
|               31 |      48552 | 458384 | NULL      | D10Mit194 |
+------------------+------------+--------+-----------+-----------+
1 row in set (0.00 sec)

DataId is really ProbeSetDataId, so:

MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.id = 458384 limit 1;
+--------+----------+--------+
| Id     | StrainId | value  |
+--------+----------+--------+
| 458384 |       42 | 10.306 |
+--------+----------+--------+
1 row in set (0.00 sec)

And StrainId can be found in StrainXRef with

MariaDB [db_webqtl]> select * from Strain where Strain.id=42;
+----+---------+---------+-----------+--------+-------+
| Id | Name    | Name2   | SpeciesId | Symbol | Alias |
+----+---------+---------+-----------+--------+-------+
| 42 | CASE001 | CASE001 |         1 | NULL   | NULL  |
+----+---------+---------+-----------+--------+-------+

and

MariaDB [db_webqtl]> select * from StrainXRef where StrainId=42;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           2 |       42 |       1 | N                | NULL           |
+-------------+----------+---------+------------------+----------------+

MariaDB [db_webqtl]> select * from InbredSet where Id=2;;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
| Id | InbredSetId | InbredSetName     | Name   | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family           | FamilyOrder | MenuOrderId | InbredSetCode | Description |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2 |         1 | B6D2F2 OHSU Brain |      2 | 1               | intercross  | Crosses, AIL, HS |           3 |           0 | NULL          | NULL        |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+

which is not part of the BXD (InbredSetId=1).

So, StrainXRef is crucial to get to the actual BXD. Let's look how it is used in existing code. In GN2 it is used in 2 places only: router and sample list:

SELECT Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId = 1;

lists all strain *names* based on Strain.id. The nomenclature is bogglingly bad. Anyway, it confirms that we can use this to see if something is part of the BXD.

Get all the BXD used for mapping

SELECT StrainId,Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId = 1 AND Used_for_mapping="Y" ORDER BY StrainId;

After some thought I decided we will create a new table that mirrors what we'll do with sqlite down the line. The new table will also allow for referring to time stamps and multiple hits. So, in addition to time stamps and hash values, we'll add to the above update record:

top hit info (locus, LRS, p-value, effect)
number of other hits
file with other hits

And down the line we can point to a different DB table that records significant hits.

As a first check it will be interesting to see how these values differ between qtlreaper and GEMMA!

Create Hash table

This new Hash table can be called 'SignificantHits' or something.

ProbeSetXRef is defined as

CREATE TABLE ProbeSetXRef (
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0,
  Locus_old char(20) DEFAULT NULL,
  LRS_old double DEFAULT NULL,
  pValue_old double DEFAULT NULL,
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (DataId),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

Not all choices are that sane, but we'll leave it like it is for now (it is a small table). The new table will look like (LRS=Log wise ratios)

CREATE TABLE SignificantHits (
  Id INT UNSIGNED NOT NULL DEFAULT 0,
  ProbeSetFreezeId INT UNSIGNED NOT NULL DEFAULT 0, <- points to dataset info
  ProbeSetId INT UNSIGNED NOT NULL DEFAULT 0,       <- points to trait info
  ProbeSetDataId INT UNSIGNED NOT NULL DEFAULT 0,   <- id, strain value OR ID of the table?
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  significant double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (Id),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

Note prepending Id and a rename from DataId to ProbeSetDataId because they refer to the same:

MariaDB [db_webqtl]> select max(DataId) from ProbeSetXRef limit 10;
| max(DataId) |
|    92213693 |

select max(Id) from ProbeSetData limit 3;
| max(Id)  |
| 92213693 |

In the future this table will be the replacement for the badly named ProbeSetXRef

Creating a list of work on tux04

We'll use a test table first on tux02. It should not matter for precompute because we'll regenerate hashes from the DB input and GEMMA output. I.e., we only need to recompute if something has changed in phenotypes or genotypes and that is quite rare.

We will work on tux04 and I'll first install the database there. In the process we'll need to get the basic environment going that we need for work. First I thought I'd copy the backup from space - as I thought it has 10Gbs, but actually it uses 1Gbs and the backup is a month old (need to fix that too!).

Alright, let's use tux02 for the DB. On the new machine tux04 we disable /etc/profile and /etc/bash.bashrc - people can add that themselves and it messes up guix profiles. After creating my environment it looks like any other machine I use and I have the same keyboard short cuts.

Note that my UID may be wrong for Octopus, but that requires a more heavy surgery anyway.

Next we create a drop for the backups from another machine following:

../backup_drops.gmi

Copy the backup which takes a while over the slow network and unpack with the great tool borg

time borg extract --progress /export/backup/bacchus/drop/tux01/borg-tux01::borg-backup-mariadb-20231111-06:39-Sat

Next install a recent Mariadb from guix as a container using the instructions in

move the database files into, for example, /export/mysql/var/lib/mysql. chown files to your user account. Next

cd /export/mysql
mkdir tmp
mkdir run
tux04:/export/mysql$ ~/opt/guix-pull/bin/guix shell -C -N coreutils sed mariadb --share=/export/mysql/var=/var --share=/export/mysql/tmp=/tmp
    export TMPDIR=/tmp
    mysqld_safe --datadir='/var/lib/mysql/' --protocol tcp  --port=3307 --user=$USER --group=users --nowatch

and a client with:

/export/mysql$ ~/opt/guix-pull/bin/guix shell mysql -- mysql --socket=var/run/mysqld/mysqld.sock -uwebqtlout -pwebqtlout db_webqtl

Modifying the tables

Set unused 'Locus_old' to NULL because we use that to update the existing table with the new values for production (note this is a quick hack discussed earlier):

update ProbeSetXRef set Locus_old=NULL;

SELECT DISTINCT DataId from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId = ProbeSetData.Id where StrainId>45 AND Locus_old is NULL limit 10;

Preparing for GEMMA

Meanwhile I have prepared tux04 and tux05 for the new runs. Next step is to query the DB and run GEMMA.

Testing a script to access the database

A first script on tux02 runs:

(use-modules (dbi dbi))

(define db_webqtl (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))
(dbi-query db_webqtl "SELECT * FROM ProbeSetXRef LIMIT 1")
(display (dbi-get_row db_webqtl))

((ProbeSetFreezeId . 1) (ProbeSetId . 1) (DataId . 1) (Locus_old . 10.095.400) (LRS_old . 13.3971627898894) (pValue_old . 0.163) (mean . 5.48794285714286) (se . 0.08525787814808819) (Locus . rs13480619) (LRS . 12.590069931048) (pValue . 0.269) (additive . -0.28515625) (h2 . 0.0))

Note that you can run this over guile-geiser remote.

Install and run gemma

Now we have a working DB query method on a compute node we can start running GEMMA. GEMMA is packaged as part of GNU Guix, so that part is straightforward. After above guix pull

. ~/opt/guix-pull/etc/profile <- don't do this, it does not work on Debian Guix images!
git clone https://git.genenetwork.org/guix-bioinformatics/
git clone https://gitlab.inria.fr/guix-hpc/guix-past
~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -A gemma
  gemma-gn2       0.98.5-8cd4cdb  out   /home/wrk/guix-bioinformatics/gn/packages/gemma.scm:31:2

and install gemma with

wrk@tux05:~$ ~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -i gemma -p ~/opt/gemma

Creating the precompute-runner

The precompute-runner can be part of `gn-guile/scripts/precompute`, i.e., the new GN guile repo's maintenance script dir:

https://git.genenetwork.org/gn-guile/tree/scripts/precompute

See the README.md for more information on running the script etc.

For development I'll tunnel to the Tux02 database.

As we are doing the BXD's first we first fetch a record from ProbeSetXRef that has Locus_old set to NULL AND matches a BXD trait.

First we fetch all BXD strainids. I wrote a function `bxd-strain-id-names` for that.

Next, using

https://git.genenetwork.org/gn-guile/commit/?id=b1db013cc01c94e27edf982be9b027a2b0bb9712

we fetch the first BXD dataset where all strains are members of BXD:

((Locus . rs13480619) (DataId . 1) (ProbeSetId . 1))1
WE HAVE OUR FIRST BXD DATASET for precompute!((1 . 5.742) (2 . 5.006) (3 . 6.079) (4 . 6.414) (5 . 4.885) (6 . 4.719) (7 . 5.761) (8 . 5.604) (9 . 5.661) (10 . 5.708) (11 . 5.628) (12 . 6.325) (13 . 5.37) (14 . 6.544) (15 . 5.476) (16 . 5.248) (17 . 5.528) (19 . 5.51) (20 . 5.886) (21 . 5.177) (22 . 5.655) (23 . 5.522) (24 . 5.549) (25 . 4.588) (26 . 5.618) (28 . 6.335) (29 . 5.569) (30 . 4.422) (31 . 5.194) (35 . 4.784) (36 . 5.056) (37 . 5.869) (39 . 5.175) (40 . 5.207) (41 . 5.264))

If it is not a match we'll need to iterate to the next one. The current version of precompute can fetch all the BXD datasets by name and trait name.

https://git.genenetwork.org/gn-guile/tree/scripts/precompute/precompute-hits.scm

For GEMMA we need to write a phenotype file which is simply a vector of values. Meanwhile, the genotype file and SNP files for the BXD are fixed (currently version 8 with 21.056 markers).

To run GEMMA we can take an example from `gemma-wrapper`, a tool I wrote to cache results:

https://github.com/genetics-statistics/gemma-wrapper

gemma-wrapper does a bit more work than we really need and is a bit too generic for this precompute exercise. The following things matter:

handling of hashes to prevent recompute
handling of the kinship matrix (K)
handling of leave one chromosome out (LOCO)

all of which are handled by gemma-wrapper really well. We'll use gemma-wrapper for the time being to do the actual computations.

BTW the cache for GN production is currently at 190G with 40K files. That is 3 months worth of caching GEMMA as we clean up older files! For precompute we may need to be smarter because that only represents 970 association results and 400 kinship matrices. We should improve GEMMA by:

by default only keep the required results
provide a more compact storage of results

We should be using something like lmdb as that will speed up lookups and GEMMA itself. Simple compression will reduce output 5x for one 7Mb output. Some K matrices are huge too. The largest one runs at 0.5 Gb. GEMMA does crash once in a while, unsurprisingly.

For the BXD we need to compute at least 100K traits - I have not done a proper count yet - and we can delete output files when the database gets updated. We should, however, hang on to hits that are potentially significant, so I want to retain output, or create a new mysql table.

Run gemma-wrapper

We can run a local version of gemma-wrapper with

.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper ruby -- ruby /gemma-wrapper/bin/gemma-wrapper

this is useful because we want to change gemma-wrapper itself. Mostly to produce less bulky output and add some metadata for downstream processing (i.e. to update the database itself). Note how versatile `guix shell` is for development!

gemma-wrapper shows the following output:

NOTE: gemma-wrapper is soon to be replaced
Using GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020

Found 0.98.3, comparing against expected v0.98.4
GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020
WARNING: GEMMA version is out of date. Update GEMMA to 0.98.4!

Some irony there. We are going to replace gemma-wrapper at some point. And the gemma that is packaged in guix by default it older that what we want to use. Though the release notes point out mostly maintenance updates:

https://github.com/genetics-statistics/GEMMA/releases

We can pull in a local gemma too using the same `--expose` trick above. So, let's use that for now:

.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper --expose=/home/wrk/iwrk/opensource/code/genetics/gemma/=/gemma ruby -- /gemma/bin/gemma

This does not work out of the box because I typically build gemma on a different Guix profile, so it won't find the libraries (RPATH is fixed). There are two options, build a static version of gemma or override the LD_LIBRARY_PATH. The first used to work, but Guix no longer carries static libs for openblas, so I would need to provide those myself. Overriding LD_LIBRARY_PATH is a bit inconvenient, but in this case doable. We have to run from inside the container and set:

env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib/ /gemma/bin/gemma

GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022

 type ./gemma -h [num] for detailed help
 options:
  1: quick guide
  2: file I/O related
  3: SNP QC
  ...

and success. With gemma wrapper we can pick up the latest gemma by setting the GEMMA_COMMAND

(system "env GEMMA_COMMAND=/gemma/bin/gemma /gemma-wrapper/bin/gemma-wrapper")
gemma-wrapper 0.99.6 (Ruby 3.2.2) by Pjotr Prins 2017-2022

NOTE: gemma-wrapper is soon to be replaced
Using GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022

All set for actual compute!

Running gemma

The following command fails because we don't have a pheno file yet:

gemma-wrapper --debug -- -gk -g BXD.8_geno.txt.gz -p pheno.txt -a BXD.8_snps.txt

Now the pheno file needs to match the number of genotypes in the geno file which, presumably, is generated from the file `BXD.8.gene`. It has the header:

BXD1	BXD2	BXD5	BXD6	BXD8	BXD9	BXD11	BXD12	BXD13	BXD14	BXD15	BXD16	BXD18	BXD19	BXD20	BXD21	BXD22	BXD23	BXD24	BXD24a	BXD25	BXD27	BXD28	BXD29	BXD30	BXD31	BXD32	BXD33	BXD34	BXD35	BXD36	BXD37	BXD38	BXD39	BXD40	BXD41	BXD42	BXD43	BXD44	BXD45	BXD48	BXD48a	BXD49	BXD50	BXD51	BXD52	BXD53	BXD54	BXD55	BXD56	BXD59	BXD60	BXD61	BXD62	BXD63	BXD64	BXD65	BXD65a	BXD65b	BXD66	BXD67	BXD68	BXD69	BXD70	BXD71	BXD72	BXD73	BXD73a	BXD73b	BXD74	BXD75	BXD76	BXD77	BXD78	BXD79	BXD81	BXD83	BXD84	BXD85	BXD86	BXD87	BXD88	BXD89	BXD90	BXD91	BXD93	BXD94	BXD95	BXD98	BXD99	BXD100	BXD101	BXD102	BXD104	BXD105	BXD106	BXD107	BXD108	BXD109	BXD110	BXD111	BXD112	BXD113	BXD114	BXD115	BXD116	BXD117	BXD119	BXD120	BXD121	BXD122	BXD123	BXD124	BXD125	BXD126	BXD127	BXD128	BXD128a	BXD130	BXD131	BXD132	BXD133	BXD134	BXD135	BXD136	BXD137	BXD138	BXD139	BXD141	BXD142	BXD144	BXD145	BXD146	BXD147	BXD148	BXD149	BXD150	BXD151	BXD152	BXD153	BXD154	BXD155	BXD156	BXD157	BXD160	BXD161	BXD162	BXD165	BXD168	BXD169	BXD170	BXD171	BXD172	BXD173	BXD174	BXD175	BXD176	BXD177	BXD178	BXD180	BXD181	BXD183	BXD184	BXD186	BXD187	BXD188	BXD189	BXD190	BXD191	BXD192	BXD193	BXD194	BXD195	BXD196	BXD197	BXD198	BXD199	BXD200	BXD201	BXD202	BXD203	BXD204	BXD205	BXD206	BXD207	BXD208	BXD209	BXD210	BXD211	BXD212	BXD213	BXD214	BXD215	BXD216	BXD217	BXD218	BXD219	BXD220	C57BL/6JxBXD037F1	BXD001xBXD065aF1	BXD009xBXD170F1	BXD009xBXD172F1	BXD012xBXD002F1	BXD012xBXD021F1	BXD020xBXD012F1	BXD021xBXD002F1	BXD024xBXD034F1	BXD032xBXD005F1	BXD032xBXD028F1	BXD032xBXD65bF1	BXD034xBXD024F1	BXD034xBXD073F1	BXD055xBXD074F1	BXD055xBXD65bF1	BXD061xBXD071F1	BXD062xBXD077F1	BXD065xBXD077F1	BXD069xBXD090F1	BXD071xBXD061F1	BXD073bxBXD065F1	BXD073bxBXD077F1	BXD073xBXD034F1	BXD073xBXD065F1	BXD073xBXD077F1	BXD074xBXD055F1	BXD077xBXD062F1	BXD083xBXD045F1	BXD087xBXD100F1	BXD065bxBXD055F1	BXD102xBXD077F1	BXD102xBXD73bF1	BXD170xBXD172F1	BXD172xBXD197F1	BXD197xBXD009F1	BXD197xBXD170F1

with 235 named BXD mice. Now when we check the DB we get

select * from StrainXRef WHERE InbredSetId=1 AND Used_for_mapping="Y"
276 rows

so it looks like we need some extra logic to fetch the used individuals from the actual genotype file(s).

For development I am running:

.guix-shell ruby --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper --expose=/home/wrk/iwrk/opensource/code/genetics/gemma/=/gemma -- guile -L . -s ./scripts/precompute/precompute-hits.scm

Writing the phenotype file

For gemma we need to feed a phenotype file that has only the individuals that are in the genotype file (the other 'missing' phenotype values should be NAs).

The genotype header in GN2 is read from a `BXD.json` file. Code is at wqflask/wqflask/marker_regression/run_mapping.py:get_genofile_samplelist(dataset)

Zach writes:

The JSON files were basically just a "stopgap" solution to allow for multiple genotype files for any given group (which previously wasn't possible), and they don't even exist for most groups - the code uses them only if they exist for a given group. They contain the title/label, filename, and sample lists for situations where the other genotype files only contain a subset of the samples from the full sample list, but otherwise (and usually) the samplelist is just fetched from the .geno file. It's also possible to fetch that subset of samples from the other/alternate .geno files, but it's easier to just fetch the samples as a list from that .json file than it is to parse it out of the other .geno files (and the JSON file would still be used to to get the "title" and filenames, so it's already being read in the code). Basically the JSON files are just a sort of "genotype file metadata" thing to deal with the existence of multiple genotype files for a single group and provide the "attributes" of the files in question (namely the title and filename - the sample lists were only added later once we started to get genotype files that didn't contain the full samplelist for the group in question). The "titles" appear in the drop-down on the trait page* - in their absence it will just use "{group_name}.geno" (for example BXD.geno). So if you deleted BXD.json, mapping would still work for BXD, but it would just default to BXD.geno and wouldn't give the option of other files. This information could probably also be stored somehow in the .geno file, but I'm not sure how to easily do so since that's a weird proprietary format and would seemingly require some sort of direct parsing in the GN code (while JSON is more simple to deal with).

https://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P

If you scroll down to mapping and Inspect Element on the Genotypes drop-down you'll see how the values are the actual filenames while the text is the title from the JSON file

So this is genotype metadata which could go into virtuoso (where all metadata belongs). Anyway, we'll get there when the time comes, but for now I can use it to see what genotypes we have in the genotype file and write the phenotype file accordingly.

I note we also have a GenoData table in mariadb which has allele info. The script

https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/load_genotypes.py

loads these from the BXD.geno-type files. As far as I can tell GenoData is not used in GN2+GN3. Other Geno files are basically used to fetch metadata on genotypes.

Genotype state lives in 4 places. Time to create a 5th one with lmdb ;). At least the BXD.json file lines up with BXD.8.geno and the BIMBAM version with 235 inds.

Using this information we created our first phenotype file and GEMMA run!

Successfully running gemma-wrapper

Running the wrapper is a two step process:

env TMPDIR=tmp ruby ./bin/gemma-wrapper --force --json \
        --loco -- \
        -g test/data/input/BXD_geno.txt.gz \
        -p test/data/input/BXD_pheno.txt \
        -a test/data/input/BXD_snps.txt \
        -gk > K.json
env TMPDIR=tmp ruby ./bin/gemma-wrapper --json --input K.json \
        -- \
        -g test/data/input/BXD_geno.txt.gz \
        -p test/data/input/BXD_pheno.txt \
        -a test/data/input/BXD_snps.txt \
        -lmm -maf 0.1 > G.json

The current LOCO approach leads to two files for every chromosome GRM and two files for every chr association output file. In this case 82 files with a total of 13Mb (4Mb compressed). That is a bit insane if you know the input is 300K, even knowing disk space is cheap(!) Cheap is not always cheap because we still need to process the data and with growing datasets the size grows rapidly overall.

So, this is the right time to put gemma-wrapper on a diet. The GRM files are largest. Currently we create kinship files for every population subset that is used and that may change once we simply reduce the final GRM by removing cols/rows. But that is one exercise we want to prove first using our precompute exercise. In this case we will simply compress the kinship files and that halves the size with zip. xz compression brings it down to 1/4. That is impressive by itself. I also checked lmza and bzip2 and they were no better. So, with gemma-wrapper we can now store the GRMs in an xz archive. For the assoc files we will cat them in to one file and compress that too, reducing the size to 1/7th. As noted above, the current cache size for GN is 190Gb for 3 months. We can reduce that significantly and that will speed up lookups. Decompression with xz is very fast.

Storing GRM output

The current version of gemma-wrapper stores per chromosome GRMs in separate files. The first fix was to store them in an xz archive. gemma-wrapper already uses a temporary directory so, that was straightforward. Next I had to tell gemma-wrapper not to recompute when the xz archive exists. This now works.

Next step is to use the archive to check the GWA run. In the final step we will archive results and write an lmdb file for further processing. If we can make it really small we'll retain that instead of the full archive. This code is part of gemma-wrapper right now:

https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma2lmdb.py

I looked at using SQLITE - which would have been a bit easier - but the size would have been at least double the size of lmdb (floats are stored as 8 bytes in SQLITE).

Inside the lmdb file we also store the results of the log files as a JSON 'meta' record. A metadata record is important because it allows for quick assessments later, such has how long the compute took at each step and how much memory was used. We collect metadata at every step. This means that we pass around a growing JSON object. Also we should be able to expose the JSON easily so it can be parsed with python, ruby, jq etc. With metadata we can store some extra information on the ProbeSet:

MariaDB [db_webqtl]> select Id,Chr,Mb,Name,Symbol,description from ProbeSet where Id=1 limit 1;
+----+------+-----------+-----------+--------+---------------------------------+
| Id | Chr  | Mb        | Name      | Symbol | description                     |
+----+------+-----------+-----------+--------+---------------------------------+
|  1 | 9    | 44.970689 | 100001_at | Cd3g   | CD3d antigen, gamma polypeptide |
+----+------+-----------+-----------+--------+---------------------------------+

and the dataset:

MariaDB [db_webqtl]> select * from ProbeSetFreeze WHERE Name='HC_M2_0606_P' limit 5;
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
| Id  | ProbeFreezeId | AvgID | Name         | Name2                              | FullName                                   | ShortName                         | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale |
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
| 112 |            30 |     2 | HC_M2_0606_P | Hippocampus_M430_V2_BXD_PDNN_Jun06 | Hippocampus Consortium M430v2 (Jun06) PDNN | Hippocampus M430v2 BXD 06/06 PDNN | 2016-02-11 |         1 |      2 |               0 |                 | log2      |
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+

gemma-wrapper outputs JSON - but it is fairly elaborate so we'll reduce it to a minimum. At the time of lmdb creation we will pass in a small JSON file that describes the gemma-wrapper run.

Storing assoc output

To kick off precompute we added new nodes to the Octopus cluster: doubling its capacity. In the next step we have to compress the output of GEMMA so we can keep it forever. For this we want to have the peaks (obviously), but we als want to retain the 'shape' of the distribution - i.e., the QTL with sign. This shape we can use for correlations and potentially some AI-style mining. The way it is presented in AraQTL.

For the sign we can use the SNP additive effect estimate. The se of Beta is a function of MAF of the SNP. So if you want to present Beta as the SNP additive effect for a standardized genotype, then you want to use Beta/se; otherwise, Beta is the SNP additive effect for the original, unstandardized genotype. The Beta is obtained by controlling for population structure. For effect sign we need to check the incoming genotypes because they may have been switched.

Anyway, we can consider compressing the shape the way a CDROM is compressed. For now, in the next step we will compress the storage using xz - as discussed above. In the next step, after running the precompute and storing the highest hit in the DB, we will start using lmdb to store assoc values.

The precompute runs on tux04 with

tux04:~/services/gn-guile$ . .guix-shell ruby --expose=/home/wrk/services/gemma-wrapper=/gemma-wrapper --share=/export2/precompute-gemma -- env TMPDIR=/export2/precompute-gemma guile -L . -s ./scripts/precompute/precompute-hits.scm

Final tweaks

The precompute runs and updates the DB. We are creating a new precompute table that contains the highest hits, so we can track status of the runs. From this we can update ProbeSetXRef - rather than doing it directly.

That way we can also track the top hits for different computation.

For one result we have

  "meta": {
    "type": "gemma-wrapper",
    "version": "0.99.7-pre1",
    "population": "BXD",
    "name": "HC_U_0304_R",
    "trait": "101500_at",
    "url": "https://genenetwork.org/show_trait?trait_id=101500_at&dataset=HC_U_0304_R",
    "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
    "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-gemma-GWA.tar.xz",
    "dataid": 75629,
    "probesetid": 1097,
    "probesetfreezeid": 7
  }

Continue with steps:

/topics/data/precompute/steps

Notes

1 rsm10000000577 3.20 Chr1: 69.315198 -0.248

chr,pos,af,beta,se,l_mle,p_lrt

1,69315198,0.2800000011920929,0.49527430534362793,0.13411599397659302,100000.0,0.0006254952750168741 Math.log(0.0006254952750168741,10)

-3.203775966613006

-0.4952743053436279/2

-0.24763715267181394

5 rsm10000001990 3.11 Chr4: 81.464858 0.282

4,81464858,0.36500000953674316,-0.5631462931632996,0.15581850707530975,100000.0,0.0007759517757222056 Math.log(0.0007759517757222056,10)

-3.1101652686754857

0.563146/2

0.281573

The likelihood ratio is bounded between zero and one. The l_lrt runs between 0.0 and 1.0. The smaller the number the higher the log10 value (basically the number of digits; 0.000001 is -5.99

The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. In our case it is always exactly 10^5 or -10^5 for some reason. According to GEMMA doc that means it is a pure genetic effect when positive.

Xapian

We want to add the log values, effect size and number of individuals to the xapian search

NAs in GN

A note from Zach:

On Sat, Dec 09, 2023 at 06:09:56PM -0600, Zachary Sloan wrote:

(After typing the rest of this out, I realized that part of the confusion might be about how locations are stored. We don't actually database locations in the ProbeSetXRef table - we only database the peak Locus marker name. This is then cross-referenced against the Geno table, where the actual Location is stored. This is the main source of the problem. So I think the best short-term solution might be to just directly database the locations in the ProbeSetXRef table. Those locations might become out of date, but as you mention they'd still probably be in the same ballpark.)

It is logical to store the location with the peak - if it changes we should recompute. That also adds the idea that we should track the version of the genotypes in that table.

More complicated datasets

A good dataset to take apart is

http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.

Variations

This paper discusses a number of approaches that may be interesting:

Automated quantitative trait locus analysis (AutoQTL)

Fetch strain IDs

The following join will fetch StrainID in a dataset

MariaDB [db_webqtl]> select StrainId, Locus, DataId, ProbeSetId, ProbeSetFreezeId from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL ORDER BY DataId LIMIT 5;
+----------+-----------+--------+------------+------------------+
| StrainId | Locus     | DataId | ProbeSetId | ProbeSetFreezeId |
+----------+-----------+--------+------------+------------------+
|        1 | rs6394483 | 115467 |      13825 |                8 |
|        2 | rs6394483 | 115467 |      13825 |                8 |
|        3 | rs6394483 | 115467 |      13825 |                8 |
|        4 | rs6394483 | 115467 |      13825 |                8 |
|        5 | rs6394483 | 115467 |      13825 |                8 |
+----------+-----------+--------+------------+------------------+
5 rows in set (0.205 sec)

Count data sets

Using above line we can count the number of times BXD1 was used:

MariaDB [db_webqtl]> select count(StrainId) from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL AND StrainId=1 ORDER BY DataId LIMIT 5;
+-----------------+
| count(StrainId) |
+-----------------+
|        10432197 |
+-----------------+
1 row in set (39.545 sec)

Or the main BXDs