Correlation in gn2 has regressed when compared gn1
Issue experienced by users include
[x] separation of concerns split between correlation code and code to database part for easier debug
[x] optimize db queries
[x] Cache for huge datasets in text files
[x] Cache for traits metadata
[x] refactor data structures used
[x] limit number of results rendered to user
[] implement parralel computation for correlation
[] Server side pagination
As Rob has pointed out before, gn2 is much much slower than gn1. Before, we mistakenly thought that it was because that it only computed one of the correlations; but Zach correctly pointed out that it, gn1, did in fact still compute all correlations in a similar fashion to gn2.
The problems we have with gn2 are 2-fold:
- Slow computations - UI crashing on our users for huge datasets
We took a step back; tried to probe deeper how we do correlations. To do a correlation, we need to run a query on the entire dataset. After running a query on this dataset, we additionally fetch metadata on this dataset as seen here:
This takes a long time: it's our biggest bottleneck.
For sample correlation we call this function to fetch the data:
IMO this seems to be the main issue among all queries.
For tissue correlation we call this function to fetch the data this doesn't take much time less than 20 seconds to create instance and fetch results.
For lit correlation, we fetch the correlation from the DB no computation happens
Assume a user selects "sample correlation" in the form with limit 2000, they will fetch the results for the entire sample dataset to compute the sample correlation; then filter the top 2000 traits. Fetch the tissue input for them then do the correlation then fetch lit results for them.
ATM, we know that our datasets are immutable unless @Acenteno updates things. So why don't we just cache the results of such queries in Redis, or in some json file. And use those instead of running the query on every computation? A file look-up or a Redis look-up would be much faster than what we already have.
Also, another thing that could be improved on is replacing some basic data-structures used during the computations with more efficient ones. As an example, it makes little sense to use a list that holds a huge number of elements, when we could use a generator instead, or depending on the application, a more appropriate structure. That could shave some more seconds.
Something else worth mentioning is that the fast correlations that used parallelisation produced bugs in gn2 could be re-written in a more reliable way using threads-- that's what IIRC what gn1 did. So that's something worth exploring too.
WRT the UI crashing, we rely too much on Javascript (DataTables). AFAICT, the massive results we get from the correlations are sent to DataTables to process. That's asking too much! We brainstormed on some high level ideas on how to do this. One of them being to have the results stored somewhere. And to build pagination off those results. Now that's up to Alex to decide how to go about it.
Something cool that Alex pointed is an interesting "manual" testing mechanism which he can feel free to try out: Separate the actual "computation" and the "pre-fetching" in code. And see what takes time.
Atm GN2 is un-usable for Rob for basic tours and show-and-tells, and it is a persistent problem that is getting worse the more he complains. Correlation is slower than it was ever before; and search is broken. For a simple search of 10,000 phenotypes, it takes a lot of time to compute.
According to Rob, GN1 does not rely on a cache. Instead it is computing from a materialized view of the database that is intentionally designed for a fast web service.
Most of the above issue have been addressed
correlation speed has greatly improved no complain't on the issue as of 12/04/2022
for example the dataset below no longer crashes for this datashe computa
Also, wrt to parralel computation implementation in python leads to memory error for forked processes and is best implemented in a different language if the issue arises