Edit this page | Blame

Data Integrity for Table Identifiers

Tags

  • type: bug
  • assigned: fredm
  • priority: critical
  • status: closed, completed
  • keywords: data integrity, mariadb, gn-uploader

Description

The use of

SELECT MAX(Id) FROM …

to get the identifier for a new row is not very robust, especially for auto-increment fields.

It also leads to problems when we have tables that have multiple identifiers, e.g table 'InbredSet' with identifiers 'Id' and 'InbredSetId'.

Instead, we need to do something like:

︙
cursor.execute("INSERT INTO InbredSet(…) …")
new_id = cursor.lastrowid
cursor.execute("UPDATE InbredSet SET InbredSetId=%s WHERE Id=%s",
               (new_id, new_id))
︙

To see the modules that need to be updated, do

$ cd /path/to/gn-uploader/
$ find ./ -name '*.py' -print0 | xargs --no-run-if-empty --null --max-procs=7 grep 'MAX(Id)'

The changes proposed above only work for single inserts; it will not work for `executemany()` calls like those present in the following files:

  • scripts/insert_data.py
  • scripts/rqtl2/install_phenos.py
  • scripts/rqtl2/install_genotypes.py

For these, we have to retain the `SELECT MAX(Id) …` form.

(made with skribilo)