Edit this page | Blame

Inspect Discrepancies Between Xapian and SQL Search.

Description

When doing a Xapian search, we miss some data that is available from the SQL Search. The searches we tested:

We miss the following entries from the Xapian search:

15	1423803_s_at	Gltscr2	glioma tumor suppressor candidate region gene 2
16	1451121_a_at	Gltscr2	glioma tumor suppressor candidate region 2; exons 8 and 9
17	1452409_at	Gltscr2	glioma tumor suppressor candidate region gene 2
25	1416556_at	Sas	sarcoma amplified sequence
26	1430029_a_at	Sas	sarcoma amplified sequence

We want to figure out why there is a discrepancy between the 2 searches above.

Resolution

Use "quest" to search for one of the symbols that don't appear in the Xapian search to get the exact document id:

quest --msize=2 -s en --boolean-prefix="iden:Qgene:" "iden:"1423803_s_at:hc_m2_0606_p"" \
--db=/export/data/genenetwork-xapian/

Parsed Query: Query(0 * Qgene:1423803_s_at:hc_m2_0606_p)
Exactly 1 matches
MSet:
9665867: [0]
{
  "name": "1423803_s_at",
  "symbol": "Gltscr2",
  "description": "glioma tumor suppressor candidate region gene 2",
  "chr": "1",
  "mb": 4.687986,
  "dataset": "HC_M2_0606_P",
  "dataset_fullname": "Hippocampus Consortium M430v2 (Jun06) PDNN",
  "species": "mouse",
  "group": "BXD",
  "tissue": "Hippocampus mRNA",
  "mean": 11.749030303030299,
  "lrs": 11.3847971289981,
  "additive": -0.0650828877005346,
  "geno_chr": "5",
  "geno_mb": 137.010795
}

From the retrieved document-id, use "xapian-delve" to inspect the terms inside the index:

xapian-delve -r 9665867 -d /export/data/genenetwork-xapian/

Data for record #9665867:
{
  "name": "1423803_s_at",
  "symbol": "Gltscr2",
  "description": "glioma tumor suppressor candidate region gene 2",
  "chr": "1",
  "mb": 4.687986,
  "dataset": "HC_M2_0606_P",
  "dataset_fullname": "Hippocampus Consortium M430v2 (Jun06) PDNN",
  "species": "mouse",
  "group": "BXD",
  "tissue": "Hippocampus mRNA",
  "mean": 11.749030303030299,
  "lrs": 11.3847971289981,
  "additive": -0.0650828877005346,
  "geno_chr": "5",
  "geno_mb": 137.010795
}
Term List for record #9665867: 1423803_s_at 2 5330430h08rik
9430097c02rik Qgene:1423803_s_at:hc_m2_0606_p
XC1 XDShc_m2_0606_p XGbxd XIhippocampus XImrna XPC5
XSmouse XTgene XYgltscr2 ZXDShc_m2_0606_p ZXGbxd
ZXIhippocampus ZXImrna ZXSmous ZXYgltscr2 Zbc017637
Zbxd Zcandid Zgene Zglioma Zgltscr2 Zhc_m2_0606_p
Zhippocampus Zmous Zmrna Zregion Zsuppressor Ztumor
bc017637 bxd candidate gene glioma gltscr2
hc_m2_0606_p hippocampus mouse mrna
region suppressor tumor

We have no wiki (XWK) entries from the above. When transforming to TTL files from SQL, we have symbols that exist in the GeneRIF table that do not exist in the GeneRIF_BASIC table:

SELECT COUNT(symbol) FROM GeneRIF WHERE
symbol NOT IN (SELECT symbol FROM GeneRIF_BASIC)
GROUP BY BINARY symbol;

Consequently, this means that after transforming to TTL files, we have some missing RDF entries that map a symbol (subject) to it's real name (object). When building the RDF cache, we thereby have some missing RIF/WIKI entries, and some entries are not indexed. This patch fixes the aforementioned error with missing symbols:

Now these 2 queries return the same exact results:

However, Xapian search is case insensitive while the SQL search is case sensitive:

Another reason for discrepancies between search results, E.g.

is that Xapian performs stemming on the search terms. For example, in the above wiki search for "diabetes", Xapian will stem "diabetes" to "diabet" thereby matching "diabetic", "diabetes", or any other word variation of "diabetes."

Ordering of Results

The ordering in the Xapian search and SQL search is different. By default, SQL orders by Symbol where we have:

[...] ORDER BY ProbeSet.symbol ASC

However, Xapian orders search results by decreasing relevance score. This is configurable.

  • closed
(made with skribilo)