
Prompt As UI

Description

  • New search page with a ChatGPT-like UX customised for GN that presents significant hits and metadata. Goal: a better search experience for GN users. Feed in metadata. Move away from Xapian-syntax search to natural language; hits should return relevant GN links based on metadata.
  • Build agent/model to improve GN AI experience (related to above).
  • Add pre-compute hits (already in RDF) when available.

Example:

Related tasks:

Tasks

  • [X] (bonfacem) Share existing TTL files with JoM.
  • [X] (johannesm) Build search corpus with phenotype metadata.
  • [X] (johannesm) Design RAG.
  • [X] (johannesm) Create system prompts.
  • [ ] (johannesm) Test RAG locally with phenotype-related queries.
  • [ ] (johannesm, bonfacem) Extend corpus to molecular traits, genotypes, precompute data and RIFs.
  • [ ] (johannesm, bonfacem) Add a REST API for serving AI-generated answers.
  • [ ] (johannesm, bonfacem) Do UI work to integrate the above search into GN mainline. No auth for now (security: use secret keys; maybe set up auth later).

Build search corpus with phenotype metadata

Given the extensive training of LLMs on text data, a naive approach is to convert the RDF graph related to phenotypes into text. Bonz and I agreed on this approach.

I wrote something similar some time back for GNAgent:

This script uses TTL files extracted from the SPARQL endpoint. As such, prefixes were replaced with full namespaces in the results.

In our case, I think I can just work with the TTL files originally generated by Bonz. For this, however, I need to adapt the script for the reason mentioned above: remove the namespace handling from the logic and work only with prefixes. The adapted code is at:

Running it on balg01 generated a dictionary whose keys are subjects and whose values are lists of predicates associated with the corresponding subject. Redundant objects for a given subject were discarded.
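That grouping step can be sketched as follows (illustrative only: group_triples is a hypothetical helper, and it assumes the triples have already been parsed into prefixed (subject, predicate, object) tuples):

from collections import defaultdict

def group_triples(triples):
    """Group prefixed triples by subject, deduplicating identical
    predicate/object strings for each subject."""
    collection = defaultdict(list)
    for subject, predicate, obj in triples:
        entry = f"{predicate} {obj}"
        if entry not in collection[subject]:  # discard redundant objects
            collection[subject].append(entry)
    return dict(collection)

triples = [
    ("gnt:family", "rdfs:domain", "gnc:species"),
    ("gnt:family", "rdfs:domain", "gnc:species"),  # duplicate, dropped
    ("gnt:family", "a", "owl:ObjectProperty"),
]
collection = group_triples(triples)
# collection["gnt:family"] -> ["rdfs:domain gnc:species", "a owl:ObjectProperty"]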

Next, subjects and objects were linked to form English-like sentences with the following logic, at first build time:

from tqdm import tqdm  # progress bar over subjects

docs = []
for key in tqdm(collection):
    concat = ""
    for value in collection[key]:
        text = f"{key} is/has {value}. "
        concat += text
    docs.append(concat)

See function corpus_to_docs of:

Documents look like:

gnc:set is/has skos:member gn:set_B6MRLF2_D2MRLF2 .
gnc:set is/has skos:member gn:set_MAGIC_Lines .
gnt:family is/has a owl:ObjectProperty .
gnt:family is/has rdfs:domain gnc:species .
gnt:family is/has skos:definition This resource belongs to this family .
gnt:family is/has rdfs:domain gnc:set .
gnt:short_name is/has a owl:ObjectProperty .

Design RAG

I built a simple RAG system that answers a question based on a corpus. Given the fragility of LLM systems, I leveraged the DSPy framework, which should also make it easy to switch between proprietary and open models.

You can inspect full implementation details at:
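Independent of DSPy, the retrieval half of such a RAG pipeline can be sketched with a simple token-overlap scorer (illustrative only; the actual implementation may rank documents differently, e.g. with embeddings):

import re

def tokens(text):
    """Lowercase and split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, docs, k=2):
    """Return the k documents with the largest token overlap with the query."""
    scored = sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)
    return scored[:k]

docs = [
    "gnt:family is/has rdfs:domain gnc:species .",
    "gnc:set is/has skos:member gn:set_MAGIC_Lines .",
]
context = retrieve("which family does this species belong to?", docs, k=1)
# context[0] is the gnt:family document

The retrieved documents are then passed to the LLM as context alongside the question and the system prompt.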

Create system prompt

To get the system to return a concise answer and point to specific URLs for verification, the system prompt is probably the most important and most dynamic part. My first draft had the following instructions:

You are an expert in biology and genomics. You excel at leveraging the data or context you have been given to address any user query.
Give an accurate and elaborate response to the query below.
In addition, provide links that the users can visit to verify information or dig deeper. To build links, you must replace RDF prefixes with namespaces.

Below is the mapping of prefixes and namespaces:
gn => http://rdf.genenetwork.org/v1/id
gnc => http://rdf.genenetwork.org/v1/category
owl => http://www.w3.org/2002/07/owl
gnt => http://rdf.genenetwork.org/v1/term
skos => http://www.w3.org/2004/02/skos/core
xkos => http://rdf-vocabulary.ddialliance.org/xkos
rdf => http://www.w3.org/1999/02/22-rdf-syntax-ns
rdfs => http://www.w3.org/2000/01/rdf-schema
taxon => http://purl.uniprot.org/taxonomy
dcat => http://www.w3.org/ns/dcat
dct => http://purl.org/dc/terms
xsd => http://www.w3.org/2001/XMLSchema
sdmx-measure => http://purl.org/linked-data/sdmx/2009/measure
qb => http://purl.org/linked-data/cube
pubmed => http://rdf.ncbi.nlm.nih.gov/pubmed
v => http://www.w3.org/2006/vcard/ns
foaf => http://xmlns.com/foaf/0.1
geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc

Do not make any mistakes.

I provided the mapping between prefixes and namespaces to teach the model how to generate the URLs. I will probably have to run a number of experiments :)
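The prefix-to-namespace expansion the prompt asks the model to perform can be sketched in code (a minimal sketch: expand is a hypothetical helper, only a subset of the mapping is shown, and the "/" join separator is an assumption):

# Subset of the prefix-to-namespace mapping from the system prompt.
PREFIX_TO_NAMESPACE = {
    "gn": "http://rdf.genenetwork.org/v1/id",
    "gnc": "http://rdf.genenetwork.org/v1/category",
    "gnt": "http://rdf.genenetwork.org/v1/term",
    "taxon": "http://purl.uniprot.org/taxonomy",
}

def expand(curie):
    """Replace an RDF prefix with its full namespace to build a link.
    Assumes namespace and local part are joined with '/'."""
    prefix, _, local = curie.partition(":")
    namespace = PREFIX_TO_NAMESPACE.get(prefix)
    if namespace is None:
        return curie  # unknown prefix: leave untouched
    return f"{namespace}/{local}"

expand("gn:set_MAGIC_Lines")
# -> "http://rdf.genenetwork.org/v1/id/set_MAGIC_Lines"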

(made with skribilo)