Prompt As UI

Description

  • New search page with a ChatGPT-like UX customised for GN that presents significant hits and their metadata. Goal: a better search experience for GN users. Feed metadata. Move away from xapian search syntax to natural language; hits should return relevant GN links based on the metadata.
  • Build agent/model to improve GN AI experience (related to above).
  • Add pre-compute hits (already in RDF) when available.

Example:

Related tasks:

Tasks

  • [X] (bonfacem) Share existing TTL files with JoM.
  • [X] (johannesm) Build search corpus with phenotype metadata.
  • [X] (johannesm) Design RAG.
  • [X] (johannesm) Create system prompts.
  • [X] (johannesm) Test RAG locally with phenotype-related queries.

We need fan-out to GN-specific pages. This relates to the task marked ⁰ below.

  • [X] (johannesm, bonfacem) Design a proper JSON output format for the system
  • [X] ⁰(johannesm, bonfacem) Teach model how to build link to trait result page in CD
  • [ ] (johannesm, bonfacem) Extend / Modify (directly use sparql) corpus to molecular traits (later genotypes, precompute data and RIFs).
  • [ ] (johannesm, bonfacem) Have a sync on build of xapian search. We want to teach the LLM how to build a xapian search so that the user has a fallback
  • [ ] (johannesm, ! bonfacem) Add REST API for processing answers from AI processing.
  • [-] (pjotr) (WIP) Ask about DNS name to test out AI search in GN2/3; Note: Follow up later.
  • [ ] (johannes, bonfacem) Do UI work to integrate above search in GN mainline. No auth (security: use secret keys, setting up auth - maybe - for later).

Later (Genotypes/ Molecular Traits)

  • [ ] (johannesm, bonface) Add genotype data
  • [ ] (johannesm, bonfacem, pjotrp) Add pre-compute data to RDF / AI-Agent

Build search corpus with phenotype metadata

Given the extensive training of LLMs on text data, a naive approach is to convert the RDF graph related to phenotypes to text. Bonz and I agreed on this.

I wrote something similar some time back for GNAgent:

This script uses ttl files extracted from the SPARQL endpoint. As such, prefixes were replaced with full namespaces in the results.

In our case, I think I can work directly with the ttl files originally generated by Bonz. However, those files keep the prefixes, so I need to adapt the script: remove the namespace handling from the logic and work with prefixes only. The adapted code is at:

Running it on balg01 generated a dictionary where keys are subjects and values are lists of the predicates associated with the corresponding subject. Redundant objects for a given subject were discarded.
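The grouping step can be sketched in plain Python. This is only an illustration, not the adapted script: the sample triples and the function name build_collection are made up, and storing each value as a "predicate object" string is an assumption based on the documents shown further down.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object), kept in prefixed form.
triples = [
    ("gnt:family", "a", "owl:ObjectProperty"),
    ("gnt:family", "rdfs:domain", "gnc:species"),
    ("gnt:family", "rdfs:domain", "gnc:species"),  # duplicate, should be dropped
    ("gnt:family", "rdfs:domain", "gnc:set"),
    ("gnc:set", "skos:member", "gn:set_MAGIC_Lines"),
]


def build_collection(triples):
    """Group 'predicate object' strings by subject, dropping repeats."""
    collection = defaultdict(list)
    for subject, predicate, obj in triples:
        value = f"{predicate} {obj}"  # assumed value format
        if value not in collection[subject]:
            collection[subject].append(value)
    return dict(collection)


collection = build_collection(triples)
```

Keeping insertion order in each list means the sentences built later follow the order of the source file.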

Next, subjects and objects were linked to form English-like sentences with the following logic - at the first build time:

from tqdm import tqdm

docs = []
for key in tqdm(collection):
    # Concatenate one sentence per value of this subject into a single document.
    concat = ""
    for value in collection[key]:
        text = f"{key} is/has {value}. "
        concat += text
    docs.append(concat)

See function corpus_to_docs of:

Documents look like:

gnc:set is/has skos:member gn:set_B6MRLF2_D2MRLF2 .
gnc:set is/has skos:member gn:set_MAGIC_Lines .
gnt:family is/has a owl:ObjectProperty .
gnt:family is/has rdfs:domain gnc:species .
gnt:family is/has skos:definition This resource belongs to this family .
gnt:family is/has rdfs:domain gnc:set .
gnt:short_name is/has a owl:ObjectProperty .

Design RAG

I built a simple RAG system that answers a question based on a corpus. Given the fragility of LLM systems, I leveraged DSPy, which should also make it easy to switch between proprietary and open models.
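To make the retrieve-then-generate idea concrete, here is a toy sketch in plain Python. It is not the DSPy implementation: the token-overlap ranking is a stand-in for whatever retriever the real system uses, and the names tokenize, retrieve and build_prompt are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, docs, k=2):
    """Return the k documents sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(system_prompt, query, context):
    """Assemble the text that would be handed to the language model."""
    joined = "\n".join(context)
    return f"{system_prompt}\n\nContext:\n{joined}\n\nQuery: {query}"


docs = [
    "gnt:family is/has rdfs:domain gnc:species . ",
    "gnc:set is/has skos:member gn:set_MAGIC_Lines . ",
    "gnt:short_name is/has a owl:ObjectProperty . ",
]
query = "Which sets have MAGIC lines as member?"
top = retrieve(query, docs, k=1)
prompt = build_prompt("Answer from the context only.", query, top)
```

Only the retrieved documents reach the model, which keeps the prompt small and ties each answer to text that can be traced back to RDF.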

You can inspect full implementation details at:

Create system prompt

To get the system to return a concise answer and point to specific URLs for verification, the system prompt is probably the most important and dynamic part. My first draft had the following instructions:

You are an expert in biology and genomics. You excel at leveraging the data or context you have been given to address any user query.
Give an accurate and elaborate response to the query below.
In addition, provide links that the users can visit to verify information or dig deeper. To build link you must replace RDF prefixes by namespaces.

Below is the mapping of prefixes and namespaces:
gn => http://rdf.genenetwork.org/v1/id
gnc => http://rdf.genenetwork.org/v1/category
owl => http://www.w3.org/2002/07/owl
gnt => http://rdf.genenetwork.org/v1/term
skos = http://www.w3.org/2004/02/skos/core
xkos => http://rdf-vocabulary.ddialliance.org/xkos
rdf => http://www.w3.org/1999/02/22-rdf-syntax-ns
rdfs => http://www.w3.org/2000/01/rdf-schema
taxon => http://purl.uniprot.org/taxonomy
dcat => http://www.w3.org/ns/dcat
dct => http://purl.org/dc/terms
xsd => http://www.w3.org/2001/XMLSchema
sdmx-measure => http://purl.org/linked-data/sdmx/2009/measure
qb => http://purl.org/linked-data/cube
pubmed => http://rdf.ncbi.nlm.nih.gov/pubmed
v => http://www.w3.org/2006/vcard/ns
foaf => http://xmlns.com/foaf/0.1
geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc

Do not make any mistakes.

I provided the mapping between prefixes and namespaces to teach the model how to generate the URLs. This will probably take a number of experiments :)
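The substitution the model is asked to perform can be written down as a small Python sketch, which is handy for checking the links it returns. The function name expand is hypothetical, only three prefixes from the mapping are included, and joining namespace and local name with "/" is an assumption.

```python
# Subset of the prefix mapping given in the system prompt.
PREFIXES = {
    "gn": "http://rdf.genenetwork.org/v1/id",
    "gnc": "http://rdf.genenetwork.org/v1/category",
    "gnt": "http://rdf.genenetwork.org/v1/term",
}


def expand(prefixed_name, separator="/"):
    """Turn a prefixed name such as 'gn:trait_BXDPublish_16339' into a full URL.

    Assumption: namespace and local name are joined with '/'.
    """
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in PREFIXES or not local:
        raise ValueError(f"cannot expand {prefixed_name!r}")
    return f"{PREFIXES[prefix]}{separator}{local}"


url = expand("gn:trait_BXDPublish_16339")
```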

Design a proper JSON output format for the system

Defining an output format is an extremely useful way to control the system. It should also help pass the output to other tools when the time comes.

Reviewing the options...

- Asking the LLM to format its output as JSON

I could just let the LLM format the output as JSON. But the generated JSON might not be valid.

- Passing the JSON format as an example in the prompt

Another option is to predefine the format of the JSON and pass it in the prompt to the system. However, some models might deviate from the instructions.

- Defining a schema

Finally, I could define an output schema the LLM needs to comply with. DSPy offers an adapter (JSONAdapter) that facilitates its implementation. This works regardless of the model used with the system.
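The weakness of the first two options is that the reply has to parse as-is. A minimal sketch of that failure mode, using made-up replies and a hypothetical helper parse_json_reply:

```python
import json


def parse_json_reply(raw):
    """Return the parsed reply, or None when the text is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Hypothetical replies: the second wraps the JSON in prose, a typical deviation.
good = '{"answer": "BXD is a set", "links": []}'
bad = 'Sure! Here is the JSON: {"answer": "BXD is a set", "links": []}'

parsed_good = parse_json_reply(good)
parsed_bad = parse_json_reply(bad)
```

A caller then has to decide what to do with None (retry, repair, give up), which is exactly the handling a schema-based adapter is meant to take over.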

I decided to go for the last option because of its robustness. I created a schema using pydantic BaseModel and used it with the DSPy predictor as below:

from typing import List

import dspy
from pydantic import BaseModel, Field

class Information(BaseModel):
    """Extract relevant information for query"""
    answer: str = Field(description="Specific point addressing the query from the context")
    links: List[str] = Field(description="All links associated to RDF entities related to the point")

class ListInformation(BaseModel):
    """Address recursively a query"""
    detailed_answers: List[Information] = Field(description="List of answers to the query")
    final_answer: str = Field(description="Synthesized and comprehensive answer using detailed answers")

class Generate(dspy.Signature):
    """Wrap generation interface"""
    context: list = dspy.InputField(desc="Background information")
    input_text: str = dspy.InputField(desc="Query and instructions")
    feedback: ListInformation = dspy.OutputField(desc="System response to the query")

See the workings at:

I also iterated on the system prompt. Now it is:

You excel at addressing search query using the context you have. You do not mistakes.
Extract answers to the query from the context and provide links associated with each RDF entity.
To build links you must replace RDF prefixes by namespaces.
Here is the mapping of prefixes and namespaces:
gn => http://rdf.genenetwork.org/v1/id
gnc => http://rdf.genenetwork.org/v1/category
owl => http://www.w3.org/2002/07/owl
gnt => http://rdf.genenetwork.org/v1/term
skos = http://www.w3.org/2004/02/skos/core
xkos => http://rdf-vocabulary.ddialliance.org/xkos
rdf => http://www.w3.org/1999/02/22-rdf-syntax-ns
rdfs => http://www.w3.org/2000/01/rdf-schema
taxon => http://purl.uniprot.org/taxonomy
dcat => http://www.w3.org/ns/dcat
dct => http://purl.org/dc/terms
xsd => http://www.w3.org/2001/XMLSchema
sdmx-measure => http://purl.org/linked-data/sdmx/2009/measure
qb => http://purl.org/linked-data/cube
pubmed => http://rdf.ncbi.nlm.nih.gov/pubmed
v => http://www.w3.org/2006/vcard/ns
foaf => http://xmlns.com/foaf/0.1
geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc
\n

Teach model how to build link to trait result page in CD

From conversations with Bonz, the trait id and the dataset name are encoded in the RDF identifier and also determine the link to the result page. We can leverage that to teach the system how to build a CD link when it has access to the trait id and the dataset name for a specific trait in GN.

RDF encodes a trait as "dataset name" + "trait id" under a specific namespace.

Example: https://rdf.genenetwork.org/v1/id/trait_BXDPublish_16339

This specific trait has an id of 16339 in the BXDPublish GN table (dataset name).

The corresponding link in CD to the Trait result page is: https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXDPublish

It is just a matter of replacing the trait id and the dataset name in the URL parameters.
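Written as plain Python, the substitution looks like the sketch below. It is meant as a reference to compare model output against, not as part of the system; the function name rdf_trait_to_cd_link is hypothetical, and treating everything after the last underscore as the trait id is an assumption.

```python
from urllib.parse import urlencode, urlparse

CD_BASE = "https://cd.genenetwork.org/show_trait"


def rdf_trait_to_cd_link(rdf_url):
    """Translate an RDF trait URL into a CD trait result page link."""
    local = urlparse(rdf_url).path.rsplit("/", 1)[-1]
    if not local.startswith("trait_"):
        raise ValueError(f"not a trait URL: {rdf_url}")
    # Assumption: the trait id is the part after the last underscore,
    # and the dataset name is everything between 'trait_' and that id.
    dataset, _, trait_id = local[len("trait_"):].rpartition("_")
    if not dataset or not trait_id:
        raise ValueError(f"cannot split dataset and trait id: {rdf_url}")
    return f"{CD_BASE}?{urlencode({'trait_id': trait_id, 'dataset': dataset})}"


link = rdf_trait_to_cd_link("https://rdf.genenetwork.org/v1/id/trait_BXDPublish_16339")
```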

The only ways I can think of to get an LLM to make that substitution in the URL are prompt engineering and model finetuning. Finetuning seems excessive given that we only want to modify a few links in the output, and it is very expensive. Prompt engineering, on the other hand, is quick to implement. I am going to provide examples like the one above in the system prompt to help the LLM translate RDF links for traits into valid CD links to the trait result page.

New system prompt:

You excel at addressing search query using the context you have. You do not make mistakes.
Extract answers to the query from the context and provide links associated with each RDF entity.
To build links you must replace RDF prefixes by namespaces.
Here is the mapping of prefixes and namespaces:
gn => http://rdf.genenetwork.org/v1/id
gnc => http://rdf.genenetwork.org/v1/category
owl => http://www.w3.org/2002/07/owl
gnt => http://rdf.genenetwork.org/v1/term
skos = http://www.w3.org/2004/02/skos/core
xkos => http://rdf-vocabulary.ddialliance.org/xkos
rdf => http://www.w3.org/1999/02/22-rdf-syntax-ns
rdfs => http://www.w3.org/2000/01/rdf-schema
taxon => http://purl.uniprot.org/taxonomy
dcat => http://www.w3.org/ns/dcat
dct => http://purl.org/dc/terms
xsd => http://www.w3.org/2001/XMLSchema
sdmx-measure => http://purl.org/linked-data/sdmx/2009/measure
qb => http://purl.org/linked-data/cube
pubmed => http://rdf.ncbi.nlm.nih.gov/pubmed
v => http://www.w3.org/2006/vcard/ns
foaf => http://xmlns.com/foaf/0.1
geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc

Link pointing to specific trait should be translated to CD links using the trait id and the dataset name.
Original trait link: https://rdf.genenetwork.org/v1/id/trait_BXDPublish_16339
Trait id: 16339
Dataset name: BXDPublish
New trait link: https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXDPublish
\n

This was enough to get the system to return valid CD links with Claude models :)

Here is an example.

Query: What are the traits related to the BXD?

System feedback:
{
    "detailed_answers": [
        {
            "answer": "BXD_16337 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 1 to 2 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 15.27684536576271 mmHg.",                                                                                                                   
            "links": [
                "https://cd.genenetwork.org/show_trait?trait_id=16337&dataset=BXDPublish"
            ]
        },
        {
            "answer": "BXD_16338 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 2.1 to 5 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 16.091910903133563 mmHg.",                                                                                                                
            "links": [
                "https://cd.genenetwork.org/show_trait?trait_id=16338&dataset=BXDPublish"
            ]
        },
        {
            "answer": "BXD_16339 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 5.1 to 9 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 16.563036555975255 mmHg.",                                                                                                                
            "links": [
                "https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXDPublish"
            ]
        },
        {
            "answer": "BXD_16340 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 9.1 to 13 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 15.659858761411725 mmHg.",                                                                                                               
            "links": [
                "https://cd.genenetwork.org/show_trait?trait_id=16340&dataset=BXDPublish"
            ]
        },
        {
            "answer": "BXD_16342 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) across all ages (1 to 30 months old) in BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 15.797181273159916 mmHg.",                                                                                               
            "links": [
                "https://cd.genenetwork.org/show_trait?trait_id=16342&dataset=BXDPublish"
            ]
        }
    ],
    "final_answer": "The BXDPublish dataset contains multiple traits related to Intraocular pressure (IOP) measurements in BXD mice. These include: BXD_16337 (IOP in 1-2 months old mice, mean 15.28 mmHg), BXD_16338 (IOP in 2.1-5 months old mice, mean 16.09 mmHg), BXD_16339 (IOP in 5.1-9 months old mice, mean 16.56 mmHg), BXD_16340 (IOP in 9.1-13 months old mice, mean 15.66 mmHg), and BXD_16342 (IOP across all ages 1-30 months, mean 15.80 mmHg). All measurements are from both sexes and represent averages of left and right eyes."                     
}
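Because the feedback follows a fixed shape, other tools can consume it directly. A minimal sketch, assuming the feedback arrives as a JSON string; the helper collect_links and the shortened sample are hypothetical.

```python
import json

REQUIRED = {"detailed_answers", "final_answer"}


def collect_links(feedback_json):
    """Parse the system feedback and return its links, de-duplicated, in order."""
    feedback = json.loads(feedback_json)
    missing = REQUIRED - feedback.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    links = []
    for item in feedback["detailed_answers"]:
        for link in item["links"]:
            if link not in links:
                links.append(link)
    return links


# Shortened, hypothetical feedback with the same shape as the example above.
sample = json.dumps({
    "detailed_answers": [
        {"answer": "first trait", "links": ["https://cd.genenetwork.org/show_trait?trait_id=16337&dataset=BXDPublish"]},
        {"answer": "second trait", "links": ["https://cd.genenetwork.org/show_trait?trait_id=16338&dataset=BXDPublish"]},
    ],
    "final_answer": "two traits",
})
links = collect_links(sample)
```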
