Example:
Related tasks:
We need fan-out to GN-specific pages. Related to task listed below in ⁰
Given that LLMs are trained extensively on text data, a naive approach is to convert the phenotype-related RDF graph to text. Bonz and I agreed on this.
I wrote something similar some time back for GNAgent:
This script uses ttl files extracted from the SPARQL endpoint. As such, prefixes were replaced with full namespaces in the results.
In our case, I think I can work directly with the ttl files originally generated by Bonz. The script needs adapting for the reason mentioned above, though: the logic has to drop full namespaces and work only with prefixes. The adapted code is at:
Running it on balg01 generated a dictionary in which each key is a subject and each value is the list of predicates associated with that subject. Redundant objects for a given subject were discarded.
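A minimal sketch of this grouping step, not the actual script. It assumes the triples have already been parsed into (subject, predicate, object) tuples that still carry their prefixes, and that each value is stored as a "predicate object" string. The name group_by_subject is hypothetical, and parsing the ttl files is left out.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Map each subject to its list of unique "predicate object" strings."""
    collection = defaultdict(list)
    for subject, predicate, obj in triples:
        value = f"{predicate} {obj}"
        # Discard redundant entries for the same subject, keeping first-seen order.
        if value not in collection[subject]:
            collection[subject].append(value)
    return dict(collection)


# Toy input (assumed format): prefixed names, one tuple per triple.
triples = [
    ("gnc:set", "skos:member", "gn:set_MAGIC_Lines"),
    ("gnc:set", "skos:member", "gn:set_MAGIC_Lines"),  # duplicate, dropped
    ("gnt:family", "rdfs:domain", "gnc:species"),
]
collection = group_by_subject(triples)
print(collection)
```

A list is used instead of a set so that the order of predicates stays stable between builds, which keeps the generated text reproducible.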
Next, at first build time, subjects and objects were linked into English-like sentences with the following logic:
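from tqdm import tqdm

docs = []
for key in tqdm(collection):  # collection: the subject -> predicates dictionary
    concat = ""
    for value in collection[key]:
        # One English-like sentence per value attached to the subject
        text = f"{key} is/has {value}. "
        concat += text
    docs.append(concat)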
See function corpus_to_docs of:
Documents look like:
gnc:set is/has skos:member gn:set_B6MRLF2_D2MRLF2 . gnc:set is/has skos:member gn:set_MAGIC_Lines . gnt:family is/has a owl:ObjectProperty . gnt:family is/has rdfs:domain gnc:species . gnt:family is/has skos:definition This resource belongs to this family . gnt:family is/has rdfs:domain gnc:set . ", "gnt:short_name is/has a owl:ObjectProperty .
I built a simple RAG system that answers a question based on a corpus. Given how fragile LLM systems are, I used DSPy. This should also make it easy to switch between proprietary and open models.
You can inspect full implementation details at:
To get the system to return a concise answer and point to specific URLs for verification, the system prompt is probably the most important and dynamic part. My first draft had the following instructions:
You are an expert in biology and genomics. You excel at leveraging the data or context you have been given to address any user query. Give an accurate and elaborate response to the query below. In addition, provide links that the users can visit to verify information or dig deeper. To build link you must replace RDF prefixes by namespaces. Below is the mapping of prefixes and namespaces: gn => http://rdf.genenetwork.org/v1/id gnc => http://rdf.genenetwork.org/v1/category owl => http://www.w3.org/2002/07/owl gnt => http://rdf.genenetwork.org/v1/term skos = http://www.w3.org/2004/02/skos/core xkos => http://rdf-vocabulary.ddialliance.org/xkos rdf => http://www.w3.org/1999/02/22-rdf-syntax-ns rdfs => http://www.w3.org/2000/01/rdf-schema taxon => http://purl.uniprot.org/taxonomy dcat => http://www.w3.org/ns/dcat dct => http://purl.org/dc/terms xsd => http://www.w3.org/2001/XMLSchema sdmx-measure => http://purl.org/linked-data/sdmx/2009/measure qb => http://purl.org/linked-data/cube pubmed => http://rdf.ncbi.nlm.nih.gov/pubmed v => http://www.w3.org/2006/vcard/ns foaf => http://xmlns.com/foaf/0.1 geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc Do not make any mistakes.
I provided the mapping between prefixes and namespaces to teach the model how to generate the URLs. I will probably need to run a number of experiments :)
It is extremely useful to control the system by defining an output format. This should also help pass the output to other tools when the time comes.
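The substitution the prompt asks for can also be written deterministically, which could be handy for checking the links the model returns. This is only a sketch: the function name expand is hypothetical, only the GN-specific prefixes are included, and joining the namespace and local name with "/" is an assumption (other vocabularies such as owl or rdfs conventionally use "#").

```python
# Subset of the prefix mapping from the prompt (GN-specific prefixes only).
NAMESPACES = {
    "gn": "http://rdf.genenetwork.org/v1/id",
    "gnc": "http://rdf.genenetwork.org/v1/category",
    "gnt": "http://rdf.genenetwork.org/v1/term",
}


def expand(prefixed_name, namespaces=NAMESPACES):
    """Turn "prefix:local" into a full URL; leave unknown prefixes untouched."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in namespaces:
        return prefixed_name
    # Assumption: namespace and local name are joined with "/".
    return f"{namespaces[prefix]}/{local}"


print(expand("gn:set_MAGIC_Lines"))
print(expand("unknown:thing"))
```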
Reviewing the options...
- Asking the LLM to format its output as JSON is one way
I could just let the LLM format the output as JSON, but the generated JSON might not be valid.
- Passing JSON format as example in prompt
Another option is to predefine the JSON format and pass it to the system in the prompt. However, some models might deviate from the instructions.
- Defining a schema
Finally, I could define an output schema that the LLM must comply with. DSPy offers an adapter (JSONAdapter) that makes this easy to implement, and it works regardless of the model used with the system.
I went for the last option because it is the most robust. I created a schema using pydantic's BaseModel and used it with the DSPy predictor as below:
from typing import List

import dspy
from pydantic import BaseModel, Field

class Information(BaseModel):
    """Extract relevant information for query"""
    answer: str = Field(description="Specific point addressing the query from the context")
    links: List[str] = Field(description="All links associated to RDF entities related to the point")

class ListInformation(BaseModel):
    """Address recursively a query"""
    detailed_answers: List[Information] = Field(description="List of answers to the query")
    final_answer: str = Field(description="Synthesized and comprehensive answer using detailed answers")

class Generate(dspy.Signature):
    """Wrap generation interface"""
    context: list = dspy.InputField(desc="Background information")
    input_text: str = dspy.InputField(desc="Query and instructions")
    feedback: ListInformation = dspy.OutputField(desc="System response to the query")
See the workings at:
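To make the shape of this schema concrete, here is a standard-library sketch of the checks it implies, independent of DSPy and pydantic. It is not how the system validates output; the function name check_feedback and the sample input are made up for illustration.

```python
import json


def check_feedback(raw):
    """Parse raw JSON text and verify it matches the ListInformation shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if invalid
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if not isinstance(data.get("final_answer"), str):
        raise ValueError("final_answer must be a string")
    answers = data.get("detailed_answers")
    if not isinstance(answers, list):
        raise ValueError("detailed_answers must be a list")
    for item in answers:
        if not isinstance(item, dict) or not isinstance(item.get("answer"), str):
            raise ValueError("each detailed answer needs a string 'answer'")
        links = item.get("links")
        if not isinstance(links, list) or not all(isinstance(x, str) for x in links):
            raise ValueError("each detailed answer needs a list of string 'links'")
    return data


# Made-up sample in the expected shape.
raw = '{"detailed_answers": [{"answer": "a", "links": ["http://example.org"]}], "final_answer": "a"}'
feedback = check_feedback(raw)
print(len(feedback["detailed_answers"]))
```

Writing these checks by hand for every field is exactly the tedium that the schema approach removes.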
I also iterated on the system prompt. Now it is:
You excel at addressing search query using the context you have. You do not mistakes. Extract answers to the query from the context and provide links associated with each RDF entity. To build links you must replace RDF prefixes by namespaces. Here is the mapping of prefixes and namespaces: gn => http://rdf.genenetwork.org/v1/id gnc => http://rdf.genenetwork.org/v1/category owl => http://www.w3.org/2002/07/owl gnt => http://rdf.genenetwork.org/v1/term skos = http://www.w3.org/2004/02/skos/core xkos => http://rdf-vocabulary.ddialliance.org/xkos rdf => http://www.w3.org/1999/02/22-rdf-syntax-ns rdfs => http://www.w3.org/2000/01/rdf-schema taxon => http://purl.uniprot.org/taxonomy dcat => http://www.w3.org/ns/dcat dct => http://purl.org/dc/terms xsd => http://www.w3.org/2001/XMLSchema sdmx-measure => http://purl.org/linked-data/sdmx/2009/measure qb => http://purl.org/linked-data/cube pubmed => http://rdf.ncbi.nlm.nih.gov/pubmed v => http://www.w3.org/2006/vcard/ns foaf => http://xmlns.com/foaf/0.1 geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc \n
From conversations with Bonz, the trait id and the dataset name are encoded in the RDF identifier, and together they determine the link to the trait result page. We can leverage that to teach the system how to build a CD link whenever it has access to the trait id and the dataset name for a specific trait in GN.
RDF encodes a trait as "dataset name" + "trait id" under a specific namespace.
Example: https://rdf.genenetwork.org/v1/id/trait_BXDPublish_16339
This specific trait has an id of 16339 in the BXDPublish GN table (dataset name).
The corresponding link in CD to the Trait result page is: https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXDPublish
It is just a matter of replacing the trait id and the dataset name in the URL parameters.
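For reference, the same substitution written as plain code. This is a sketch under assumptions, not something the system runs: the name rdf_trait_to_cd_link is hypothetical, and it assumes the trait id is everything after the last underscore, with the dataset name sitting between "trait_" and that id.

```python
from urllib.parse import urlencode, urlparse

CD_SHOW_TRAIT = "https://cd.genenetwork.org/show_trait"


def rdf_trait_to_cd_link(rdf_url):
    """Convert an RDF trait URL into the matching CD trait result page link."""
    local_name = urlparse(rdf_url).path.rsplit("/", 1)[-1]
    if not local_name.startswith("trait_"):
        raise ValueError(f"not a trait URL: {rdf_url}")
    # Assumption: "<dataset>_<trait id>", split on the LAST underscore so that
    # dataset names containing underscores still work.
    dataset, sep, trait_id = local_name[len("trait_"):].rpartition("_")
    if not sep or not dataset or not trait_id:
        raise ValueError(f"cannot split dataset and trait id: {rdf_url}")
    query = urlencode({"trait_id": trait_id, "dataset": dataset})
    return f"{CD_SHOW_TRAIT}?{query}"


link = rdf_trait_to_cd_link("https://rdf.genenetwork.org/v1/id/trait_BXDPublish_16339")
print(link)
```

A helper like this could also serve as a check on the links the LLM produces.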
The only ways (at least those I can think of) to get an LLM to make that substitution in the URL are prompt engineering and model finetuning. Finetuning seems excessive given that we only want to modify a few links in the output, and it is also very expensive. Prompt engineering, on the other hand, is quick to implement. I am going to provide examples like the previous one in the system prompt to help the LLM translate RDF links for traits into valid CD links to the trait result page.
New system prompt:
You excel at addressing search query using the context you have. You do not make mistakes. Extract answers to the query from the context and provide links associated with each RDF entity. To build links you must replace RDF prefixes by namespaces. Here is the mapping of prefixes and namespaces: gn => http://rdf.genenetwork.org/v1/id gnc => http://rdf.genenetwork.org/v1/category owl => http://www.w3.org/2002/07/owl gnt => http://rdf.genenetwork.org/v1/term skos = http://www.w3.org/2004/02/skos/core xkos => http://rdf-vocabulary.ddialliance.org/xkos rdf => http://www.w3.org/1999/02/22-rdf-syntax-ns rdfs => http://www.w3.org/2000/01/rdf-schema taxon => http://purl.uniprot.org/taxonomy dcat => http://www.w3.org/ns/dcat dct => http://purl.org/dc/terms xsd => http://www.w3.org/2001/XMLSchema sdmx-measure => http://purl.org/linked-data/sdmx/2009/measure qb => http://purl.org/linked-data/cube pubmed => http://rdf.ncbi.nlm.nih.gov/pubmed v => http://www.w3.org/2006/vcard/ns foaf => http://xmlns.com/foaf/0.1 geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc Link pointing to specific trait should be translated to CD links using the trait id and the dataset name. Original trait link: https://rdf.genenetwork.org/v1/id/trait_BXDPublish_16339 Trait id: 16339 Dataset name: BXDPublish New trait link: https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXDPublish \n
This was enough to get the system to return valid CD links with Claude models :)
Here is an example.
Query: What are the traits related to the BXD?
System feedback:
{
  "detailed_answers": [
    {
      "answer": "BXD_16337 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 1 to 2 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 15.27684536576271 mmHg.",
      "links": [
        "https://cd.genenetwork.org/show_trait?trait_id=16337&dataset=BXDPublish"
      ]
    },
    {
      "answer": "BXD_16338 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 2.1 to 5 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 16.091910903133563 mmHg.",
      "links": [
        "https://cd.genenetwork.org/show_trait?trait_id=16338&dataset=BXDPublish"
      ]
    },
    {
      "answer": "BXD_16339 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 5.1 to 9 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 16.563036555975255 mmHg.",
      "links": [
        "https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXDPublish"
      ]
    },
    {
      "answer": "BXD_16340 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) in 9.1 to 13 months old BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 15.659858761411725 mmHg.",
      "links": [
        "https://cd.genenetwork.org/show_trait?trait_id=16340&dataset=BXDPublish"
      ]
    },
    {
      "answer": "BXD_16342 is a trait related to BXDPublish dataset. It measures Intraocular pressure (IOP) across all ages (1 to 30 months old) in BXD mice of both sexes, with an average of left and right eyes. The trait has a mean value of 15.797181273159916 mmHg.",
      "links": [
        "https://cd.genenetwork.org/show_trait?trait_id=16342&dataset=BXDPublish"
      ]
    }
  ],
  "final_answer": "The BXDPublish dataset contains multiple traits related to Intraocular pressure (IOP) measurements in BXD mice. These include: BXD_16337 (IOP in 1-2 months old mice, mean 15.28 mmHg), BXD_16338 (IOP in 2.1-5 months old mice, mean 16.09 mmHg), BXD_16339 (IOP in 5.1-9 months old mice, mean 16.56 mmHg), BXD_16340 (IOP in 9.1-13 months old mice, mean 15.66 mmHg), and BXD_16342 (IOP across all ages 1-30 months, mean 15.80 mmHg). All measurements are from both sexes and represent averages of left and right eyes."
}
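Since the output is structured, other tools can consume it directly. As a sketch of what that could look like, the snippet below pulls the (trait id, dataset) pairs out of the CD links in a response. The function name trait_links is hypothetical, and the sample is an abbreviated copy of the feedback above with the answer text shortened.

```python
import json
from urllib.parse import parse_qs, urlparse


def trait_links(feedback):
    """Collect (trait_id, dataset) pairs from every CD show_trait link."""
    pairs = []
    for item in feedback["detailed_answers"]:
        for link in item["links"]:
            parsed = urlparse(link)
            # Skip anything that is not a CD trait result page link.
            if parsed.netloc != "cd.genenetwork.org" or parsed.path != "/show_trait":
                continue
            params = parse_qs(parsed.query)
            if "trait_id" in params and "dataset" in params:
                pairs.append((params["trait_id"][0], params["dataset"][0]))
    return pairs


# Abbreviated copy of the system feedback (answer text shortened).
raw = """
{
  "detailed_answers": [
    {"answer": "IOP, 1 to 2 months",
     "links": ["https://cd.genenetwork.org/show_trait?trait_id=16337&dataset=BXDPublish"]},
    {"answer": "IOP, 2.1 to 5 months",
     "links": ["https://cd.genenetwork.org/show_trait?trait_id=16338&dataset=BXDPublish"]}
  ],
  "final_answer": "IOP traits in BXDPublish"
}
"""
pairs = trait_links(json.loads(raw))
print(pairs)
```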