Large Language Models (LLMs) & Metadata
- assigned: soloshelby, priscilla
- contact: bonfacem
- keywords: gnsoc, LLMs, metadata
Integrate an LLM Q&A system into gn.genenetwork.org
This development will be done in stages:
- [X] 1 - get API access to FahamuAI GeneNetwork Q&A system
- [X] 2 - create local Python Flask sandbox
- [X] 3 - build placeholder UI
- [X] 3.5 - integrate FahamuAI API into placeholder @ github.com/ShelbySolomonDarnell/GN-LLMs
- [X] 4 - create guix package for Flask site
- [X] 5 - serve GNQA on tux02
- [-] 5.5 - get feedback from testers of GNQA
- [ ] 6 - improve GNQA after evaluating feedback
- [ ] 6.5 - add reference rating to GNQA
- [X] 7 - create UI for Q&A window that fits into current GN framework
- [X] 8 - put GNQA GN UI on cd.genenetwork.org for internal researcher access
- [ ] 9 - create CI/CD tests for new module
- [-] 10 - integrate new functionality into GN{2-3}
- [ ] 11 - use db to save queries with answers and references for users
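Stage 11 (saving queries with their answers and references) could be prototyped along the following lines. This is only a minimal sketch using Python's built-in `sqlite3`; the table name `gnqa_history`, its columns, and the helper names are hypothetical and do not reflect any existing GN schema or database choice.

```python
import json
import sqlite3


def init_db(conn):
    """Create the (hypothetical) history table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS gnqa_history (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT NOT NULL,
               query TEXT NOT NULL,
               answer TEXT NOT NULL,
               refs TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )


def save_interaction(conn, user_id, query, answer, references):
    """Store one question, its answer, and its references (as a JSON string)."""
    cur = conn.execute(
        "INSERT INTO gnqa_history (user_id, query, answer, refs) VALUES (?, ?, ?, ?)",
        (user_id, query, answer, json.dumps(references)),
    )
    conn.commit()
    return cur.lastrowid


def history_for_user(conn, user_id):
    """Return a user's saved interactions, oldest first."""
    rows = conn.execute(
        "SELECT query, answer, refs FROM gnqa_history WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
    return [{"query": q, "answer": a, "references": json.loads(r)} for q, a, r in rows]
```

Storing the references as JSON keeps the sketch to a single table; a separate references table would likely be a better fit once reference rating (stage 6.5) is added.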
Tasks for Priscilla
- [X] 1 - Acquire 1000 research documents w.r.t. genetics and genomics research on diabetes
- [X] 2 - Acquire 1000 more research documents w.r.t. GeneNetwork.org
- [X] 3 - Get bibliographic data for documents and put it in JSON format
Tasks for collaborator
- [X] Build feedback into API for qualifying/rating references and answers
- [X] Associate references with their titles rather than their document ids
- [ ] Build better document referencing using a document's bibliographic information
Add GN metadata
- [ ] export GN RDF triples
- [ ] convert data of triples into plain English sentences
- [ ] submit triples-based sentences to Q&A LLM
- [ ] submit RDF metadata to an Oracle to support Q&A system truthfulness
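The triple-to-sentence step above could start from something like the sketch below. It is an illustration only: it assumes each triple arrives as three strings, the URIs shown in the usage note are made-up `example.org` placeholders rather than real GN RDF terms, and the function names are hypothetical.

```python
import re


def local_name(term):
    """Return the last path or fragment segment of a URI, or the term itself."""
    term = term.strip("<>").rstrip("/")
    return re.split(r"[/#]", term)[-1] if term.startswith("http") else term


def humanize(name):
    """Split camelCase and snake_case names into lowercase words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name).replace("_", " ")
    return spaced.lower()


def triple_to_sentence(subject, predicate, obj):
    """Render one (subject, predicate, object) triple as a plain English sentence."""
    return f"{local_name(subject)} {humanize(local_name(predicate))} {local_name(obj)}."
```

For example, `triple_to_sentence("http://example.org/dataset/D1", "http://example.org/term/hasSpecies", "Mouse")` yields `D1 has species Mouse.` Real predicates will probably need hand-written templates rather than name splitting alone.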
Set up system update protocol
These are all living systems that must be kept up-to-date. GN is consistently being used for research and we are improving its design and functionality to make this statement perpetually true. In order to keep the Q&A system up-to-date we must:
- [ ] create protocol to get new publications
- [ ] query web for new publications utilizing GN
- [ ] pull links to the newly found documents
- [ ] acquire the documents
- [ ] process documents for LLM
The National Library of Medicine's PubMed is a National Institutes of Health (NIH) system and one of the most widely used resources for researchers. PubMed is consistently updated by the NIH, so we must build a script to:
- [ ] poll its API on a regular basis
- [ ] download new citations
- [ ] parse citations and metadata for input into LLM
- [ ] upload new data into LLM
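The polling step could be sketched as follows, assuming the script uses the NCBI E-utilities `esearch` endpoint with JSON output. The search term, the seven-day window, and the function names are assumptions for illustration; scheduling, API keys, rate limiting, and the upload into the LLM are left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, days_back=7, retmax=100):
    """Build an esearch query for PubMed records entered in the last `days_back` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "datetype": "edat",  # filter on the date the record entered PubMed
        "reldate": days_back,
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(payload):
    """Extract the list of PubMed ids from an esearch JSON response body."""
    return json.loads(payload)["esearchresult"]["idlist"]


def new_pmids(term, seen, days_back=7):
    """Query PubMed and return only the ids that are not already in `seen`."""
    with urlopen(build_esearch_url(term, days_back), timeout=30) as resp:
        pmids = parse_pmids(resp.read().decode("utf-8"))
    return [pmid for pmid in pmids if pmid not in seen]
```

A periodic job could call `new_pmids("GeneNetwork", seen_ids)` and hand the resulting ids to a separate download and parsing step.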
Keeping the main information sources for the GeneNetwork Q&A system up to date ensures that the system grows with the knowledge base.
Add functionality that allows someone to submit documentation to the system, which is added after being reviewed by a specialist.
Improvement suggestions from CTC
- [ ] Ontology annotations in GN
- [ ] Gene prioritization
- [ ] Improving PDF-to-text algorithm
Build Lisp tools for PDF document processing. A probable library to use can be found at https://github.com/archimag/cl-pdf
- [X] read directory structure
- [ ] filter pdf files
- [ ] call library to extract text from pdf
- [ ] create rules to remove headings, references, and appendices
- [ ] create json document with extracted data with bibliographical information
- [ ] take file text and run through a tokenizer
- [ ] make string tokenizer plug-n-play
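The shape of this pipeline can be sketched as below. Note that this sketch is in Python for illustration, not in Lisp as planned above, and it does not call any PDF library: text extraction is assumed to have happened already. The heading pattern is a rough heuristic that only cuts back matter (references, bibliography, appendices), and all function names and the JSON layout are hypothetical.

```python
import json
import re
from pathlib import Path

# Heuristic: a line consisting only of a back-matter heading, e.g. "References" or "Appendix A".
BACK_MATTER = re.compile(
    r"^\s*(?:references|bibliography|appendi(?:x|ces)(?:\s+\w{1,3})?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_pdfs(root):
    """Walk the directory tree and keep only files with a .pdf extension."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


def strip_back_matter(text):
    """Drop everything from the first back-matter heading onward."""
    match = BACK_MATTER.search(text)
    return text[: match.start()].rstrip() if match else text


def whitespace_tokenizer(text):
    """Default tokenizer; any callable taking a str and returning a list can replace it."""
    return text.split()


def build_record(text, bib, tokenizer=whitespace_tokenizer):
    """Combine cleaned text, tokens, and bibliographic data into one JSON-ready dict."""
    body = strip_back_matter(text)
    return {"bibliography": bib, "text": body, "tokens": tokenizer(body)}


def to_json(record):
    """Serialize a record for storage or upload."""
    return json.dumps(record, ensure_ascii=False)
```

Passing the tokenizer as a parameter is one way to make it plug-n-play: `build_record(text, bib, tokenizer=my_tokenizer)` swaps it without touching the rest of the pipeline.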