In this repository, we demonstrate how to ask natural language questions about proteins on the Universal Protein Resource (UniProt) dataset. UniProt is a graph dataset defined using the Resource Description Framework (RDF). We can query it with RDF's SPARQL query language against a triple store that holds the UniProt data.
We demonstrate how to enable users who have domain knowledge of proteins (but are not necessarily developers proficient in SPARQL) to ask natural language questions about proteins and use a large language model (LLM) to convert the question to a SPARQL query.
For example, if the user asks "Select all bacterial taxa and their scientific names from the UniProt taxonomy", we attempt to have the LLM generate a SPARQL query like the following:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?taxon ?name
WHERE
{
    ?taxon a up:Taxon .
    ?taxon up:scientificName ?name .
    ?taxon rdfs:subClassOf taxon:2 .   # taxon 2 is Bacteria in the NCBI taxonomy
}
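To see what such a query returns, you can run it directly against the public UniProt SPARQL endpoint. Below is a minimal sketch using the SPARQLWrapper Python library; it is not part of this repo's notebooks and is shown only for illustration (the LIMIT is added here to keep the example fast):

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?taxon ?name
WHERE {
    ?taxon a up:Taxon .
    ?taxon up:scientificName ?name .
    ?taxon rdfs:subClassOf taxon:2 .
}
LIMIT 10
"""

# Public UniProt reference endpoint; no authentication required.
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["taxon"]["value"], row["name"]["value"])
```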
We use the following design.
The user asks a question in an Amazon SageMaker notebook instance. The notebook generates the SPARQL query using an LLM (Anthropic Claude) via Amazon Bedrock. The notebook then runs that query against a UniProt database: either the UniProt reference SPARQL endpoint or your own Amazon Neptune database loaded with UniProt data.
To generate the query, we prompt the LLM with the question plus a set of ground-truth examples and tips; see the resources folder for the tips, ground truth, and prompts. We use a few-shot approach: we give the LLM several examples of questions paired with their correct SPARQL queries, and we expect the LLM to use these to write correct SPARQL for other UniProt questions.
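To illustrate the few-shot pattern, here is a hedged sketch using the Bedrock Converse API; the prompt text and the single example below are simplified placeholders for the actual prompts and examples in the resources folder:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")  # use a region where you have Claude access

# Simplified placeholder example; the real examples come from resources/ground-truth.yaml.
FEW_SHOT = """Question: Select all bacterial taxa and their scientific names from the UniProt taxonomy
SPARQL:
SELECT ?taxon ?name
WHERE { ?taxon a up:Taxon ; up:scientificName ?name ; rdfs:subClassOf taxon:2 . }"""

def generate_sparql(question: str) -> str:
    # Few-shot prompt: instructions, worked examples, then the new question.
    prompt = (
        "You translate natural language questions about UniProt into SPARQL.\n\n"
        f"{FEW_SHOT}\n\n"
        f"Question: {question}\nSPARQL:"
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```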
To set up this solution, you need an AWS account with permission to provision a SageMaker notebook instance, Bedrock models, a Neptune cluster, and an Amazon Simple Storage Service (Amazon S3) bucket.
In your AWS console, open the Bedrock console and request model access for Claude 3.5 Sonnet under Anthropic. For instructions on requesting model access, see https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html.
Check back until the model shows as Access granted.
Create an S3 bucket in the same account and region in which you deploy the other resources. This bucket is used to stage UniProt data for loading into the Neptune database. If you don't intend to use the Neptune database, you may skip this step.
Follow the instructions in https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html. The bucket may be private and use default encryption. Take note of your bucket name and ARN for the upcoming deployment steps.
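If you prefer to create the bucket programmatically, here is a minimal boto3 sketch; the bucket name my-uniprot-data is a placeholder and must be globally unique:

```python
import boto3

# Placeholder name; S3 bucket names must be globally unique.
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-uniprot-data")
# Outside us-east-1, a location constraint is required, for example:
# s3.create_bucket(Bucket="my-uniprot-data",
#                  CreateBucketConfiguration={"LocationConstraint": "eu-west-1"})
```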
Create a Neptune cluster and a notebook instance. One way to set up these resources is with the CloudFormation template described in https://docs.aws.amazon.com/neptune/latest/userguide/get-started-cfn-create.html. We recommend a `NotebookInstanceType` of `ml.t3.medium` or higher. If you don't intend to use the Neptune database, you may skip this step.
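If you prefer to launch the stack programmatically instead of through the console, here is a hedged boto3 sketch; the stack name is hypothetical, and the template URL should come from the Neptune documentation linked above:

```python
import boto3

cfn = boto3.client("cloudformation")
cfn.create_stack(
    StackName="neptune-uniprot",  # hypothetical stack name
    TemplateURL="https://...",    # use the template URL from the Neptune docs above
    Parameters=[
        {"ParameterKey": "NotebookInstanceType", "ParameterValue": "ml.t3.medium"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template creates IAM resources
)
```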
We use Jupyter as our test client. If you set up a Neptune cluster, a SageMaker notebook instance has already been created for you, but additional setup steps are required. If you did not set up a Neptune cluster, you can provision a SageMaker notebook instance or install Jupyter in a non-SageMaker environment.
In the SageMaker console, locate the notebook instance that was created by the Neptune cluster CloudFormation stack. Find its IAM role under Permissions and encryption on the notebook's details page. Select that role and add the following IAM policies (a sketch for attaching them programmatically follows this list):
- The notebook needs read-write access to the UniProt data. For example, if the data is stored in s3://my-uniprot-data (this must be the same bucket as `STAGING_BUCKET` in `uniprot_loader.ipynb`):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": [
                "arn:aws:s3:::my-uniprot-data",
                "arn:aws:s3:::my-uniprot-data/*"
            ]
        }
    ]
}
- The notebook needs access to Bedrock. Following the principle of least privilege, you can attach a policy like this:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:Invoke*"],
            "Resource": "*"
        }
    ]
}
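As an alternative to editing the role in the console, here is a hedged boto3 sketch that attaches both documents above as inline policies; the role name is a placeholder for the role shown on your notebook's details page:

```python
import boto3
import json

iam = boto3.client("iam")
role_name = "AmazonSageMaker-ExecutionRole-example"  # placeholder: use your notebook's role

# The two policy documents shown above.
policies = {
    "uniprot-staging-s3-access": {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": [
                "arn:aws:s3:::my-uniprot-data",
                "arn:aws:s3:::my-uniprot-data/*",
            ],
        }],
    },
    "bedrock-invoke": {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": ["bedrock:Invoke*"], "Resource": "*"}],
    },
}

# Attach each document to the notebook's execution role as an inline policy.
for name, doc in policies.items():
    iam.put_role_policy(RoleName=role_name, PolicyName=name, PolicyDocument=json.dumps(doc))
```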
Create a SageMaker notebook instance. Choose Amazon Linux 2, Jupyter Lab 3 as the platform identifier. Ensure the IAM role for the instance has Bedrock access as above. Ensure the instance's network allows connectivity to both the Bedrock service and the public internet.
See https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-create-ws.html.
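If you prefer the API over the console, a minimal boto3 sketch follows; the instance name is hypothetical and the role ARN is a placeholder for a role with the Bedrock policy above:

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_notebook_instance(
    NotebookInstanceName="uniprot-nl2sparql",  # hypothetical name
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/MyNotebookRole",  # placeholder role
    PlatformIdentifier="notebook-al2-v2",  # Amazon Linux 2, JupyterLab 3
)
```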
In your notebook instance, clone this repository. Run the notebooks in the following order:
- (Optional) Open `uniprot_loader.ipynb` to load UniProt data into your Neptune database. Run through the cells: set the name of the S3 staging bucket that you created; synchronize a copy of the UniProt files from a public bucket to your staging bucket; bulk-load the files from your staging bucket into the Neptune cluster; then verify by running sample SPARQL queries on the Neptune database. (A sketch of the underlying bulk-load call appears after this list.)
- The UniProt dataset is large (several hundred GB), so to improve load time into Neptune, we recommend changing the instance type of the cluster's writer instance to r6i.12xlarge or r5.12xlarge before starting the load. When the load is complete, switch back to the instance size you were using previously.
- Open `get_expected_results.ipynb` to run each of the ground-truth example queries -- which you can find in resources/ground-truth.yaml -- against either the UniProt reference site or your Neptune database. The results are written to a local folder called `up` (if run against the UniProt reference) or `expected` (if run against the Neptune database). We provide a copy of that folder in this repo -- `up` -- for comparison.
- Open `run_gen_tests.ipynb` to test LLM generation of SPARQL from natural language UniProt questions. The notebook takes each question in the ground truth, prompts the LLM to generate SPARQL for it, and then runs the generated SPARQL against either the UniProt reference site (by default) or your Neptune database. Results are written to the local `gen_results` folder. We provide a copy of that folder in this repo -- `gen_results` -- for comparison. You can also ask your own questions; see the `run_yourown_query()` examples.
- Open `compared_expected_gen.ipynb` to compare expected and generated queries. The notebook compares the results of the previous two notebooks and presents them for side-by-side comparison, question by question, in HTML form. You can review our results in comparison.html.
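For reference, here is a hedged sketch of the kind of bulk-load request that `uniprot_loader.ipynb` issues to the Neptune loader endpoint; the cluster endpoint, bucket, and role ARN are placeholders, and the format assumes UniProt's RDF/XML distribution. Note that the loader endpoint is only reachable from inside the cluster's VPC:

```python
import requests

loader_url = "https://<your-neptune-endpoint>:8182/loader"  # placeholder cluster endpoint

payload = {
    "source": "s3://my-uniprot-data/",  # your staging bucket
    "format": "rdfxml",                 # assuming the RDF/XML UniProt distribution
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",  # placeholder
    "region": "us-east-1",
    "queueRequest": "TRUE",             # queue if another load is in progress
}

resp = requests.post(loader_url, json=payload)
print(resp.json())  # contains a loadId; poll GET {loader_url}/{loadId} for status
```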
If you are done and wish to avoid further charges, remove the solution as follows:
- Delete the CloudFormation stack you created for the Neptune cluster and notebook instance. See https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-delete-stack.html for instructions on deleting a stack.
- Remove the S3 bucket. See https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html.
- If you created a SageMaker notebook instance, remove it. See https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html.
This solution incurs cost. Refer to the pricing guides for Neptune, S3, Bedrock, and SageMaker.
Here is a rough estimate of cost based on current us-east-1 pricing. This is not a quote; your costs may differ, so check your own costs carefully.
- S3 cost: $12/month for 500 GB of storage and 1M GET requests in the S3 Standard storage class.
- SageMaker notebook cost: $11/month for a single ml.t3.medium instance running 30 percent of the time (and shut down the remainder of the time).
- Bedrock: $10 total to run the 40+ query generations with Anthropic Claude 3.5 Sonnet.
- Neptune cost during load:
- $600 for 72 hours of use of a single r5.12xlarge instance.
- $60 storage and I/O cost for 500 GB and 50M I/O requests.
- Neptune cost post-load:
- $260/month instance cost for a serverless instance consuming 16 NCUs for about 2 hours/day and idle the rest of the time.
- $50/month storage cost for 500 GB.
As mentioned above, use of Neptune is optional.