aws-samples/llm-evaluation-methodology

Evaluate and compare Large Language Models (LLMs) on AWS

Whether you're evaluating a Generative AI prototype, launching to production, or maintaining a live system - rigorous but efficient testing is vital to demonstrate robustness, optimize cost, and maximize quality.

This repository collects some code samples and deployable components that can help you efficiently evaluate and optimize the performance of LLM-enabled applications - including how to automate testing to evaluate new models and prompt templates faster.

🎓 You can also check out the accompanying guided workshop for step-by-step walkthroughs and additional information.

LLM evaluation overview

Why? ▶️ Systematically measuring and comparing the performance of LLMs (and their configurations like prompt templates) is an important step towards building useful and optimized LLM-based solutions. For example:

  • You want to try a smaller, lower-cost model, but are worried about how it performs compared to a larger model.
  • You are trying a new task with your own prompts and data, and want to see how a model performs in that specific scenario.
  • You'd like to engineer better prompts for your use-case, to achieve the best overall performance with your current model.
  • You've built a new version of a model, and want to see if the performance has gone up or down.

What? ▶️ Many factors influence which LLM, prompt templates, and overall solution architecture will be best for a particular use-case, so to inform good decisions your evaluation should consider a range of criteria, such as:

  • Usefulness: Are the answers/completions the model generates accurate and relevant to what the users need? Is it prone to hallucinate or make mistakes?
  • Cost: Is the cost well-aligned and justifiable for the value the solution will generate? What about other solution-dependent costs like the implementation time and effort, or ongoing maintenance if applicable?
  • Latency: LLMs require significant computation - how will the model's response speed affect user experience for the use-case?
  • Robustness: Does the solution give unbiased, stable and predictable answers? Does it maintain the right tone and handle unexpected topics as you'd like?
  • Safety & Security: Does the overall solution follow security best-practices? Could malicious users persuade the model to expose sensitive information, violate privacy, or generate toxic or inappropriate responses?

How? ▶️ To address this range of considerations, there's a broad spectrum of evaluation patterns and tools you can apply. For example:

  • Generic vs domain-specific: Although general-purpose benchmark datasets might give a high-level guide for shortlisting models, task-specific data for your use-case and domain might give very different (and much more relevant) results.
  • Human vs automated: While human sponsor users provide a 'gold standard' of accuracy for measuring system usefulness, you might be able to iterate much faster and optimize the solution much further by automating evaluation.
  • Supervised vs unsupervised: Even in 'unsupervised' cases where there's no labelled data or human review, it might be possible to define and measure some quality metrics automatically.
  • LLM-level vs solution-level: Common solution patterns like Retrieval-Augmented Generation (RAG), Agents, and Guardrails combine multiple tools (and perhaps multiple LLM calls) to produce final user-ready responses. LLM call-level evaluations can be useful for optimizing individual steps in these chains, whereas overall solution-level evaluations capture the final end-user experience.

Getting started

To help get your evaluation strategy up and running, this repository includes:

  • A prompt engineering sample app you can deploy in your AWS Account in a region where Amazon Bedrock is available.
  • A deployable SageMaker Pipeline with example configurations for running latency/cost performance tests using FMBench.
  • Some sample notebooks you'll want to run in an Amazon SageMaker Studio Domain - ideally in the same region for the smoothest experience.

▶️ The simplest way to set up is by deploying our S3-hosted AWS CloudFormation template (⚠️ Check the AWS Region after following the link below, and switch if needed):

Launch Stack

Alternatively, to guarantee you're in sync with the latest code updates, you can download the template from infra/cfn_bootstrap.yaml and then deploy it from the AWS CloudFormation console.
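
If you prefer to script this step rather than use the console, the sketch below shows one way to create the stack from the downloaded template with boto3. This is a minimal sketch, not the repository's documented deployment path: the stack name is an arbitrary assumption, and IAM capabilities are acknowledged because the template creates IAM resources (see the note below).

```python
# A minimal sketch, assuming you've downloaded infra/cfn_bootstrap.yaml locally.
# The stack name "llm-eval-bootstrap" is an arbitrary choice, not a repository convention.
import boto3

with open("infra/cfn_bootstrap.yaml") as f:
    template_body = f.read()

cfn = boto3.client("cloudformation")
cfn.create_stack(
    StackName="llm-eval-bootstrap",
    TemplateBody=template_body,
    # The template creates IAM resources (see the note below), so acknowledge that here:
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)
# Wait until the stack finishes creating before looking for its Outputs
cfn.get_waiter("stack_create_complete").wait(StackName="llm-eval-bootstrap")
```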

⚠️ Note: The above CloudFormation stacks create an AWS CodeBuild Project with broad IAM permissions to deploy the solution on your behalf. They're not recommended for use in production environments, where least-privilege principles should be followed.

If you'd like to customize your setup further, check out infra/README.md for details on how to configure and deploy the infrastructure from the AWS CDK source code.

High-level strategy

Maturing your organization's Generative AI / LLM evaluation strategy is an iterative journey and tooling specifics will vary depending on your use-case(s) and constraints. However, a strong LLM evaluation strategy will typically look something like:

  1. Validate the use-case and architecture: Without a clear, measurable business benefit case it will be difficult to quantify what good looks like, and decide when to go live or stop investing in marginal improvements. Even if the use-case is important to the business, is it a good fit for generative LLMs?
  2. Shortlist models: Identify a shortlist of LLMs that might be a good fit for your architecture and task.
    • Curated catalogs like Amazon Bedrock provide fully-managed, API-based access to a range of leading foundation models at different price points (see the model-listing sketch after this list).
    • Broader model hubs like Amazon SageMaker JumpStart and the Hugging Face Model Hub offer a wide selection with easy paths for deployment on pay-as-you-use Cloud infrastructure.
    • Public leaderboards like HELM and the Hugging Face Open LLM Leaderboard might give useful generic performance indications for the models they include - but might be missing some important models, or might not accurately reflect performance in your specific domain and task.
    • With automatic model evaluations for Amazon Bedrock and for Amazon SageMaker, you can test Bedrock or SageMaker-deployed foundation models with no coding required.
  3. Build task-specific dataset(s) early: Start collecting reference "test case" datasets for your specific use-case as early as possible in the project, to measure LLM quality in the context of what you're actually trying to do.
    • If your use-case is open-ended like a general-purpose chatbot, try to work with sponsor users to ensure your examples accurately reflect the ways real users will interact with the solution.
    • Collect both the most-likely/most-important cases, as well as edge cases your solution will need to handle gracefully.
    • If you already have an idea of the internal reasoning needed to answer each test case, collect that too, to enable component-level testing. For example, record which document and page the answer is derived from for RAG use-cases, or which tools should be called for agents.
    • These datasets can continue to grow and evolve through the project, but will define your baseline for "what good looks like".
  4. Start to optimize: With reference to task-specific data, run iterative evaluations to narrow your model shortlist and optimize your prompts and configurations.
    • Human evaluation jobs for Amazon Bedrock and for Amazon SageMaker can help share manual validation work across your internal teams or out to external crowd workers, so you can understand performance and iterate faster.
    • Keep a holistic view of performance, accounting for factors like latency, cost, robustness to edge cases, and potential bias - not just accuracy/quality on your target cases.
  5. Automate to accelerate: From prompt engineering to inference configuration tuning to evaluating newly-released models, there's just too much work to always test by hand.
    • Use automatic evaluation tools to measure model/prompt/solution accuracy metrics across your test datasets, allowing you to test and iterate faster (see the fmeval sketch after this list).
    • Compare human and automated evaluations on the same datasets, to measure how much trust you can place in automated heuristics aligning with human user preferences (a minimal human-vs-automated correlation sketch also follows this list).
    • As you accelerate your pace of iteration and optimization, ensure your infrastructure for version control, tracking dashboards, and (re)-deployments is keeping up.
  6. Align automated and human metrics: With the basics of automated evaluation in place, and metrics tracking how well your automated tests align with real human evaluations of LLM output quality, you're ready to consider optimizing your automated metrics themselves.
    • For simple automatic evaluation pipelines, this might mean straightforward choices, like switching to a metric that aligns better with human scores.
    • For pipelines that use LLMs to evaluate the response of other LLMs, this could include prompt engineering or even fine-tuning your evaluator model to align more closely with the collected human feedback.
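
To help with step 2, here's a minimal sketch (not part of this repository) of listing candidate Amazon Bedrock models programmatically with boto3. The output-modality and inference-type filters shown are illustrative assumptions - adjust them for your use-case.

```python
# List text-generation models available for on-demand inference in the current region
import boto3

bedrock = boto3.client("bedrock")
response = bedrock.list_foundation_models(
    byOutputModality="TEXT",
    byInferenceType="ON_DEMAND",
)
for model in response["modelSummaries"]:
    print(model["modelId"], "-", model["providerName"])
```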
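
For step 5, here's a minimal sketch of an automated accuracy run using the open-source fmeval library (which the sample app below also uses). The dataset field names, model ID, and prompt template are assumptions rather than this repository's configuration - check the fmeval documentation for the exact parameters.

```python
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Point fmeval at a JSON-Lines test dataset; "question" and "answer" are assumed field names
data_config = DataConfig(
    dataset_name="my-task-dataset",
    dataset_uri="datasets/question-answering/qa.manifest.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
)

# Wrap a Bedrock-hosted model so fmeval can call it with your prompt template
model_runner = BedrockModelRunner(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",  # assumed candidate model
    content_template=(
        '{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512, '
        '"messages": [{"role": "user", "content": "$prompt"}]}'
    ),
    output="content[0].text",
)

# Run the heuristic QA accuracy algorithm across the whole dataset and save per-record results
eval_algo = QAAccuracy(QAAccuracyConfig(target_output_delimiter="<OR>"))
results = eval_algo.evaluate(
    model=model_runner,
    dataset_config=data_config,
    prompt_template="Answer the question concisely.\n\nQuestion: $model_input\n\nAnswer:",
    save=True,
)
print(results)
```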
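
And for steps 5-6, a minimal sketch (with placeholder scores, not real data) of quantifying how well an automated metric tracks human judgements on the same test cases, using rank correlation:

```python
from scipy.stats import spearmanr

# One score per test case, on the same set of cases (placeholder values for illustration)
human_scores = [4, 5, 2, 3, 5, 1, 4]  # e.g. 1-5 ratings from human reviewers
automated_scores = [0.71, 0.93, 0.35, 0.52, 0.88, 0.20, 0.64]  # e.g. automated accuracy per case

correlation, p_value = spearmanr(human_scores, automated_scores)
print(f"Spearman rank correlation: {correlation:.2f} (p={p_value:.3f})")
```

A high rank correlation suggests the automated metric can stand in for human review during fast iteration; a low one suggests you should keep humans in the loop or improve the metric itself.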

Try out the samples

Data-driven prompt template engineering

Once your LLMEvalWkshpStack stack has been created successfully in AWS CloudFormation, select it from the list and click through to the Outputs tab where you should see:

  • An AppDomainName output with a hyperlink like ***.cloudfront.net
  • AppDemoUsername and AppDemoPassword outputs listing the credentials you can use to log in

Open the demo app and log in with the given credentials to get started.

When prompted for a dataset (unless you have your own prepared), upload the sample provided at datasets/question-answering/qa.manifest.jsonl.

⚠️ Warning: This sample app is provided to illustrate a data-driven prompt engineering workflow with automated model evaluation. It's not recommended for use with highly sensitive data or in production environments. For more information, see the infra/README.md.

You'll be able to:

  • Explore the sample dataset by expanding the 'dataset' section
  • Adjust the prompt template (noting the placeholders should match the available dataset columns)
  • Select a target model and evaluation algorithm, and change the expected reference answer field name, in the left sidebar
  • Click 'Start Evaluation' to run an evaluation with the current configuration.

Note that in addition to the default qa_accuracy evaluation algorithm from fmeval, the app provides a custom qa_accuracy_by_llm algorithm that uses Anthropic Claude to evaluate the selected model's response - rather than simple heuristics.
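
To illustrate the LLM-as-judge idea behind an algorithm like qa_accuracy_by_llm, here's a minimal sketch that asks a Bedrock-hosted Claude model to grade a candidate answer against a reference. This is not the app's actual implementation - the judge model ID, prompt wording, and scoring scale are all assumptions.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def judge_answer(question: str, reference: str, candidate: str) -> str:
    """Ask an LLM judge to score a candidate answer against a reference answer."""
    prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with only a single score from 0 (wrong) to 1 (fully correct)."
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]

print(judge_answer("What is the capital of France?", "Paris", "It's Paris."))
```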

To customize and re-deploy this app, or run the container locally, see the documentation in infra/README.md.

Example notebooks

For users who are familiar with Python and comfortable running code, we provide example notebooks demonstrating other evaluation techniques:

These notebooks have been tested on Amazon SageMaker Studio.

Clean-up

Once you're done experimenting, you can delete the deployed stacks from the CloudFormation Console.

You may need to manually delete the container image(s) from your sm-fmbench repository in the Amazon ECR Console for the LLMPerfTestStack stack to delete successfully.
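
If you'd rather script that clean-up, the sketch below uses boto3 to empty the sm-fmbench repository (it assumes a single page of results; doing the same from the ECR console works just as well):

```python
# Delete all images from the sm-fmbench ECR repository so CloudFormation can remove it
import boto3

ecr = boto3.client("ecr")
image_ids = ecr.list_images(repositoryName="sm-fmbench")["imageIds"]
if image_ids:
    ecr.batch_delete_image(repositoryName="sm-fmbench", imageIds=image_ids)
```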

⚠️ Note that some of the lab exercises / notebooks may ask you to manually create additional resources, which you will also need to manually delete to avoid ongoing charges. In particular:

  1. Delete any SageMaker Endpoints you may have deployed for testing Mistral and Llama models in workshop lab 1
  2. Delete the Bedrock Knowledge Base you may have deployed for exploring RAG and end-to-end testing

Further reading and tools

  • FMBench is an open-source Python package from AWS that can help run performance and cost benchmarking of foundation models deployed on Amazon SageMaker and Amazon Bedrock.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file. The sample datasets provided in datasets/question-answering are transformed subsets of the Stanford Question Answering Dataset (SQuAD) v2.0 dev partition (original available for download here), and are provided under CC-BY-SA-4.0. See the datasets/question-answering/LICENSE file.
