Introduction
Quick Summary
Evaluation refers to the process of testing your LLM application's outputs, and requires the following components:
- Test cases
- Metrics
- Evaluation dataset
Here's a diagram of what an ideal evaluation workflow looks like using `deepeval`:

Your test cases will typically be in a single Python file, and executing them will be as easy as running `deepeval test run`:

```bash
deepeval test run test_example.py
```
Click here for an end-to-end tutorial on how to evaluate an LLM medical chatbot using `deepeval`.
Metrics
`deepeval` offers 30+ evaluation metrics, most of which are evaluated using LLMs (visit the metrics section to learn why).

```python
```
```python
from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy_metric = AnswerRelevancyMetric()
```
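Most metrics also accept optional arguments such as a passing threshold, the judge model, and whether to generate a reason for the score. The keyword names below are a hedged sketch based on common `deepeval` usage and may differ between versions:

```python
from deepeval.metrics import AnswerRelevancyMetric

# A sketch of customizing a metric; keyword names are assumptions and
# may differ across deepeval versions.
answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,        # minimum score for the metric to pass
    model="gpt-4o",       # LLM used as the evaluation judge
    include_reason=True,  # also generate a reason alongside the score
)
```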
You'll need to create a test case to run `deepeval`'s metrics.
Test Cases
In `deepeval`, a test case allows you to use evaluation metrics you have defined to unit test LLM applications.
```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Who is the current president of the United States of America?",
    actual_output="Joe Biden",
    retrieval_context=["Joe Biden serves as the current president of America."]
)
```
In this example, `input` mimics a user interaction with a RAG-based LLM application, where `actual_output` is the output of your LLM application and `retrieval_context` contains the retrieved nodes in your RAG pipeline. Creating a test case allows you to evaluate using `deepeval`'s default metrics:
```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy_metric = AnswerRelevancyMetric()
test_case = LLMTestCase(
    input="Who is the current president of the United States of America?",
    actual_output="Joe Biden",
    retrieval_context=["Joe Biden serves as the current president of America."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
```
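After `measure()` has run, the metric object typically exposes more than just a score. Continuing the snippet above, the attributes below are a hedged sketch of what you can inspect; the exact names may vary between versions, so check the metrics documentation:

```python
# Inspecting the metric after measure() -- attribute names are assumptions
# based on common deepeval metric interfaces.
print(answer_relevancy_metric.score)            # numeric score, typically between 0 and 1
print(answer_relevancy_metric.reason)           # LLM-generated explanation for the score
print(answer_relevancy_metric.is_successful())  # True if the score meets the threshold
```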
Datasets
Datasets in `deepeval` are collections of test cases. They provide a centralized interface for you to evaluate a collection of test cases using one or multiple metrics.
```python
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy_metric = AnswerRelevancyMetric()
test_case = LLMTestCase(
    input="Who is the current president of the United States of America?",
    actual_output="Joe Biden",
    retrieval_context=["Joe Biden serves as the current president of America."]
)

dataset = EvaluationDataset(test_cases=[test_case])
dataset.evaluate([answer_relevancy_metric])
```
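Since a dataset is just a collection of test cases, you can also build one up incrementally. The `add_test_case()` method below is an assumption about the `EvaluationDataset` interface; verify it against the datasets documentation for your version:

```python
from deepeval.dataset import EvaluationDataset

# Building a dataset incrementally -- add_test_case() is assumed to exist
# on EvaluationDataset; confirm against your installed deepeval version.
dataset = EvaluationDataset()
dataset.add_test_case(test_case)

dataset.evaluate([answer_relevancy_metric])
```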
You don't need to create an evaluation dataset to evaluate individual test cases. Visit the test cases section to learn how to assert individual test cases.
Synthesizer
In `deepeval`, the `Synthesizer` allows you to generate synthetic datasets. This is especially helpful if you don't have production data or a golden dataset to evaluate with.
```python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf']
)

dataset = EvaluationDataset(goldens=goldens)
```
`deepeval`'s `Synthesizer` is highly customizable, and you can learn more about it here.
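Goldens don't contain an `actual_output`, so before evaluating you typically run your own LLM application on each golden's input to turn it into a complete `LLMTestCase`. The loop below is a minimal sketch continuing the snippet above; `your_llm_app` is a hypothetical stand-in for your application, and the golden attributes used are assumptions about the `Golden` type:

```python
from deepeval.test_case import LLMTestCase

# Turn goldens into complete test cases by generating actual outputs.
# `your_llm_app` is a hypothetical function representing your LLM application.
test_cases = []
for golden in goldens:
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=your_llm_app(golden.input),
            expected_output=golden.expected_output,
        )
    )

dataset = EvaluationDataset(test_cases=test_cases)
```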
Evaluating With Pytest
Although `deepeval` integrates with Pytest, we highly recommend that you AVOID executing `LLMTestCase`s directly via the `pytest` command, as this can lead to unexpected errors.

`deepeval` allows you to run evaluations as if you're using Pytest via our Pytest integration. Simply create a test file:
```python
import pytest

from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

dataset = EvaluationDataset(test_cases=[...])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric()
    assert_test(test_case, [answer_relevancy_metric])
```
And run the test file in the CLI using `deepeval test run`:

```bash
deepeval test run test_example.py
```
There are TWO mandatory and ONE optional parameter when calling the `assert_test()` function:

- `test_case`: an `LLMTestCase`.
- `metrics`: a list of metrics of type `BaseMetric`.
- [Optional] `run_async`: a boolean which, when set to `True`, enables concurrent evaluation of all metrics. Defaulted to `True`.
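For example, to force the metrics to run sequentially you can pass the optional parameter explicitly. This is a minimal sketch reusing the test case and metric names from above:

```python
# Disable concurrent metric evaluation for this assertion.
assert_test(test_case, [answer_relevancy_metric], run_async=False)
```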
You can find the full documentation on `deepeval test run`, including how to use it with tracing, on the running LLM-evals page.
`@pytest.mark.parametrize` is a decorator offered by Pytest. It simply loops through your `EvaluationDataset` to evaluate each test case individually.
You can include the `deepeval test run` command as a step in a `.yaml` file in your CI/CD workflows to run pre-deployment checks on your LLM application.
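A minimal sketch of what such a step could look like in a GitHub Actions workflow; the job layout, file names, and secret names here are illustrative assumptions, not a prescribed configuration:

```yaml
# .github/workflows/llm-evals.yml -- illustrative sketch only
name: LLM evaluations

on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install deepeval
      - name: Run deepeval test suite
        run: deepeval test run test_example.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```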
Evaluating Without Pytest
Alternatively, you can use `deepeval`'s `evaluate` function. This approach avoids the CLI (useful if you're in a notebook environment), and allows for parallel test execution as well.
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(test_cases=[...])
answer_relevancy_metric = AnswerRelevancyMetric()

evaluate(dataset, [answer_relevancy_metric])
```
There are TWO mandatory and SIX optional parameters when calling the `evaluate()` function:

- `test_cases`: a list of `LLMTestCase`s OR `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
- [Optional] `hyperparameters`: a dict of type `dict[str, Union[str, int, float]]`. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI (see the sketch after this list).
- [Optional] `identifier`: a string that allows you to better identify your test run on Confident AI.
- [Optional] `async_config`: an instance of type `AsyncConfig` that allows you to customize the degree of concurrency during evaluation. Defaulted to the default `AsyncConfig` values.
- [Optional] `display_config`: an instance of type `DisplayConfig` that allows you to customize what is displayed to the console during evaluation. Defaulted to the default `DisplayConfig` values.
- [Optional] `error_config`: an instance of type `ErrorConfig` that allows you to customize how to handle errors during evaluation. Defaulted to the default `ErrorConfig` values.
- [Optional] `cache_config`: an instance of type `CacheConfig` that allows you to customize the caching behavior during evaluation. Defaulted to the default `CacheConfig` values.
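For instance, a call that logs hyperparameters and tags the run with an identifier might look like the following sketch; the keyword names are assumptions based on the parameter list above, so double-check them against the current `evaluate()` signature:

```python
from deepeval import evaluate

# A sketch of evaluate() using two of the optional parameters described above.
# Keyword names are assumptions; verify against your deepeval version.
evaluate(
    test_cases=dataset.test_cases,
    metrics=[answer_relevancy_metric],
    hyperparameters={"model": "gpt-4o", "prompt version": "v1", "temperature": 0},
    identifier="pre-deployment-run",
)
```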
You can find the full documentation on `evaluate()`, including how to use it with tracing, on the running LLM-evals page.
You can also replace `dataset` with a list of test cases, as shown in the test cases section.
Evaluating Nested Components
You can also run metrics on nested components by setting up tracing in `deepeval`, which requires under 10 lines of code:
```python
import openai

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span_test_case

@observe(metrics=[AnswerRelevancyMetric()])
def complete(query: str):
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
    # Attach a test case to the current span so the metric above can evaluate it
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response
```
This is especially useful if you:

- Want to run a different set of metrics on different components
- Wish to evaluate multiple components at once
- Don't want to rewrite your codebase just to bubble up returned variables to create an `LLMTestCase`
You can also simply configure `deepeval` to not run any metrics when you're running your LLM application in production. For the full guide on evaluating with tracing, visit this page.