
End-to-End

End-to-end evaluation assesses the "observable" inputs and outputs of your LLM application, treating it as a black box: it evaluates exactly what users see. For simple LLM applications like basic RAG pipelines with "flat" architectures that can be represented by a single LLMTestCase, end-to-end evaluation is ideal.


Common use cases suitable for end-to-end evaluation include (non-exhaustive):

  • RAG QA
  • PDF extraction
  • Writing assistants
  • Summarization
  • etc.

You'll notice that use cases with simpler architectures are better suited for end-to-end evaluation. However, if your system is an extremely complex agentic workflow, you might also find end-to-end evaluation more suitable, as you may conclude that component-level evaluation adds too much noise to your evaluation results.

info

Most of what you saw in deepeval's quickstart is end-to-end evaluation.

Prerequisites

Select metrics

You'll need to select the appropriate metrics and ensure your LLM app returns the fields required to create end-to-end LLMTestCases. For example, AnswerRelevancyMetric() expects input and actual_output, while FaithfulnessMetric() also requires retrieval_context.
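
For instance, a test case that satisfies both of these metrics might look like the following sketch (the values are illustrative only):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Illustrative values: input and actual_output are enough for AnswerRelevancyMetric,
# while FaithfulnessMetric additionally needs retrieval_context
test_case = LLMTestCase(
    input="When did Einstein publish general relativity?",
    actual_output="Einstein published general relativity in 1915.",
    retrieval_context=["General relativity was published by Einstein in 1915."]
)
metrics = [AnswerRelevancyMetric(), FaithfulnessMetric()]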

You should first read the metrics section to understand which metrics are suitable for your use case, but the general rule of thumb is to include no more than 5 metrics in total: 2-3 generic metrics that target your system architecture, and 1-2 custom metrics that target your specific use case.

If you're unsure, feel free to ask the team for recommendations on Discord.

Setup LLM application

note

You'll need to set up your LLM application to return the test case parameters required by the metrics you've chosen above.

In this example, we'll be using the following LLM application, which has a simple, "flat" RAG architecture, to demonstrate how to run end-to-end evaluations using deepeval:

somewhere.py
from typing import List
import openai

def your_llm_app(input: str):
    def retriever(input: str):
        return ["Hardcoded text chunks from your vector database"]

    def generator(input: str, retrieved_chunks: List[str]):
        res = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Use the provided context to answer the question."},
                {"role": "user", "content": "\n\n".join(retrieved_chunks) + "\n\nQuestion: " + input}
            ]
        ).choices[0].message["content"]
        return res

    retrieval_context = retriever(input)
    return generator(input, retrieval_context), retrieval_context


print(your_llm_app("How are you?"))

If you find it inconvenient to return variables just for creating LLMTestCases, set up LLM tracing instead, which also allows you to debug end-to-end evals on Confident AI.
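
As a rough sketch only (assuming the observe decorator and update_current_span helper from deepeval's tracing module; see the LLM tracing docs for the authoritative API), a traced version of the app could set the test case from inside the function instead of returning extra variables:

# Hedged sketch: `observe` and `update_current_span` are assumed from deepeval's tracing module
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

@observe()
def your_llm_app(input: str) -> str:
    # Hardcoded stand-ins for your real retriever and generator
    retrieval_context = ["Hardcoded text chunks from your vector database"]
    res = "Hardcoded answer generated from the retrieved chunks"

    # Attach the test case to the current trace instead of returning it to the caller
    update_current_span(
        test_case=LLMTestCase(input=input, actual_output=res, retrieval_context=retrieval_context)
    )
    return res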

Run End-to-End Evals

Running an end-to-end LLM evaluation creates a test run — a collection of test cases that benchmarks your LLM application at a specific point in time. You would typically:

  • Loop through a list of Goldens
  • Invoke your LLM app with each golden’s input
  • Generate a set of test cases ready for evaluation

Once the evaluation metrics have been applied to your test cases, you get a completed test run.

You can run end-to-end LLM evaluations in either:

  • CI/CD pipelines using deepeval test run, or
  • Python scripts using the evaluate() function

Both give you exactly the same functionality and integrate 100% with Confident AI for sharable testing reports on the cloud.

Use evaluate() in Python scripts

deepeval offers an evaluate() function that allows you to evaluate end-to-end LLM interactions through a list of test cases and metrics. Each test case will be evaluated by each and every metric you define in metrics, and a test case passes only if all metrics pass.

main.py
from somewhere import your_llm_app # Replace with your LLM app

from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

goldens = [Golden(input="...")]


# Create test cases from goldens
test_cases = []
for golden in goldens:
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    test_cases.append(test_case)


# Evaluate end-to-end
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])

There are TWO mandatory and SIX optional parameters when calling the evaluate() function for END-TO-END evaluation:

  • test_cases: a list of LLMTestCases OR ConversationalTestCases, or an EvaluationDataset. You cannot evaluate LLMTestCase/MLLMTestCases and ConversationalTestCases in the same test run.
  • metrics: a list of metrics of type BaseMetric.
  • [Optional] hyperparameters: a dict of type dict[str, Union[str, int, float]]. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
  • [Optional] identifier: a string that allows you to better identify your test run on Confident AI.
  • [Optional] async_config: an instance of type AsyncConfig that allows you to customize the degree of concurrency during evaluation. Defaulted to the default AsyncConfig values.
  • [Optional] display_config: an instance of type DisplayConfig that allows you to customize what is displayed to the console during evaluation. Defaulted to the default DisplayConfig values.
  • [Optional] error_config: an instance of type ErrorConfig that allows you to customize how to handle errors during evaluation. Defaulted to the default ErrorConfig values.
  • [Optional] cache_config: an instance of type CacheConfig that allows you to customize the caching behavior during evaluation. Defaulted to the default CacheConfig values.
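
For example, the optional hyperparameters and identifier parameters might be used as follows (continuing from main.py above; the keys and values are illustrative only):

# Illustrative only: log whichever hyperparameters you want to compare across test runs
evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"model": "gpt-4o", "prompt template": "Use the provided context to answer the question."},
    identifier="rag-pipeline-v1",
)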

This is exactly the same as assert_test() in deepeval test run, but in a different interface.

Use deepeval test run in CI/CD pipelines

caution

The usual pytest command would still work but is not recommended. deepeval test run adds a range of functionality on top of Pytest for unit-testing LLMs, enabled through 8+ optional flags. Users typically include deepeval test run as a command in their .yaml files for pre-deployment checks in CI/CD pipelines (example here).

deepeval allows you to unit-test LLM applications in CI/CD pipelines using the deepeval test run command, which works just like Pytest thanks to deepeval's Pytest integration.

test_llm_app.py
from somewhere import your_llm_app # Replace with your LLM app
import pytest

from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

goldens = [Golden(input="...")]

# Loop through goldens using pytest
@pytest.mark.parametrize("golden", goldens)
def test_llm_app(golden: Golden):
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])

Then run the test file in the CLI:

deepeval test run test_llm_app.py

There are TWO mandatory and ONE optional parameter when calling the assert_test() function for END-TO-END evaluation:

  • test_case: an LLMTestCase.
  • metrics: a list of metrics of type BaseMetric.
  • [Optional] run_async: a boolean which when set to True, enables concurrent evaluation of all metrics. Defaulted to True.
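
For instance, you could disable concurrent metric execution while debugging a failing metric (illustrative; this would replace the assert_test call in test_llm_app.py above):

# Run metrics serially so individual metric failures are easier to trace
assert_test(
    test_case=test_case,
    metrics=[AnswerRelevancyMetric()],
    run_async=False,
)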

Click here to learn about different optional flags available to deepeval test run to customize asynchronous behaviors, error handling, etc.

tip

If you're logged into Confident AI, you'll also receive a fully sharable LLM testing report on the cloud. Run this in the CLI:

deepeval login