Test Cases

Quick Summary

A test case is a blueprint provided by deepeval to unit test LLM outputs. There are two types of test cases in deepeval: LLMTestCase and ConversationalTestCase.

caution

Throughout this documentation, you should assume the term 'test case' refers to an LLMTestCase instead of a ConversationalTestCase.

An LLMTestCase is the most prominent type of test case in deepeval and represents a single, atomic unit of interaction with your LLM app. It has NINE parameters:

input
actual_output
[Optional] expected_output
[Optional] context
[Optional] retrieval_context
[Optional] tools_called
[Optional] expected_tools
[Optional] token_cost
[Optional] completion_time

Here's an example implementation of an LLMTestCase:

from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    expected_output="You're eligible for a 30 day refund at no extra cost.",
    actual_output="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30 day full refund at no extra cost."],
    retrieval_context=["Only shoes can be refunded."],
    tools_called=[ToolCall(name="WebSearch")]
)

info

Since deepeval is an LLM evaluation framework, the input and actual_output are always mandatory. However, this does not mean they are necessarily used for evaluation, and you can also supply optional parameters such as tools_called for each LLMTestCase.

To get your own sharable testing report with deepeval, sign up to Confident AI, or run deepeval login in the CLI:

deepeval login

What Is An LLM "Interaction"?

An LLM interaction is any discrete exchange of information between components of your LLM system — from a full user request to a single internal step. The scope of interaction is arbitrary and is entirely up to you.

note

Since an LLMTestCase represents a single, atomic unit of interaction in your LLM app, it is important to understand what this means.

Let’s take this LLM system as an example:

There are different ways you can scope an interaction:

Agent-Level: The entire process initiated by the agent, including the RAG pipeline and web search tool usage
RAG Pipeline: Just the RAG flow — retriever + LLM
- Retriever: Only test whether relevant documents are being retrieved
- LLM: Focus purely on how well the LLM generates text from the input/context

An interaction is where you want to define your LLMTestCase. For example, when using RAG-specific metrics like AnswerRelevancyMetric, FaithfulnessMetric, or ContextualRelevancyMetric, the interaction is best scoped at the RAG pipeline level.

In this case:

input should be the user question or text to embed
retrieval_context should be the retrieved documents from the retriever
actual_output should be the final response generated by the LLM

If you want to evaluate using the ToolCorrectnessMetric, however, you'll need to create an LLMTestCase at the agent level and supply the tools_called parameter instead.

We'll go through the requirements for an LLMTestCase before showing how to create an LLMTestCase for an interaction.

tip

For users starting out, scoping the interaction as the overall LLM application will be the easiest way to run evals.

LLM Test Case

An LLMTestCase in deepeval can be used to unit test LLM application (which can just be an LLM itself) outputs, which includes use cases such as RAG and LLM agents (for individual components, agents within agents, or the agent altogether). It contains the necessary information (tools_called for agents, retrieval_context for RAG, etc.) to evaluate your LLM application for a given input.

Different metrics will require a different combination of LLMTestCase parameters, but they all require an input and actual_output, regardless of whether they are used for evaluation or not. For example, you won't need expected_output, context, tools_called, and expected_tools if you're just measuring answer relevancy, but if you're evaluating hallucination you'll have to provide context in order for deepeval to know what the ground truth is.

With the exception of conversational metrics, which are metrics to evaluate conversations instead of individual LLM responses, you can use any LLM evaluation metric deepeval offers to evaluate an LLMTestCase.

note

You cannot use conversational metrics to evaluate an LLMTestCase. Conveniently, most metrics in deepeval are non-conversational.

Keep reading to learn which parameters in an LLMTestCase are required to evaluate different aspects of an LLM application, ranging from pure LLMs to RAG pipelines and even LLM agents.

Input

The input mimics a user interacting with your LLM application. The input is the direct input to your prompt template, and so SHOULD NOT CONTAIN your prompt template.

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Why did the chicken cross the road?",
    # Replace this with your actual LLM application
    actual_output="Quite frankly, I don't want to know..."
)
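One way to keep the prompt template out of input is to apply the template only at the point where the LLM is called. The following is a minimal sketch under stated assumptions: PROMPT_TEMPLATE, call_llm, and run_app are hypothetical names that are not part of deepeval, and the resulting dict simply holds the values you would pass to LLMTestCase.

```python
# Hypothetical prompt template; not part of deepeval.
PROMPT_TEMPLATE = "You are a helpful assistant.\nQuestion: {user_input}\nAnswer:"


def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM provider call.
    return "Quite frankly, I don't want to know..."


def run_app(user_input: str) -> str:
    # The template is applied here, inside the application.
    prompt = PROMPT_TEMPLATE.format(user_input=user_input)
    return call_llm(prompt)


user_input = "Why did the chicken cross the road?"

# These are the values you would pass to LLMTestCase(input=..., actual_output=...).
test_case_fields = {
    "input": user_input,  # raw user input, no template text
    "actual_output": run_app(user_input),
}
```

Keeping the template inside run_app means the same input can be reused unchanged when you compare different prompt templates.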

tip

Not all inputs should include your prompt template, as this is determined by the metric you're using. Furthermore, the input should NEVER be a JSON version of the list of messages you are passing into your LLM.

If you're logged into Confident AI, you can associate hyperparameters such as prompt templates with each test run to easily figure out which prompt template gives the best actual_outputs for a given input:

deepeval login

test_file.py
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_llm():
    test_case = LLMTestCase(input="...", actual_output="...")
    answer_relevancy_metric = AnswerRelevancyMetric()
    assert_test(test_case, [answer_relevancy_metric])

# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4o", prompt_template="...")
def hyperparameters():
    # You can also return an empty dict {} if there's no additional parameters to log
    return {
        "temperature": 1,
        "chunk size": 500
    }

deepeval test run test_file.py

Actual Output

The actual_output is simply what your LLM application returns for a given input. This is what your users are going to interact with. Typically, you would import your LLM application (or parts of it) into your test file, and invoke it at runtime to get the actual output.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"

test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input)
)

note

You may also choose to evaluate with precomputed actual_outputs, instead of generating actual_outputs at evaluation time.
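If your outputs were generated ahead of time, you only need to load them and map each record onto the test case fields. A minimal sketch, assuming the precomputed results are stored as JSON lines with hypothetical input and actual_output keys (the storage format is an assumption, not something deepeval prescribes):

```python
import json

# Hypothetical precomputed results, e.g. exported from an earlier run.
precomputed_jsonl = """
{"input": "Why did the chicken cross the road?", "actual_output": "To get to the other side!"}
{"input": "What if these shoes don't fit?", "actual_output": "We offer a 30-day full refund at no extra cost."}
"""

# Parse one JSON object per non-empty line.
records = [json.loads(line) for line in precomputed_jsonl.splitlines() if line.strip()]

# Each dict holds the keyword arguments for one LLMTestCase(**fields).
test_case_fields = [
    {"input": r["input"], "actual_output": r["actual_output"]} for r in records
]
```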

Expected Output

The expected_output is literally what you would want the ideal output to be. Note that this parameter is optional depending on the metric you want to evaluate.

The expected output doesn't have to exactly match the actual output in order for your test case to pass since deepeval uses a variety of methods to evaluate non-deterministic LLM outputs. We'll go into more details in the metrics section.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"

test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    expected_output="To get to the other side!"
)

Context

The context is an optional parameter that represents additional data received by your LLM application as supplementary sources of golden truth. You can view it as the ideal segment of your knowledge base relevant to a specific input. Context allows your LLM to generate customized outputs that are outside the scope of the data it was trained on.

In RAG applications, contextual information is typically stored in your selected vector database, which is represented by retrieval_context in an LLMTestCase and is not to be confused with context. Conversely, for a fine-tuning use case, this data is usually found in training datasets used to fine-tune your model. Providing the appropriate contextual information when constructing your evaluation dataset is one of the most challenging parts of evaluating LLMs, since data in your knowledge base can constantly be changing.

Unlike input and actual_output, which are strings, context accepts a list of strings.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"

test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    expected_output="To get to the other side!",
    context=["The chicken wanted to cross the road."]
)

note

People often confuse expected_output with context due to their similar level of factual accuracy. However, while both are (or should be) factually correct, expected_output also takes aspects like tone and linguistic patterns into account, whereas context is strictly factual.

Retrieval Context

The retrieval_context is an optional parameter that represents your RAG pipeline's retrieval results at runtime. By providing retrieval_context, you can determine how well your retriever is performing using context as a benchmark.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"

test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    expected_output="To get to the other side!",
    context=["The chicken wanted to cross the road."],
    retrieval_context=["The chicken liked the other side of the road better"]
)

note

Remember, context is the ideal retrieval result for a given input and typically comes from your evaluation dataset, whereas retrieval_context is your LLM application's actual retrieval result. So, while they might look similar at times, they are not the same.
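To make the distinction concrete, the sketch below compares the ideal context against the actual retrieval_context using exact string matching. This is only an illustration of using context as a benchmark; retrieval_recall is a hypothetical helper, and it is not how deepeval scores retrieval.

```python
from typing import List


def retrieval_recall(context: List[str], retrieval_context: List[str]) -> float:
    """Fraction of ideal context chunks that the retriever actually returned."""
    if not context:
        return 0.0
    retrieved = {chunk.strip().lower() for chunk in retrieval_context}
    hits = sum(1 for chunk in context if chunk.strip().lower() in retrieved)
    return hits / len(context)


context = ["The chicken wanted to cross the road."]
retrieval_context = ["The chicken liked the other side of the road better"]

# The retrieved chunk differs from the ideal chunk, so nothing matches.
score = retrieval_recall(context, retrieval_context)
```

Exact matching is deliberately crude: two chunks can carry the same facts with different wording, which is why a real evaluation needs something more tolerant than string equality.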

Tools Called

The tools_called parameter is an optional parameter that represents the tools your LLM agent actually invoked during execution. By providing tools_called, you can evaluate how effectively your LLM agent utilized the tools available to it.

note

The tools_called parameter accepts a list of ToolCall objects.

class ToolCall(BaseModel):
    name: str
    description: Optional[str] = None
    reasoning: Optional[str] = None
    output: Optional[Any] = None
    input_parameters: Optional[Dict[str, Any]] = None

A ToolCall object accepts 1 mandatory and 4 optional parameters:

name: a string representing the name of the tool.
[Optional] description: a string describing the tool's purpose.
[Optional] reasoning: A string explaining the agent's reasoning to use the tool.
[Optional] output: The tool's output, which can be of any data type.
[Optional] input_parameters: A dictionary with string keys representing the input parameters (and respective values) passed into the tool function.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase, ToolCall

input = "Why did the chicken cross the road?"

test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    # Replace this with the tools that were actually used
    tools_called=[
        ToolCall(
            name="Calculator Tool",
            description="A tool that calculates mathematical equations or expressions.",
            input_parameters={"user_input": "2+3"},
            output=5,
        ),
        ToolCall(
            name="WebSearch Tool",
            reasoning="Knowledge base does not detail why the chicken crossed the road.",
            input_parameters={"search_query": "Why did the chicken cross the road?"},
            output="Because it wanted to, duh.",
        ),
    ],
)

info

tools_called and expected_tools are LLM test case parameters that are utilized only in agentic evaluation metrics. These parameters allow you to assess the tool usage correctness of your LLM application and ensure that it meets the expected tool usage standards.

Expected Tools

The expected_tools parameter is an optional parameter that represents the tools that ideally should have been used to generate the output. By providing expected_tools, you can assess whether your LLM application used the tools you anticipated for optimal performance.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase, ToolCall

input = "Why did the chicken cross the road?"

test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    # Replace this with the tools that were actually used
    tools_called=[
        ToolCall(
            name="Calculator Tool",
            description="A tool that calculates mathematical equations or expressions.",
            input_parameters={"user_input": "2+3"},
            output=5,
        ),
        ToolCall(
            name="WebSearch Tool",
            reasoning="Knowledge base does not detail why the chicken crossed the road.",
            input_parameters={"search_query": "Why did the chicken cross the road?"},
            output="Because it wanted to, duh.",
        ),
    ],
    # Replace this with the tools you expected to be used
    expected_tools=[
        ToolCall(
            name="WebSearch Tool",
            reasoning="Knowledge base does not detail why the chicken crossed the road.",
            input_parameters={"search_query": "Why did the chicken cross the road?"},
            output="Because it needed to escape from the hungry humans.",
        ),
    ],
)
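The idea of comparing tools_called against expected_tools can be sketched with a name-only comparison. This is a simplified illustration under stated assumptions: tool_name_overlap is a hypothetical helper, and it does not reproduce the scoring of deepeval's agentic metrics.

```python
from typing import List


def tool_name_overlap(called: List[str], expected: List[str]) -> float:
    """Share of expected tool names that appear among the called tool names."""
    if not expected:
        return 1.0
    called_set = set(called)
    return sum(1 for name in expected if name in called_set) / len(expected)


tools_called = ["Calculator Tool", "WebSearch Tool"]
expected_tools = ["WebSearch Tool"]

# Every expected tool was called, so the overlap is complete.
score = tool_name_overlap(tools_called, expected_tools)
```

Note that this sketch ignores extra tools, call order, input parameters, and outputs, all of which may matter when judging real tool usage.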

Token cost

The token_cost is an optional parameter of type float that allows you to log the cost of a particular LLM interaction for a particular LLMTestCase. No metrics use this parameter by default, and it is most useful for either:

Building custom metrics that rely on token_cost
Logging token_cost on Confident AI

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(token_cost=1.32, ...)

Completion Time

The completion_time is an optional parameter that, like token_cost, is of type float. It allows you to log the time in SECONDS it took for an LLM interaction for a particular LLMTestCase to complete. No metrics use this parameter by default, and it is most useful for either:

Building custom metrics that rely on completion_time
Logging completion_time on Confident AI

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(completion_time=7.53, ...)
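Both values usually have to be measured by your own code. The sketch below shows one way to obtain them, assuming a hypothetical fake_app in place of your LLM application and hypothetical per-token prices; the resulting dict holds values you could pass to LLMTestCase.

```python
import time


def run_and_measure(app, user_input: str) -> dict:
    """Call `app` once and return LLMTestCase-style fields with timing attached."""
    start = time.perf_counter()
    output = app(user_input)
    elapsed_seconds = time.perf_counter() - start
    return {
        "input": user_input,
        "actual_output": output,
        "completion_time": elapsed_seconds,  # seconds, as a float
    }


def estimate_token_cost(prompt_tokens: int, completion_tokens: int,
                        prompt_price: float, completion_price: float) -> float:
    """Cost from token counts and per-token prices (all values hypothetical)."""
    return prompt_tokens * prompt_price + completion_tokens * completion_price


def fake_app(user_input: str) -> str:
    # Hypothetical stand-in for your LLM application.
    return "To get to the other side!"


fields = run_and_measure(fake_app, "Why did the chicken cross the road?")
fields["token_cost"] = estimate_token_cost(1000, 200, 0.000001, 0.000002)
```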

Create Your First `LLMTestCase`

Since an LLMTestCase represents an LLM interaction, you can either create LLMTestCases for the overall LLM application (best for simpler systems), or for nested components within your system. Let’s take this LLM system as an example:

Which has this implementation:

from typing import List

def web_search(query: str) -> str:
    # <--Include implementation to search web here-->
    return "Latest search results for: " + query

def retrieve_documents(query: str) -> List[str]:
    # <--Include implementation to fetch from vector database here-->
    return ["Document 1: This is relevant information about the query."]

def generate_response(input: str) -> str:
    # <--Include format prompts and call your LLM provider here-->
    return "Generated response based on the prompt: " + input

def rag_pipeline(query: str) -> str:
    # Calls retriever and llm
    context = "\n".join(retrieve_documents(query))
    response = generate_response(f"Context: {context}\nQuery: {query}")
    return response

def research_agent(query: str) -> str:
    # Calls RAG pipeline
    initial_response = rag_pipeline(query)

    # Use web search tool on the results
    search_results = web_search(initial_response)

    # Generate final response incorporating both RAG and search results
    final_response = generate_response(
        f"Initial response: {initial_response}\n"
        f"Additional search results: {search_results}\n"
        f"Query: {query}"
    )
    return final_response


research_agent("What is the weather like in San Francisco?")

Remember, the scope of an interaction or component in your LLM app is arbitrary and entirely up to you to define.

info

You will find yourself having different evaluation metrics for different components/interactions.

For The Overall LLM App

Creating an LLMTestCase for the overall LLM app is straightforward (and in fact the example we used in the quickstart):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
...

input = "What is the weather like in San Francisco?"

test_case = LLMTestCase(
    input=input,
    actual_output=research_agent("What is the weather like in San Francisco?")
)

evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric()])

However, you'll notice that we're missing out on a lot of information in our LLM system, including the quality of the RAG pipeline. In fact, it is nearly impossible to evaluate the retriever of our RAG pipeline without bubbling the returned retrieved text chunks back to the top-level function.

For Deeply Nested Components

Evaluating an LLM interaction for a nested component, such as the RAG Pipeline in our research_agent example, involves adding deepeval's @observe decorators to your application and setting the LLMTestCase at runtime instead. This is what it looks like:

tracing_example.py
from typing import List

from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span_test_case

def web_search(query: str) -> str:
    # <--Include implementation to search web here-->
    return "Latest search results for: " + query


def retrieve_documents(query: str) -> List[str]:
    # <--Include implementation to fetch from vector database here-->
    return ["Document 1: This is relevant information about the query."]


def generate_response(input: str) -> str:
    # <--Include format prompts and call your LLM provider here-->
    return "Generated response based on the prompt: " + input


@observe(
    type="custom", name="RAG Pipeline", metrics=[ContextualRelevancyMetric()]
)
def rag_pipeline(query: str) -> str:
    # Calls retriever and llm
    docs = retrieve_documents(query)
    context = "\n".join(docs)
    response = generate_response(f"Context: {context}\nQuery: {query}")

    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response, retrieval_context=docs)
    )
    return response


@observe(type="agent")
def research_agent(query: str) -> str:
    # Calls RAG pipeline
    initial_response = rag_pipeline(query)

    # Use web search tool on the results
    search_results = web_search(initial_response)

    # Generate final response incorporating both RAG and search results
    final_response = generate_response(
        f"Initial response: {initial_response}\n"
        f"Additional search results: {search_results}\n"
        f"Query: {query}"
    )
    return final_response

info

You could also have added the @observe decorator to the web_search(), retrieve_documents(), and generate_response() functions, for example, if you wish to trace and/or evaluate those components as well.

As you can see, with fewer than 10 lines of code you can immediately trace and evaluate nested components in your LLM app, no matter how complicated your system may be. You can also decorate as many or as few functions as you wish in your codebase to evaluate more interactions in your LLM system.

Finally, you would define a list of Goldens instead of an LLMTestCase to initiate an evaluation using your @observed LLM application in the evaluate() function. Goldens are basically test cases that are not ready for evaluation. That's why we start an evaluation with a list of goldens and create the LLMTestCases at evaluation time.

from deepeval.dataset import Golden
...

golden = Golden(input="What is the weather like in San Francisco?")

evaluate(goldens=[golden], traceable_callback=research_agent)
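Conceptually, a golden carries an input but no actual_output yet, and the missing output is filled in by running your application at evaluation time. The sketch below illustrates that idea with plain Python; goldens_to_test_case_fields and fake_research_agent are hypothetical names, and this is not deepeval's internal implementation.

```python
from typing import Callable, Dict, List


def goldens_to_test_case_fields(
    golden_inputs: List[str], app: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Run the app on each golden input to fill in the missing actual_output."""
    return [{"input": text, "actual_output": app(text)} for text in golden_inputs]


def fake_research_agent(query: str) -> str:
    # Hypothetical stand-in for the research_agent defined earlier.
    return "Generated response for: " + query


fields = goldens_to_test_case_fields(
    ["What is the weather like in San Francisco?"], fake_research_agent
)
```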

Conversational Test Case

A ConversationalTestCase in deepeval is simply a list of conversation turns represented by a list of LLMTestCases. While an LLMTestCase represents an individual LLM system interaction, a ConversationalTestCase encapsulates a series of LLMTestCases that make up an LLM-based conversation. This is particularly useful if, for example, you're looking to evaluate a conversation between a user and an LLM-based chatbot.

While you cannot use a conversational metric on an LLMTestCase, a ConversationalTestCase can be evaluated using both non-conversational and conversational metrics.

from deepeval.test_case import LLMTestCase, ConversationalTestCase

llm_test_case = LLMTestCase(
    # Replace this with your user input
    input="Why did the chicken cross the road?",
    # Replace this with your actual LLM application
    actual_output="Quite frankly, I don't want to know..."
)

test_case = ConversationalTestCase(turns=[llm_test_case])

note

Similar to how the term 'test case' refers to an LLMTestCase if not explicitly specified, the term 'metrics' also refers to non-conversational metrics throughout deepeval.

Turns

The turns parameter is a list of LLMTestCases, each representing one message exchange in a user-LLM conversation. Different conversational metrics require different LLMTestCase parameters for evaluation, while regular LLM system metrics use the last LLMTestCase in turns to carry out evaluation.

from deepeval.test_case import LLMTestCase, ConversationalTestCase

test_case = ConversationalTestCase(turns=[LLMTestCase(...)])

Did you know?

You can apply both non-conversational and conversational metrics to a ConversationalTestCase. Conversational metrics evaluate the entire conversation as a whole, and non-conversational metrics (which are metrics used for individual LLMTestCases), when applied to a ConversationalTestCase, evaluate only its last turn. This is because it is more useful to evaluate the latest LLM actual_output given the previous conversation context than to evaluate every individual turn in a ConversationalTestCase.
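The last-turn behavior described above can be pictured with plain dicts standing in for LLMTestCases. This is only an illustration of the selection rule, not deepeval code:

```python
turns = [
    {"input": "Hi", "actual_output": "Hello! How can I help?"},
    {"input": "Why did the chicken cross the road?",
     "actual_output": "Quite frankly, I don't want to know..."},
]

# A non-conversational metric would look only at the final turn.
last_turn = turns[-1]

# Earlier turns remain available as conversation history.
history = turns[:-1]
```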

Chatbot Role

The chatbot_role parameter is an optional parameter that specifies what role the chatbot is supposed to play. This is currently only required for the RoleAdherenceMetric, where it is particularly useful for a role-playing evaluation use case.

from deepeval.test_case import LLMTestCase, ConversationalTestCase

test_case = ConversationalTestCase(
    chatbot_role="...",
    turns=[LLMTestCase(...)]
)

MLLM Test Case

An MLLMTestCase in deepeval is designed to unit test outputs from MLLM (Multimodal Large Language Model) applications. Unlike an LLMTestCase, which only handles textual parameters, an MLLMTestCase accepts both text and image inputs and outputs. This is particularly useful for evaluating tasks such as text-to-image generation or MLLM-driven image editing.

caution

You may only evaluate MLLMTestCases using multimodal metrics such as VIEScore.

from deepeval.test_case import MLLMTestCase, MLLMImage

mllm_test_case = MLLMTestCase(
    # Replace this with your user input
    input=["Change the color of the shoes to blue.", MLLMImage(url="./shoes.png", local=True)],
    # Replace this with your actual MLLM application
    actual_output=["The original image of red shoes now shows the shoes in blue.", MLLMImage(url="https://shoe-images.com/edited-shoes", local=False)]
)

Input

The input mimics a user interacting with your MLLM application. Like an LLMTestCase input, an MLLMTestCase input is the direct input to your prompt template, and so SHOULD NOT CONTAIN your prompt template.

from deepeval.test_case import MLLMTestCase, MLLMImage

mllm_test_case = MLLMTestCase(
    input=["Change the color of the shoes to blue.", MLLMImage(url="./shoes.png", local=True)]
)

info

The input parameter accepts a list of strings and MLLMImages, which is a class specific to deepeval. The MLLMImage class accepts an image path and automatically sets the local attribute to true or false depending on whether the image is locally stored or hosted online. By default, local is set to false.

from deepeval.test_case import MLLMImage

# Example of using the MLLMImage class
# (the `url` keyword matches the other MLLMImage examples on this page)
image_input = MLLMImage(url="path/to/image.jpg")

# image_input.local will automatically be set to `True` if the image is local
# and `False` if the image is hosted online.
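This page does not show how the local attribute is detected. As a rough mental model only, one plausible heuristic is to treat anything without an http or https scheme as a local path. The looks_local helper below is hypothetical and is not deepeval's implementation:

```python
from urllib.parse import urlparse


def looks_local(path_or_url: str) -> bool:
    """Treat anything without an http(s) scheme as a local path."""
    scheme = urlparse(path_or_url).scheme.lower()
    return scheme not in ("http", "https")


local_flag = looks_local("./shoes.png")
remote_flag = looks_local("https://shoe-images.com/edited-shoes")
```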

Actual Output

The actual_output is simply what your MLLM application returns for a given input. Similarly, it also accepts a list of strings and MLLMImages.

from deepeval.test_case import MLLMTestCase, MLLMImage

mllm_test_case = MLLMTestCase(
    input=["Change the color of the shoes to blue.", MLLMImage(url="./shoes.png", local=True)],
    actual_output=["The original image of red shoes now shows the shoes in blue.", MLLMImage(url="https://shoe-images.com/edited-shoes", local=False)]
)

Assert A Test Case

Before going through the final sections, we highly recommend logging in to Confident AI (the platform powering deepeval) via the CLI. This way, you can keep track of all evaluation results generated each time you execute deepeval test run.

deepeval login

Similar to Pytest, deepeval allows you to assert any test case you create by calling the assert_test function and running deepeval test run via the CLI.

A test case passes only if all metrics pass. Depending on the metric, a combination of input, actual_output, expected_output, context, and retrieval_context is used to ascertain whether its criteria have been met.

test_assert_example.py
# A hypothetical LLM application example
import chatbot
import deepeval
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_assert_example():
    input = "Why did the chicken cross the road?"
    test_case = LLMTestCase(
        input=input,
        actual_output=chatbot.run(input),
        context=["The chicken wanted to cross the road."],
    )
    metric = HallucinationMetric(threshold=0.7)
    assert_test(test_case, metrics=[metric])


# Optionally log hyperparameters to pick the best hyperparameter for your LLM application
# using Confident AI. (run `deepeval login` in the CLI to login)
@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    # Return a dict to log additional hyperparameters.
    # You can also return an empty dict {} if there's no additional parameters to log
    return {
        "temperature": 1,
        "chunk size": 500
    }
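The pass rule for a test case is an all-or-nothing check over its metrics. A minimal sketch of that rule, assuming each metric has already produced a boolean success flag (all_metrics_pass and the results dict are hypothetical, not deepeval's API):

```python
from typing import Dict


def all_metrics_pass(metric_results: Dict[str, bool]) -> bool:
    """A test case passes only when every metric reports success."""
    return all(metric_results.values())


results = {"AnswerRelevancyMetric": True, "HallucinationMetric": False}

# One failing metric is enough to fail the whole test case.
passed = all_metrics_pass(results)
```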

There are TWO mandatory and ONE optional parameter when calling the assert_test() function:

test_case: an LLMTestCase
metrics: a list of metrics of type BaseMetric
[Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics. Defaults to True.

You can find the full documentation on deepeval test run, including how to use it with tracing, on the running LLM-evals page.

info

The run_async parameter overrides the async_mode property of all metrics being evaluated. The async_mode property, as you'll learn later in the metrics section, determines whether each metric can execute asynchronously.

To execute the test cases, run deepeval test run via the CLI, which uses deepeval's Pytest integration under the hood to execute these tests. You can also include an optional -n flag followed by a number (that determines the number of processes that will be used) to run tests in parallel.

deepeval test run test_assert_example.py -n 4

You can include the deepeval test run command as a step in a .yaml file in your CI/CD workflows to run pre-deployment checks on your LLM application.

Evaluate Test Cases in Bulk

Lastly, deepeval offers an evaluate function to evaluate multiple test cases at once, which is similar to assert_test but without the need for Pytest or the CLI.

# A hypothetical LLM application example
import chatbot
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"

test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    context=["The chicken wanted to cross the road."],
)

metric = HallucinationMetric(threshold=0.7)
evaluate([test_case], [metric])

There are TWO mandatory and SIX optional parameters when calling the evaluate() function:

test_cases: a list of LLMTestCases OR ConversationalTestCases, or an EvaluationDataset. You cannot evaluate LLMTestCase/MLLMTestCases and ConversationalTestCases in the same test run.
metrics: a list of metrics of type BaseMetric.
[Optional] hyperparameters: a dict of type dict[str, Union[str, int, float]]. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
[Optional] identifier: a string that allows you to better identify your test run on Confident AI.
[Optional] async_config: an instance of type AsyncConfig that allows you to customize the degree of concurrency during evaluation. Defaulted to the default AsyncConfig values.
[Optional] display_config: an instance of type DisplayConfig that allows you to customize what is displayed to the console during evaluation. Defaulted to the default DisplayConfig values.
[Optional] error_config: an instance of type ErrorConfig that allows you to customize how to handle errors during evaluation. Defaulted to the default ErrorConfig values.
[Optional] cache_config: an instance of type CacheConfig that allows you to customize the caching behavior during evaluation. Defaulted to the default CacheConfig values.

You can find the full documentation on evaluate(), including how to use it with tracing, on the running LLM-evals page.

DID YOU KNOW?

Similar to assert_test, evaluate allows you to log and view test results and the hyperparameters associated with each on Confident AI.

deepeval login

from deepeval import evaluate
...

evaluate(
    test_cases=[test_case],
    metrics=[metric],
    hyperparameters={"model": "gpt-4o", "prompt template": "..."}
)

For more examples of evaluate, visit the datasets section.

Labeling Test Cases for Confident AI

If you're using Confident AI, the optional name parameter allows you to provide a string identifier to label LLMTestCases and ConversationalTestCases for you to easily search and filter for on Confident AI. This is particularly useful if you're importing test cases from an external datasource.

from deepeval.test_case import LLMTestCase, ConversationalTestCase

test_case = LLMTestCase(name="my-external-unique-id", ...)
convo_test_case = ConversationalTestCase(name="my-external-unique-id", ...)
