
Component-Level

Component-level evaluation assesses individual internal components of your LLM application, such as retrievers, tool calls, LLM generations, or even agents interacting with other agents, rather than treating the LLM app as a black box.

note

In end-to-end evaluation, your LLM application is treated as a black-box and evaluation is encapsulated by the overall system inputs and outputs in the form of an LLMTestCase.

If your application has nested components or a structure that a simple LLMTestCase can't easily handle, component-level evaluation allows you to apply different metrics to different components in your LLM application.


info

You would still be creating LLMTestCases, but this time for individual components at runtime instead of the overall system.

Common use cases that are suitable for component-level evaluation include (non-exhaustive):

  • Chatbots/conversational agents
  • Autonomous agents
  • Text-to-SQL
  • Code generation
  • etc.

The trend you'll notice is that use cases with more complex architectures are better suited for component-level evaluation.

Prerequisites

Select metrics

Unlike end-to-end evaluation, you will need to select a set of appropriate metrics for each component you want to evaluate, and ensure the LLMTestCases that you create in that component contain all the necessary parameters.

You should first read the metrics section to understand which metrics are suitable for which components, or alternatively join our discord to ask us directly.
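
For example, a retrieval component might be paired with contextual metrics while a generation component is paired with answer-level metrics. The sketch below is illustrative only (the metric choices are assumptions, not recommendations), and each decorated component still needs to build an LLMTestCase at runtime containing the parameters its metrics require:

from typing import List

from deepeval.tracing import observe
from deepeval.metrics import ContextualRelevancyMetric, AnswerRelevancyMetric

# Illustrative mapping of metrics to components:
@observe(metrics=[ContextualRelevancyMetric()])  # judges retrieved chunks against the input
def retriever(input: str) -> List[str]:
    ...

@observe(metrics=[AnswerRelevancyMetric()])  # judges the final answer against the input
def generator(input: str, retrieved_chunks: List[str]) -> str:
    ...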

info

In component-level evaluation, there are more metrics to select as there are more individual components to evaluate.

Setup LLM application

Unlike end-to-end evaluation, where setting up your LLM application requires rewriting some parts of your code to return certain variables for testing, component-level testing is as simple as adding an @observe decorator to apply different metrics at different component scopes.

YOU MUST KNOW

The process of adding the @observe decorator to your app is known as tracing, which we will learn how to set up fully in the next section.

If you're worried about how tracing via @observe can affect your application, click here.

An @observe decorator creates a span, and the overall collection of spans is called a trace. We'll trace this example LLM application to demonstrate how to run component-level evaluations using deepeval in two lines of code:

somewhere.py
from typing import List
import openai

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def your_llm_app(input: str):
    def retriever(input: str):
        return ["Hardcoded text chunks from your vector database"]

    @observe(metrics=[AnswerRelevancyMetric()])
    def generator(input: str, retrieved_chunks: List[str]):
        res = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Use the provided context to answer the question."},
                {"role": "user", "content": "\n\n".join(retrieved_chunks) + "\n\nQuestion: " + input}
            ]
        ).choices[0].message["content"]

        # Create test case at runtime
        update_current_span(test_case=LLMTestCase(input=input, actual_output=res))

        return res

    return generator(input, retriever(input))


print(your_llm_app("How are you?"))

If you compare this implementation to the previous one in end-to-end evaluation, you'll notice that tracing with deepeval's @observe means we don't have to return variables such as the retrieval_context in awkward places just to create end-to-end LLMTestCases.

caution

At this point, you can either pause and learn how to setup LLM tracing in the next section before continuing, or finish this section before moving onto tracing.

Run Component-Level Evals

Once your LLM application is decorated with @observe, you'll be able to provide it as an observed_callback and invoke it with Goldens to create a list of test cases within your @observe decorated spans. These test cases are then evaluated using the respective metrics to create a test run.

You can run component-level LLM evaluations in either:

  • CI/CD pipelines using deepeval test run, or
  • Python scripts using the evaluate() function

Both give you exactly the same functionality and integrate 100% with Confident AI for shareable testing reports on the cloud.

Use evaluate() in Python scripts

To use evaluate() for component-level testing, supply a list of Goldens instead of LLMTestCases, and an observed_callback which is the @observe decorated LLM application you wish to run evals on.

main.py
from somewhere import your_llm_app # Replace with your LLM app

from deepeval.dataset import Golden
from deepeval import evaluate

# Goldens from your dataset
goldens = [Golden(input="...")]

# Evaluate with `observed_callback`
evaluate(goldens=goldens, observed_callback=your_llm_app)

There are TWO mandatory and FIVE optional parameters when calling the evaluate() function for COMPONENT-LEVEL evaluation (a usage sketch follows this list):

  • goldens: a list of Goldens that you wish to invoke your observed_callback with.
  • observed_callback: a function callback that is your @observe decorated LLM application. There must be AT LEAST ONE metric defined within one of the @observe decorated components in your LLM application.
  • [Optional] identifier: a string that allows you to better identify your test run on Confident AI.
  • [Optional] async_config: an instance of type AsyncConfig that allows you to customize the degree of concurrency during evaluation. Defaulted to the default AsyncConfig values.
  • [Optional] display_config: an instance of type DisplayConfig that allows you to customize what is displayed to the console during evaluation. Defaulted to the default DisplayConfig values.
  • [Optional] error_config: an instance of type ErrorConfig that allows you to customize how to handle errors during evaluation. Defaulted to the default ErrorConfig values.
  • [Optional] cache_config: an instance of type CacheConfig that allows you to customize the caching behavior during evaluation. Defaulted to the default CacheConfig values.
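
For illustration, here is a hedged sketch of evaluate() with some of the optional parameters filled in. The import path for the config classes is an assumption and may differ across deepeval versions, so verify it against your installed version:

from deepeval import evaluate
from deepeval.evaluate.configs import AsyncConfig, DisplayConfig  # assumed import path; check your version

evaluate(
    goldens=goldens,
    observed_callback=your_llm_app,
    identifier="component-level-run-1",  # shows up on your Confident AI test run
    async_config=AsyncConfig(),          # default concurrency settings
    display_config=DisplayConfig(),      # default console output settings
)
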
tip

You'll notice that unlike end-to-end evaluation, there is no declaration of metrics because those are defined in @observe via the metrics parameter, and there is no creation of LLMTestCases because it is handled at runtime by update_current_span in your LLM app.

Use deepeval test run in CI/CD pipelines

deepeval allows you to run evaluations as if you're using Pytest via our Pytest integration.

test_llm_app.py
from somewhere import your_llm_app # Replace with your LLM app
import pytest
from deepeval.dataset import Golden
from deepeval import assert_test

# Goldens from your dataset
goldens = [Golden(input="...")]

# Loop through goldens using pytest
@pytest.mark.parametrize("golden", goldens)
def test_llm_app(golden: Golden):
    assert_test(golden=golden, observed_callback=your_llm_app)

info

Similar to the evaluate() function, assert_test() for component-level evaluation does not need:

  • Declaration of metrics because those are defined at the span level in the metrics parameter.
  • Creation of LLMTestCases because it is handled at runtime by update_current_span in your LLM app.

Finally, don't forget to run the test file in the CLI:

deepeval test run test_llm_app.py

There are TWO mandatory and ONE optional parameter when calling the assert_test() function for COMPONENT-LEVEL evaluation (a brief sketch follows this list):

  • golden: the Golden that you wish to invoke your observed_callback with.
  • observed_callback: a function callback that is your @observe decorated LLM application. There must be AT LEAST ONE metric defined within one of the @observe decorated components in your LLM application.
  • [Optional] run_async: a boolean which when set to True, enables concurrent evaluation of all metrics in @observe. Defaulted to True.
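
For example, to force sequential metric evaluation in a pytest run, a minimal sketch using only the parameters documented above:

@pytest.mark.parametrize("golden", goldens)
def test_llm_app_sync(golden: Golden):
    # run_async=False disables concurrent evaluation of the metrics in @observe
    assert_test(golden=golden, observed_callback=your_llm_app, run_async=False)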

Click here to learn about different optional flags available to deepeval test run to customize asynchronous behaviors, error handling, etc.