
Quick Introduction

DeepEval is an open-source evaluation framework for LLMs. DeepEval makes it extremely easy to build and iterate on LLM applications, and was built with the following principles in mind:

  • Easily "unit test" LLM outputs in a similar way to Pytest.
  • Plug-and-use 30+ LLM-evaluated metrics, most with research backing.
  • Supports both end-to-end and component-level evaluation.
  • Evaluation for RAG, agents, chatbots, and virtually any use case.
  • Synthetic dataset generation with state-of-the-art evolution techniques.
  • Metrics are simple to customize and cover all use cases.
  • Red team and safety scan LLM applications for security vulnerabilities.

Additionally, DeepEval has a cloud platform Confident AI, which allows teams to use DeepEval to evaluate, regression test, red team, and monitor LLM applications on the cloud.


Installation

In a newly created virtual environment, run:

pip install -U deepeval

deepeval runs evaluations locally on your environment. To keep your testing reports in a centralized place on the cloud, use Confident AI, the native evaluation platform for DeepEval:

deepeval login
tip

Confident AI is free and allows you to keep all evaluation results on the cloud. Sign up here.

Create Your First Test Run

Create a test file to run your first end-to-end evaluation.

An LLM test case in deepeval represents a single unit of LLM app interaction, and contains mandatory fields such as the input and actual_output (the LLM-generated output), and optional ones like expected_output.

LLM Test Case
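
For reference, a bare-bones test case might look like the following sketch (the values are placeholders you would swap for your own app's data):

from deepeval.test_case import LLMTestCase

# input and actual_output are the mandatory fields; expected_output is optional
test_case = LLMTestCase(
    input="What does DeepEval do?",  # the user's input
    actual_output="DeepEval is an open-source framework for evaluating LLM apps.",  # what your LLM app returned
    expected_output="An open-source LLM evaluation framework.",  # optional ideal answer
)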

Run touch test_example.py in your terminal and paste in the following code:

test_example.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

def test_correctness():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="I have a persistent cough and fever. Should I be worried?",
        # Replace this with the actual output from your LLM application
        actual_output="A persistent cough and fever could be a viral infection or something more serious. See a doctor if symptoms worsen or don't improve in a few days.",
        expected_output="A persistent cough and fever could indicate a range of illnesses, from a mild viral infection to more serious conditions like pneumonia or COVID-19. You should seek medical attention if your symptoms worsen, persist for more than a few days, or are accompanied by difficulty breathing, chest pain, or other concerning signs."
    )
    assert_test(test_case, [correctness_metric])

Then, run deepeval test run from the root directory of your project to evaluate your LLM app end-to-end:

deepeval test run test_example.py

Congratulations! Your test case should have passed ✅ Let's break down what happened.

  • The variable input mimics a user input, and actual_output is a placeholder for what your application is supposed to output based on this input.
  • The variable expected_output represents the ideal answer for a given input, and GEval is a research-backed metric provided by deepeval for you to evaluate your LLM outputs on any custom criteria with human-like accuracy.
  • In this example, the metric criteria is the correctness of the actual_output based on the provided expected_output, but not all metrics require an expected_output.
  • All metric scores range from 0 to 1, and the threshold=0.5 threshold ultimately determines whether your test has passed or not (see the standalone sketch after this list).
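
If you prefer to debug a metric outside of Pytest (in a notebook or script, for example), most deepeval metrics can also be run standalone via measure(). Below is a minimal sketch reusing the same metric; the placeholder outputs are assumptions you would replace with real values:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5
)
test_case = LLMTestCase(
    input="I have a persistent cough and fever. Should I be worried?",
    actual_output="...",  # replace with your LLM app's output
    expected_output="..."  # replace with the ideal answer
)

# Score a single test case without pytest
correctness_metric.measure(test_case)
print(correctness_metric.score)   # float between 0 and 1
print(correctness_metric.reason)  # LLM-generated explanation for the score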

If you run more than one test run, you will be able to catch regressions by comparing test cases side-by-side. This is also made easier if you're using deepeval alongside Confident AI (see below for video demo).

info

Since almost all deepeval metrics including GEval are LLM-as-a-Judge metrics, you'll need to set your OPENAI_API_KEY as an env variable. You can also customize the model used for evals:

correctness_metric = GEval(..., model="o1")

DeepEval also integrates with these model providers: Ollama, Azure OpenAI, Anthropic, Gemini, etc. To use ANY custom LLM of your choice, check out this part of the docs.
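
As a rough illustration, a custom model is a thin wrapper class that deepeval can call for generation. The class and method names below follow the DeepEvalBaseLLM interface from the custom models docs, but the client and its complete() method are hypothetical stand-ins for whatever SDK you actually use:

from deepeval.models import DeepEvalBaseLLM

class MyCustomJudge(DeepEvalBaseLLM):
    def __init__(self, client):
        self.client = client  # any SDK client you already use (hypothetical)

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        # Call your model and return plain text
        return self.client.complete(prompt)  # hypothetical client method

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "my-custom-judge"

# correctness_metric = GEval(..., model=MyCustomJudge(client=my_client))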

Save Results

It is recommended that you manage your evaluation suite on Confident AI, the deepeval platform.

Confident AI is the deepeval cloud, and helps you build the best LLM evals pipeline. Run deepeval view to view the test run you just created on the platform:

deepeval view

The deepeval view command requires that the test run that you ran above has been successfully cached locally. If something errors, simply run a new test run after logging in with deepeval login:

deepeval login

After you've pasted in your API key, Confident AI will generate testing reports and automate regression testing whenever you run a test run to evaluate your LLM application inside any environment, at any scale, anywhere.

Watch Full Guide on Confident AI

Once you've run more than one test run, you'll be able to use the regression testing page shown near the end of the video. Green rows indicate that your LLM has shown improvement on specific test cases, whereas red rows highlight areas of regression.

Test Runs With LLM Tracing

While end-to-end evals treat your LLM app as a black-box, you can also evaluate individual components within your LLM app through LLM tracing. This is the recommended way to evaluate AI agents.

LLM Trace

First paste in the following code:

main.py
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric

# 1. Decorate your app
@observe()
def llm_app(input: str):
    # 2. Decorate components with metrics you wish to evaluate or debug
    @observe(metrics=[AnswerRelevancyMetric()])
    def inner_component():
        # 3. Create test case at runtime
        update_current_span(test_case=LLMTestCase(input="Why is the blue sky?", actual_output="You mean why is the sky blue?"))

    return inner_component()

# 4. Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="Test input")])

# 5. Loop through dataset
for golden in dataset.evals_iterator():
    # 6. Call LLM app
    llm_app(golden.input)

Then run python main.py to run a component-level eval:

python main.py

🎉 Congratulations! Your test case should have passed again ✅ Let's break down what happened.

  • The @observe decorator tells deepeval where each component is and creates an LLM trace at execution time.
  • Any metrics supplied to @observe allow deepeval to evaluate that component based on the LLMTestCase you create.
  • In this example, AnswerRelevancyMetric() was used to evaluate inner_component().
  • The dataset specifies the goldens that are used to invoke your llm_app during evaluation, which happens in a simple for loop.

Once the for loop has ended, deepeval will aggregate all metrics and test cases in each component and run evals across them all, before generating the final testing report.

info

When you do LLM tracing using deepeval, you can automatically run evals on traces, spans, and threads (conversations) in production. Simply get an API key from Confident AI and set it in the CLI:

CONFIDENT_API_KEY="confident_us..."

deepeval's LLM tracing implementation is non-intrusive, meaning it will not affect any part of your code.

Evals on traces are end-to-end evaluations, where a single LLM interaction is being evaluated.

Trace-Level Evals in Production

Continue With Your Use Case

Tell us what you're building for more tailored onboarding:

*All quickstarts include a guide on how to bring evals to production near the end

Two Modes of LLM Evals

deepeval offers two main modes of evaluation:

  • End-to-end evaluation, which treats your LLM app as a black-box and evaluates the final output for a given input.
  • Component-level evaluation, which uses LLM tracing to evaluate individual components within your LLM app.

Essential Resources

These are things you should definitely learn about:

Full Example

You can find the full example here on our GitHub.