
Pydantic AI

Pydantic AI is a Python framework for building reliable, production-grade applications with Generative AI, providing type safety and validation for agent outputs and LLM interactions.

End-to-End Evals

deepeval allows you to evaluate Pydantic AI agents in under a minute.
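
Install both packages from PyPI first if you haven't already (assuming pip as your package manager):

pip install -U deepeval pydantic-ai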

Configure Pydantic AI

Pass agent_metrics to the ConfidentInstrumentationSettings constructor.

main.py
from pydantic_ai import Agent

from deepeval.integrations.pydantic_ai import ConfidentInstrumentationSettings
from deepeval.metrics import AnswerRelevancyMetric

agent = Agent(
    "openai:gpt-5",
    instructions="You are a helpful assistant.",
    # is_test_mode=True is used for end-to-end evals; agent_metrics lists the
    # deepeval metrics each evaluation trace is scored against.
    instrument=ConfidentInstrumentationSettings(
        is_test_mode=True,
        agent_metrics=[AnswerRelevancyMetric()],
    ),
)

Run Evaluations

Create an EvaluationDataset and invoke your Pydantic AI application for each golden within the evals_iterator() loop to run end-to-end evaluations.

main.py
import asyncio

from deepeval.dataset import EvaluationDataset, Golden


# Async entrypoint that runs the instrumented agent defined above.
# result.output assumes a recent Pydantic AI release (older versions expose result.data).
async def run_agent(input: str) -> str:
    result = await agent.run(input)
    return result.output


dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the weather in Paris?"),
        Golden(input="What's the weather in London?"),
    ]
)

for golden in dataset.evals_iterator():
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

βœ… Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.

Evals in Production

To run online evaluations in production, replace agent_metrics with agent_metric_collection, the name of a metric collection you've created on Confident AI, and push your Pydantic AI agent to production.

from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import ConfidentInstrumentationSettings

agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Be concise, reply with one sentence.",
    instrument=ConfidentInstrumentationSettings(
        agent_metric_collection="test_collection_1",
    )
)

result = agent.run_sync(
    "What are LLMs?"
)
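
Each call to the instrumented agent now produces a trace that is evaluated online against your metric collection on Confident AI. For reference, a minimal sketch of the same agent invoked asynchronously (result.output assumes a recent Pydantic AI release; older versions expose result.data):

import asyncio


async def main():
    # Async variant of the same call; traces are produced the same way
    result = await agent.run("What are LLMs?")
    print(result.output)


asyncio.run(main())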
