OpenAI Agents
OpenAI Agents is OpenAI's Python SDK for building agents that reason, call tools, and hand off to other agents.
The deepeval integration plugs into the agents SDK's tracing pipeline as a TracingProcessor. Every Runner.run(...), agent step, LLM call, and tool call becomes a span you can inspect โ without rewriting your agent code.
deepeval's OpenAI Agents integration enables you to:
- Trace every
Runner.run(...)โ each agent run produces a trace, and each LLM, tool, and sub-agent call becomes a component span. - Attach metrics directly to
Agentandfunction_toolwithagent_metrics,llm_metrics, andmetrics=on tools. - Run evals from scripts or CI/CD โ same agent, different surfaces.
- Compose with
@observeandwith trace(...)to evaluate larger flows that wrap one or more agent runs.
Getting Started
Installation
pip install -U deepeval openai-agentsThe integration registers DeepEvalTracingProcessor against the agents SDK's tracing pipeline, then provides Agent and function_tool shims that accept deepeval metrics directly.
Instrument and evaluate
Register the processor once at startup, then use deepeval.openai_agents.Agent and function_tool in place of the SDK's classes. Attach metrics to the agent or to specific tools.
from agents import Runner, add_trace_processor
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
add_trace_processor(DeepEvalTracingProcessor())
@function_tool
def get_weather(city: str) -> str:
"""Return the weather in a city."""
return f"It's always sunny in {city}!"
agent = Agent(
name="weather_agent",
instructions="Answer weather questions concisely.",
tools=[get_weather],
agent_metrics=[TaskCompletionMetric()],
)
# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather in Paris?")])
for golden in dataset.evals_iterator():
Runner.run_sync(agent, golden.input)Done โ
. You've run your first eval with full traceability into OpenAI Agents via deepeval.
What gets traced
Each Runner.run(...) call produces a trace โ the end-to-end unit your user observes. Inside that trace are component spans for every step the agent took:
- Agent spans โ one per
Agentinvocation, including handoffs to other agents. - LLM spans โ model calls (Responses API and Chat Completions).
- Tool spans โ
function_tool,MCPListTools, and other agents-SDK tool calls.
Trace โ what the user observes
โโโ Agent: weather_agent โ one Runner.run(...) call
โโโ LLM: gpt-4o โ component span: model plans
โโโ Tool: get_weather โ component span: tool input + output
โโโ LLM: gpt-4o โ component span: final answerThe trace and its component spans are independently evaluable.
Running evals
There are two surfaces for running evals against an OpenAI Agents app. Pick by where you want results to surface โ your terminal during development, or your CI pipeline as a pass/fail gate.
In CI/CD (pytest)
Use the deepeval pytest integration. Each parametrized test invocation becomes one agent run; failing metrics fail the test, which fails the build.
import pytest
from agents import Runner, add_trace_processor
from deepeval import assert_test
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
add_trace_processor(DeepEvalTracingProcessor())
@function_tool
def get_weather(city: str) -> str:
"""Return the weather in a city."""
return f"It's always sunny in {city}!"
agent = Agent(
name="weather_agent",
instructions="Answer weather questions concisely.",
tools=[get_weather],
)
dataset = EvaluationDataset(goldens=[
Golden(input="What's the weather in Paris?"),
Golden(input="What's the weather in London?"),
])
@pytest.mark.parametrize("golden", dataset.goldens)
def test_openai_agents_app(golden: Golden):
Runner.run_sync(agent, golden.input)
assert_test(golden=golden, metrics=[TaskCompletionMetric()])Run it with:
deepeval test run test_openai_agents_app.pyIn a script
Use EvaluationDataset + evals_iterator(...). Each Golden becomes one agent run; metrics score the resulting trace.
import asyncio
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
...
dataset = EvaluationDataset(goldens=[
Golden(input="What's the weather in Paris?"),
Golden(input="What's the weather in London?"),
])
for golden in dataset.evals_iterator(
async_config=AsyncConfig(run_async=True),
metrics=[TaskCompletionMetric()],
):
task = asyncio.create_task(Runner.run(agent, golden.input))
dataset.evaluate(task)Sync (Runner.run_sync) and async (Runner.run) execution both work; pick whichever matches your code.
Applying metrics to components
The metrics=[...] you pass to evals_iterator evaluates the trace. To evaluate a component โ a specific agent, LLM call, or tool โ attach metrics directly to the agent or tool.
Agent spans
Use agent_metrics=[...] on deepeval.openai_agents.Agent. The metric is applied to that agent's span on every run, including when it's invoked as a sub-agent through a handoff.
from deepeval.openai_agents import Agent
from deepeval.metrics import TaskCompletionMetric
...
agent = Agent(
name="weather_agent",
instructions="Answer weather questions concisely.",
tools=[get_weather],
agent_metrics=[TaskCompletionMetric()],
)LLM calls
Use llm_metrics=[...] on Agent. The metric is applied to the LLM span produced for that agent's model calls. Useful when you want to score the model's reasoning step in isolation.
from deepeval.openai_agents import Agent
from deepeval.metrics import AnswerRelevancyMetric
...
agent = Agent(
name="weather_agent",
instructions="Answer weather questions concisely.",
tools=[get_weather],
llm_metrics=[AnswerRelevancyMetric()],
)Tool calls
Pass metrics=[...] to function_tool to evaluate a specific tool's behavior. Useful for tools that return non-deterministic content (e.g. retrieval, summarization tools).
from deepeval.openai_agents import function_tool
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
@function_tool(metrics=[GEval(
name="Helpful Weather Lookup",
criteria="The output must be a clear weather summary for the requested city.",
evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)])
def get_weather(city: str) -> str:
"""Return the weather in a city."""
return f"It's always sunny in {city}!"For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.
Customizing trace and span data
The integration captures inputs, outputs, model names, and tool calls automatically. For anything dynamic, the right API depends on where your code runs.
- Use
with trace(...)for trace-level fields (name,tags,metadata,thread_id,user_id). - Use
Agent/function_toolkwargs (agent_metrics,llm_metrics,metrics=,confident_prompt) for component-level defaults. - Use
update_current_trace(...)andupdate_current_span(...)from inside a tool body to mutate fields the framework can't see.
from deepeval.openai_agents import function_tool
from deepeval.tracing import update_current_trace, update_current_span
@function_tool
def get_weather(city: str) -> str:
"""Return the weather in a city."""
update_current_trace(metadata={"city": city})
update_current_span(metadata={"source": "static-table"})
return f"It's always sunny in {city}!"Advanced patterns
The primitives above โ Agent, function_tool, add_trace_processor, @observe, with trace(...) โ compose around one boundary: the agents SDK owns the run lifecycle, and your code attaches metrics where they make sense.
Evaluate a sub-agent through handoff
When a parent agent hands off to a sub-agent, the sub-agent's span runs as a child of the parent's. Attaching agent_metrics to the sub-agent scores that hand-off step in isolation.
from deepeval.openai_agents import Agent
from deepeval.metrics import TaskCompletionMetric, AnswerRelevancyMetric
...
triage_agent = Agent(
name="triage",
instructions="Route the question to the right specialist.",
handoffs=[
Agent(
name="weather_specialist",
instructions="Answer weather questions.",
tools=[get_weather],
agent_metrics=[TaskCompletionMetric()],
),
],
agent_metrics=[AnswerRelevancyMetric()],
)No trace-level metrics required
Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the metrics are already attached to the triage and specialist agents, so CI/CD and scripts only need to run the agent.
This is how you'd run it:
import pytest
from agents import Runner
from deepeval import assert_test
...
@pytest.mark.parametrize("golden", dataset.goldens)
def test_triage_agent(golden: Golden):
Runner.run_sync(triage_agent, golden.input)
assert_test(golden=golden)deepeval test run test_openai_agents_app.pyimport asyncio
...
for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
task = asyncio.create_task(Runner.run(triage_agent, golden.input))
dataset.evaluate(task)Wrap an agent run in @observe
When the agent run is part of a larger operation, decorate the outer function with @observe. The agents-SDK spans nest under your observed span automatically.
from agents import Runner
from deepeval.tracing import observe
...
@observe(name="respond_to_user")
async def respond_to_user(prompt: str) -> str:
result = await Runner.run(agent, prompt)
return result.final_output.strip()Bind a Confident AI prompt to an agent
Pass confident_prompt= to attach a Confident AI Prompt to every LLM span produced by that agent. Prompt analytics (commit hash, version, label) flow with the trace.
from deepeval.openai_agents import Agent
from deepeval.prompt import Prompt
prompt = Prompt(alias="weather-system")
prompt.pull(version="latest")
agent = Agent(
name="weather_agent",
instructions=prompt.interpolate(),
tools=[get_weather],
confident_prompt=prompt,
)API reference
deepeval.openai_agents.Agent(...) accepts the SDK's standard Agent kwargs plus the following deepeval-specific ones:
| Kwarg | Type | Description |
|---|---|---|
agent_metrics | list | Metrics applied to this agent's span on every run. |
llm_metrics | list | Metrics applied to LLM spans produced by this agent's model calls. |
confident_prompt | Prompt | Confident AI prompt object; captured on every LLM span produced by this agent. |
function_tool(..., metrics=[...]) accepts the SDK's standard kwargs plus metrics, applied to that tool's span on every call.
For runtime helpers (update_current_trace, update_current_span) and the test-decorator surface (@observe, @assert_test, with trace(...)), see the tracing reference.
FAQs
Can I evaluate a sub-agent reached through a handoff?
agent_metrics=[...] to the sub-agent's deepeval.openai_agents.Agent. The metric runs on that agent's span every time it's invoked โ including when a parent agent hands off to it โ so the hand-off step is scored on its own.Can I make these agent evals a CI/CD gate?
DeepEvalTracingProcessor() once, build your agent with the shim, then run Runner.run(...) in a parametrized pytest test and assert with assert_test(...) under deepeval test run.Where can I view these agent traces beyond the terminal?
deepeval login and Confident AI renders each Runner.run(...) as a trace โ agent, LLM, and tool spans, including handoffs โ with their scores in a shared cloud UI. It's optional.Can I monitor an OpenAI Agents app in production?
DeepEvalTracingProcessor registered and a Confident AI login, live agent runs stream in for online evals on real traffic, and confident_prompt ties prompt versions to each trace.