LangChain
LangChain is an open-source framework for building LLM applications with models, prompts, tools, retrievers, and agents (via create_agent).
The deepeval integration traces LangChain runs through a CallbackHandler that you pass into LangChain's config. Every agent run, model call, tool call, and retriever call becomes a span you can inspect, without rewriting your LangChain app.
deepeval's LangChain integration enables you to:
- Trace any LangChain run: pass CallbackHandler(...) through config={"callbacks": [...]} per call.
- Evaluate traces or individual components with deepeval metrics.
- Run evals from scripts or CI/CD: same callback, different surfaces.
- Customize trace and span data through callback kwargs, LangChain metadata, and deepeval's tool decorator.
Getting Started
Installation
pip install -U deepeval langchain langchain-openai
LangChain is instrumented per-call: you decide which runs are traced by passing CallbackHandler(...) into LangChain's runtime config.
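To make the per-call idea concrete, here is a minimal stdlib-only sketch. It is not deepeval's or LangChain's implementation: RecordingHandler, invoke, and double are hypothetical names invented for illustration. It only shows the general pattern in which a call is observed if, and only if, a handler is supplied in that call's config.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class RecordingHandler:
    """Hypothetical stand-in for a tracing callback; stores events in memory."""
    events: list = field(default_factory=list)

    def on_start(self, name: str, inputs: Any) -> None:
        self.events.append(("start", name, inputs))

    def on_end(self, name: str, output: Any) -> None:
        self.events.append(("end", name, output))


def invoke(step: Callable[[Any], Any], inputs: Any, config: Optional[dict] = None) -> Any:
    """Run `step`, notifying only the callbacks passed for this particular call."""
    callbacks = (config or {}).get("callbacks", [])
    for cb in callbacks:
        cb.on_start(step.__name__, inputs)
    output = step(inputs)
    for cb in callbacks:
        cb.on_end(step.__name__, output)
    return output


def double(x: int) -> int:
    return x * 2


handler = RecordingHandler()
traced = invoke(double, 4, config={"callbacks": [handler]})  # recorded by handler
untraced = invoke(double, 5)                                  # no handler, nothing recorded
```

Because the handler travels with the call rather than with the application object, the same agent can be traced in one code path and left untraced in another.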
Instrument and evaluate
Create a CallbackHandler and pass it to the agent's invoke method.
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[multiply],
    system_prompt="Be concise.",
)

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What is 8 multiplied by 6?")])

# The `TaskCompletionMetric` is passed into the LangChain callback.
for golden in dataset.evals_iterator():
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Done. You've run your first eval with full traceability into LangChain via deepeval.
What gets traced
Each LangChain call that receives a CallbackHandler produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for each callback LangChain emits:
- Agent spans: create_agent(...) runs and any nested runnable steps.
- LLM spans: chat model and completion calls.
- Tool spans: tool calls and function executions.
- Retriever spans: retriever calls, when your app uses retrieval.

Trace                      ← what the user observes
└── Agent: math_agent      ← one create_agent invoke(...) call
    ├── LLM: gpt-4o-mini   ← component span: model chooses a tool
    ├── Tool: multiply     ← component span: tool input + output
    └── LLM: gpt-4o-mini   ← component span: final answer

The trace and its component spans are independently evaluable.
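One way to picture the trace above is as a small tree of span records. The sketch below is an illustrative stdlib-only model, not deepeval's data structures: Span, render, and iter_spans are hypothetical names. It shows why individual components can be selected and scored separately from the whole trace.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    kind: str  # "Agent", "LLM", "Tool", or "Retriever"
    name: str
    children: List["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, parents before children."""
    lines = ["  " * depth + f"{span.kind}: {span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


def iter_spans(span: Span) -> Iterator[Span]:
    """Walk the tree depth-first so components can be filtered by kind."""
    yield span
    for child in span.children:
        yield from iter_spans(child)


# The same shape as the math_agent run described above.
trace = Span("Agent", "math_agent", [
    Span("LLM", "gpt-4o-mini"),
    Span("Tool", "multiply"),
    Span("LLM", "gpt-4o-mini"),
])

llm_spans = [s for s in iter_spans(trace) if s.kind == "LLM"]
print("\n".join(render(trace)))
```

Filtering by kind, as llm_spans does here, is the toy equivalent of attaching a metric to one component instead of to the whole run.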
Running evals
There are two surfaces for running evals against a LangChain app. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.
In CI/CD (pytest)
Use the deepeval pytest integration. Each parametrized test invocation becomes one LangChain run; failing metrics fail the test, which fails the build.
import pytest
from langchain.agents import create_agent
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_agent(golden: Golden):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Run it with:
deepeval test run test_langchain_agent.py
In a script
Use EvaluationDataset + evals_iterator(...). Each Golden becomes one LangChain run; metrics score the resulting trace through the callback.
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

for golden in dataset.evals_iterator():
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Applying metrics to components
Passing metrics=[...] to CallbackHandler evaluates the overall LangChain run. To evaluate a component instead, attach metrics where LangChain creates that component.
Agent spans (sub-agents)
Wrap the invocation in with next_agent_span(metrics=[...]):. The CallbackHandler drains the staged metric onto the first agent span it opens inside the with block. This is useful for scoring a sub-agent (e.g. an agent invoked as a tool, or a nested create_agent run) in isolation.
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span
...
agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")
for golden in dataset.evals_iterator():
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        agent.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

The same one-shot semantic as next_llm_span applies: only the first agent span in the run picks up the staged metric.
LLM calls
Wrap the invocation in with next_llm_span(metrics=[...]):. The CallbackHandler drains the staged metric onto the first LLM span it opens inside the with block; later LLM calls in the same run get nothing. This is the same one-shot semantic used by next_*_span in the Pydantic AI / Strands / AgentCore / Google ADK integrations.
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span
...
agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")
for golden in dataset.evals_iterator():
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        agent.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

For deterministic tool calls, use tool spans for traceability, inputs, outputs, and metadata. Avoid attaching metrics directly to tool spans.
Retriever calls
Wrap the invocation in with next_retriever_span(...) to stage a metric (or a Confident AI metric_collection) on the first retriever span LangChain opens inside the with block.
from deepeval.integrations.langchain import CallbackHandler
from deepeval.tracing import next_retriever_span
...
for golden in dataset.evals_iterator():
    with next_retriever_span(metric_collection="retriever_v1"):
        chain.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

next_retriever_span accepts the same metrics=[...] / metric_collection=... kwargs as next_llm_span. The same one-shot semantic applies: only the first retriever span in the run picks up the staged config.
Customizing trace and span data
LangChain is instrumented per-call through callbacks, so customization happens at the callback or span-staging boundary.
- Use CallbackHandler(...) kwargs for trace-level defaults like name, tags, metadata, thread_id, and user_id.
- Use next_agent_span(...) / next_llm_span(...) / next_retriever_span(...) / next_tool_span(...) to stage component-level fields (metrics, metric collections, test cases, custom span metadata) onto the next span the callback opens.
- Use tool spans for deterministic traceability, inputs, outputs, and metadata.
callback = CallbackHandler(
    name="math-agent",
    tags=["langchain", "math"],
    metadata={"team": "support"},
    user_id="user-123",
)
agent.invoke(
    {"messages": [{"role": "user", "content": "What is 8 multiplied by 6?"}]},
    config={"callbacks": [callback]},
)

Advanced patterns
The primitives above (CallbackHandler(...), next_*_span(...), and deepeval's tool decorator) compose around one boundary: LangChain owns the callback lifecycle, and your code chooses where to stage component config for the next span the callback opens.
Evaluate subagents with next_*_span
next_*_span(metrics=[...]) stages a metric for the next matching span the CallbackHandler opens. Use this when you want to evaluate a subagent or model step instead of the full run. Pick the helper that matches the span you want to score: next_agent_span(...) or next_llm_span(...).
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span
...
agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")
def run_agent(prompt: str):
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        return agent.invoke(
            {"messages": [{"role": "user", "content": prompt}]},
            config={"callbacks": [CallbackHandler()]},
        )

No trace-level metrics required
Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the TaskCompletionMetric is staged for the agent span, so CI/CD and scripts only need to run the agent inside the staging block.
This is how you'd run it:
import pytest
from deepeval import assert_test
...
@pytest.mark.parametrize("golden", dataset.goldens)
def test_component_metrics(golden: Golden):
    run_agent(golden.input)
    assert_test(golden=golden)

Run it with:
deepeval test run test_langchain_agent.py

In a script:
...
for golden in dataset.evals_iterator():
    run_agent(golden.input)

Wrap a LangChain run in @observe
When the LangChain call is part of a larger operation, decorate the outer function with @observe. LangChain spans nest under your observed span when the callback runs inside it.
from deepeval.tracing import observe
...
@observe(name="respond_to_user")
def respond_to_user(prompt: str) -> str:
    result = agent.invoke(
        {"messages": [{"role": "user", "content": prompt}]},
        config={"callbacks": [CallbackHandler()]},
    )
    return result["messages"][-1].content

API reference
CallbackHandler(...) accepts the following trace-level kwargs. Each one is a default for runs that use that callback.
| Kwarg | Type | Description |
|---|---|---|
| name | str | Default trace name. |
| tags | list[str] | Tags applied to traces produced by this callback. |
| metadata | dict | Trace metadata applied when the callback starts a trace. |
| thread_id | str | Groups related runs into a single trace thread. |
| user_id | str | Actor identifier for the trace. |
| metrics | list | Metrics applied to the LangChain run. |
| metric_collection | str | Metric collection applied to the LangChain run. |
| test_case_id | str | Optional test case identifier. |
| turn_id | str | Optional turn identifier for conversational traces. |
For native tracing helpers (@observe, with trace(...), update_current_trace, update_current_span) see the tracing reference.
FAQs
Can I evaluate a sub-agent inside my LangChain agent run?
Yes. Use with next_agent_span(metrics=[...]) right before agent.invoke(...), and the CallbackHandler drains it onto that sub-agent's span, scoring the sub-agent in isolation without touching the parent. It's one-shot per run, so to score every step you drive the loop yourself or use trace-level metrics on CallbackHandler(metrics=[...]).
Can I gate CI/CD on my LangChain agent's metrics?
Yes. Pass CallbackHandler() into the agent's config inside a parametrized pytest test, then call assert_test(golden=golden, metrics=[...]) and run deepeval test run so a failing metric fails the build.
Can I see these LangChain traces in a cloud UI?
Yes. After deepeval login, Confident AI renders every agent, LLM, tool, and retriever span produced by the CallbackHandler in a shared dashboard, with no code changes.
Can I monitor a LangChain app in production?
Yes. Keep CallbackHandler in your production calls and set thread_id / user_id for grouping; when logged into Confident AI those live traces support online evals on real traffic.