LangChain

Native Instrumentation
Evals in CI/CD
Evals with Traceability

LangChain is an open-source framework for building LLM applications with models, prompts, tools, retrievers, and agents (via create_agent).

The deepeval integration traces LangChain runs through a CallbackHandler that you pass into LangChain's config. Every agent run, model call, tool call, and retriever call becomes a span you can inspect, without rewriting your LangChain app.

deepeval's LangChain integration enables you to:

  • Trace any LangChain run: pass CallbackHandler(...) through config={"callbacks": [...]} per call.
  • Evaluate traces or individual components with deepeval metrics.
  • Run evals from scripts or CI/CD: same callback, different surfaces.
  • Customize trace and span data through callback kwargs, LangChain metadata, and deepeval's tool decorator.

Getting Started

Installation

pip install -U deepeval langchain langchain-openai

LangChain is instrumented per-call: you decide which runs are traced by passing CallbackHandler(...) into LangChain's runtime config.
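
To make "per-call" concrete, here is a minimal, library-free sketch of the pattern. RecordingHandler and invoke are hypothetical stand-ins, not LangChain or deepeval APIs; the only thing borrowed from the text above is the config={"callbacks": [...]} shape. A call that receives the handler is recorded, and a call without it leaves no events behind.

```python
from typing import Any, Callable, Optional


class RecordingHandler:
    """Hypothetical stand-in for a tracing callback: it only records events."""

    def __init__(self) -> None:
        self.events = []

    def on_start(self, payload: Any) -> None:
        self.events.append(("start", payload))

    def on_end(self, payload: Any) -> None:
        self.events.append(("end", payload))


def invoke(fn: Callable[[Any], Any], payload: Any, config: Optional[dict] = None) -> Any:
    """Run fn and notify only the callbacks passed for this particular call."""
    callbacks = (config or {}).get("callbacks", [])
    for cb in callbacks:
        cb.on_start(payload)
    result = fn(payload)
    for cb in callbacks:
        cb.on_end(result)
    return result


def double(x: int) -> int:
    return x * 2


handler = RecordingHandler()
traced = invoke(double, 21, config={"callbacks": [handler]})  # this call is recorded
untraced = invoke(double, 5)                                   # no callbacks, no events
```

Because the handler travels with the call rather than with the application, tracing can be switched on for a single request without touching the rest of the code.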

Instrument and evaluate

Create a CallbackHandler and pass it to the agent's invoke method.

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[multiply],
    system_prompt="Be concise.",
)

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What is 8 multiplied by 6?")])

# The `TaskCompletionMetric` is passed into the LangChain callback.
for golden in dataset.evals_iterator():
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Done ✅. You've run your first eval with full traceability into LangChain via deepeval.

What gets traced

Each LangChain call that receives a CallbackHandler produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for each callback LangChain emits:

  • Agent spans: create_agent(...) runs and any nested runnable steps.
  • LLM spans: chat model and completion calls.
  • Tool spans: tool calls and function executions.
  • Retriever spans: retriever calls, when your app uses retrieval.

Trace                           ← what the user observes
└── Agent: math_agent           ← one create_agent invoke(...) call
    ├── LLM: gpt-4o-mini        ← component span: model chooses a tool
    ├── Tool: multiply          ← component span: tool input + output
    └── LLM: gpt-4o-mini        ← component span: final answer

The trace and its component spans are independently evaluable.
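
One way to picture this structure is as a small tree. The sketch below is a toy model only: the Span class and its fields are assumptions made for illustration and do not reflect deepeval's internal data model. It rebuilds the math_agent example and shows why components can be selected independently of the trace as a whole.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Toy span: a kind ("agent", "llm", "tool", "retriever"), a name, and children."""
    kind: str
    name: str
    children: list = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# The same shape as the math_agent trace: one agent span with three component spans.
trace = Span("agent", "math_agent", [
    Span("llm", "gpt-4o-mini"),
    Span("tool", "multiply"),
    Span("llm", "gpt-4o-mini"),
])

# Select only the LLM spans, e.g. to score them separately from the whole run.
llm_spans = [span for _, span in trace.walk() if span.kind == "llm"]

# Render an indented outline of the tree.
outline = [f"{'  ' * depth}{span.kind}: {span.name}" for depth, span in trace.walk()]
```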

Running evals

There are two surfaces for running evals against a LangChain app. Choose based on where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one LangChain run; failing metrics fail the test, which fails the build.

test_langchain_agent.py
import pytest
from langchain.agents import create_agent
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_agent(golden: Golden):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Run it with:

deepeval test run test_langchain_agent.py
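
The gating idea itself is simple: a score below its threshold raises an assertion error, the test fails, and the test runner exits non-zero. The sketch below illustrates that chain with plain Python. Threshold and assert_scores are hypothetical names, not deepeval APIs, and the scores are hard-coded rather than produced by a real metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical metric spec: a name and the minimum passing score."""
    name: str
    minimum: float


def assert_scores(scores: dict, thresholds: list) -> None:
    """Raise AssertionError naming every metric whose score is below its minimum."""
    failures = [
        f"{t.name}: {scores.get(t.name, 0.0):.2f} < {t.minimum:.2f}"
        for t in thresholds
        if scores.get(t.name, 0.0) < t.minimum
    ]
    if failures:
        raise AssertionError("metrics below threshold: " + "; ".join(failures))


thresholds = [Threshold("task_completion", 0.7)]

assert_scores({"task_completion": 0.9}, thresholds)  # passes silently

try:
    assert_scores({"task_completion": 0.4}, thresholds)  # fails the "test"
except AssertionError as exc:
    failure_message = str(exc)
```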

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one LangChain run; metrics score the resulting trace through the callback.

langchain_agent.py
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

for golden in dataset.evals_iterator():
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Applying metrics to components

Passing metrics=[...] to CallbackHandler evaluates the overall LangChain run. To evaluate a component instead, attach metrics where LangChain creates that component.

Agent spans (sub-agents)

Wrap the invocation in a with next_agent_span(metrics=[...]) block. The CallbackHandler drains the staged metric onto the first agent span it opens inside the block, which is useful for scoring a sub-agent (e.g. an agent invoked as a tool, or a nested create_agent run) in isolation.

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span
...

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

for golden in dataset.evals_iterator():
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        agent.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

The same one-shot semantic as next_llm_span applies: only the first agent span in the run picks up the staged metric.

LLM calls

Wrap the invocation in a with next_llm_span(metrics=[...]) block. The CallbackHandler drains the staged metric onto the first LLM span it opens inside the block; later LLM calls in the same run get nothing. This is the same one-shot semantic used by next_*_span in the Pydantic AI / Strands / AgentCore / Google ADK integrations.

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span
...

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

for golden in dataset.evals_iterator():
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        agent.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

For deterministic tool calls, use tool spans for traceability, inputs, outputs, and metadata. Avoid attaching metrics directly to tool spans.
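
The one-shot behavior is easier to reason about with a toy model. The sketch below uses Python's contextvars to stage a value that the first matching span claims and later spans do not see. stage_next_llm_span and open_llm_span are hypothetical names chosen for this illustration; deepeval's actual implementation may work differently.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical staging slot holding metrics that wait for the next LLM span.
_staged_llm_metrics = ContextVar("staged_llm_metrics", default=None)


@contextmanager
def stage_next_llm_span(metrics):
    """Stage metrics for the first LLM span opened inside the block."""
    token = _staged_llm_metrics.set(list(metrics))
    try:
        yield
    finally:
        # Restore the previous state; anything left unclaimed is discarded.
        _staged_llm_metrics.reset(token)


def open_llm_span(name):
    """Open a span, draining the staged metrics if any are still waiting."""
    staged = _staged_llm_metrics.get()
    if staged is not None:
        _staged_llm_metrics.set(None)  # one-shot: later spans see nothing
    return {"name": name, "metrics": staged or []}


with stage_next_llm_span(["answer_relevancy"]):
    first = open_llm_span("choose_tool")    # claims the staged metric
    second = open_llm_span("final_answer")  # gets nothing

after = open_llm_span("outside_block")      # staging ended with the block
```

In an agent run like the one above, this means the staged metric lands on the LLM call that chooses the tool, not on the call that writes the final answer.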

Retriever calls

Wrap the invocation in with next_retriever_span(...) to stage a metric (or a Confident AI metric_collection) on the first retriever span LangChain opens inside the with block.

langchain_agent.py
from deepeval.integrations.langchain import CallbackHandler
from deepeval.tracing import next_retriever_span
...

for golden in dataset.evals_iterator():
    with next_retriever_span(metric_collection="retriever_v1"):
        chain.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

next_retriever_span accepts the same metrics=[...] / metric_collection=... kwargs as next_llm_span. The same one-shot semantic applies: only the first retriever span in the run picks up the staged config.

Customizing trace and span data

LangChain is instrumented per-call through callbacks, so customization happens at the callback or span-staging boundary.

  • Use CallbackHandler(...) kwargs for trace-level defaults like name, tags, metadata, thread_id, and user_id.
  • Use next_agent_span(...) / next_llm_span(...) / next_retriever_span(...) / next_tool_span(...) to stage component-level fields (metrics, metric collections, test cases, custom span metadata) onto the next span the callback opens.
  • Use tool spans for deterministic traceability, inputs, outputs, and metadata.

langchain_agent.py
callback = CallbackHandler(
    name="math-agent",
    tags=["langchain", "math"],
    metadata={"team": "support"},
    user_id="user-123",
)

agent.invoke(
    {"messages": [{"role": "user", "content": "What is 8 multiplied by 6?"}]},
    config={"callbacks": [callback]},
)
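
If thread_id and user_id vary per request, one option is to build a fresh config for each call. The sketch below is an assumption-laden illustration: config_for_request is a hypothetical helper, and StubHandler is a stand-in that only mirrors the kwarg names listed above so the example runs without deepeval installed. In a real app you would presumably pass deepeval's CallbackHandler as the factory instead.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StubHandler:
    """Stand-in that accepts the trace-level kwarg names used in this section."""
    name: str = ""
    tags: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    thread_id: str = ""
    user_id: str = ""


def config_for_request(handler_factory: Callable[..., Any], user_id: str, thread_id: str) -> dict:
    """Build a runtime config holding one freshly created handler for one request."""
    handler = handler_factory(
        name="math-agent",
        tags=["langchain", "math"],
        metadata={"team": "support"},
        thread_id=thread_id,
        user_id=user_id,
    )
    return {"callbacks": [handler]}


# Hypothetical identifiers; in practice these come from your request context.
config = config_for_request(StubHandler, user_id="user-123", thread_id="thread-42")
```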

Advanced patterns

The primitives above (CallbackHandler(...), next_*_span(...), and deepeval's tool decorator) compose around one boundary: LangChain owns the callback lifecycle, and your code chooses where to stage component config for the next span the callback opens.

Evaluate subagents with next_*_span

next_*_span(metrics=[...]) stages a metric for the next matching span the CallbackHandler opens. Use this when you want to evaluate a subagent or model step instead of the full run. Pick the helper that matches the span you want to score: next_agent_span(...) or next_llm_span(...).

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span
...

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

def run_agent(prompt: str):
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        return agent.invoke(
            {"messages": [{"role": "user", "content": prompt}]},
            config={"callbacks": [CallbackHandler()]},
        )

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the TaskCompletionMetric is staged for the agent span, so CI/CD and scripts only need to run the agent inside the staging block.

This is how you'd run it:

test_langchain_agent.py
import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_component_metrics(golden: Golden):
    run_agent(golden.input)
    assert_test(golden=golden)

Run it with:

deepeval test run test_langchain_agent.py

Or, from a script:

langchain_agent.py
...

for golden in dataset.evals_iterator():
    run_agent(golden.input)

Wrap a LangChain run in @observe

When the LangChain call is part of a larger operation, decorate the outer function with @observe. LangChain spans nest under your observed span when the callback runs inside it.

langchain_agent.py
from deepeval.tracing import observe
...

@observe(name="respond_to_user")
def respond_to_user(prompt: str) -> str:
    result = agent.invoke(
        {"messages": [{"role": "user", "content": prompt}]},
        config={"callbacks": [CallbackHandler()]},
    )
    return result["messages"][-1].content
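
The nesting behavior can be sketched with a "current span" pointer: whatever span is active when a new span opens becomes its parent. The code below is a toy model, not deepeval's tracer. The observed decorator and open_span helper are hypothetical, and open_span("agent") stands in for the spans a callback would open during agent.invoke(...).

```python
import functools
from contextvars import ContextVar

# Hypothetical pointer to the span that is currently active.
_current_span = ContextVar("current_span", default=None)


def open_span(name):
    """Create a span and attach it to whichever span is currently active."""
    span = {"name": name, "children": []}
    parent = _current_span.get()
    if parent is not None:
        parent["children"].append(span)
    return span


def observed(name):
    """Toy decorator: make the wrapped call the active span while it runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = open_span(name)
            wrapper.last_span = span  # kept only so the example can inspect it
            token = _current_span.set(span)
            try:
                return fn(*args, **kwargs)
            finally:
                _current_span.reset(token)
        return wrapper
    return decorator


@observed("respond_to_user")
def respond_to_user(prompt):
    open_span("agent")  # stands in for spans opened by a callback
    return prompt.upper()


answer = respond_to_user("hi")
root = respond_to_user.last_span
```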

API reference

CallbackHandler(...) accepts the following trace-level kwargs. Each one is a default for runs that use that callback.

| Kwarg | Type | Description |
| --- | --- | --- |
| name | str | Default trace name. |
| tags | list[str] | Tags applied to traces produced by this callback. |
| metadata | dict | Trace metadata applied when the callback starts a trace. |
| thread_id | str | Groups related runs into a single trace thread. |
| user_id | str | Actor identifier for the trace. |
| metrics | list | Metrics applied to the LangChain run. |
| metric_collection | str | Metric collection applied to the LangChain run. |
| test_case_id | str | Optional test case identifier. |
| turn_id | str | Optional turn identifier for conversational traces. |

For native tracing helpers (@observe, with trace(...), update_current_trace, update_current_span) see the tracing reference.

FAQs

Can I evaluate a sub-agent inside my LangChain agent run?
Yes. Stage a metric by wrapping agent.invoke(...) in a with next_agent_span(metrics=[...]) block, and the CallbackHandler drains it onto that sub-agent's span, scoring the sub-agent in isolation without touching the parent. It's one-shot per run, so to score every step you either drive the loop yourself or use trace-level metrics via CallbackHandler(metrics=[...]).
Can I gate CI/CD on my LangChain agent's metrics?
Yes. Pass a CallbackHandler() into the agent's config inside a parametrized pytest test, then call assert_test(golden=golden, metrics=[...]) and run deepeval test run so a failing metric fails the build.
Can I see these LangChain traces in a cloud UI?
Yes, optionally. After deepeval login, Confident AI renders every agent, LLM, tool, and retriever span produced by the CallbackHandler in a shared dashboard, with no code changes.
Can I monitor a LangChain app in production?
Yes. Keep the CallbackHandler in your production calls and set thread_id / user_id for grouping; when logged into Confident AI those live traces support online evals on real traffic.
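
As a rough illustration of what grouping by thread_id buys you, the sketch below collects hypothetical trace records into per-thread conversations. The record layout and the group_by_thread helper are assumptions for this example only and say nothing about how Confident AI stores traces.

```python
from collections import defaultdict

# Hypothetical trace records; the field names mirror the kwargs described above.
traces = [
    {"thread_id": "t-1", "user_id": "user-123", "input": "What is 8 multiplied by 6?"},
    {"thread_id": "t-2", "user_id": "user-456", "input": "What is 7 multiplied by 9?"},
    {"thread_id": "t-1", "user_id": "user-123", "input": "Now divide that by 4."},
]


def group_by_thread(records):
    """Collect inputs per thread_id, preserving arrival order within each thread."""
    threads = defaultdict(list)
    for record in records:
        threads[record["thread_id"]].append(record["input"])
    return dict(threads)


threads = group_by_thread(traces)
```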
