LangGraph



LangGraph is a low-level orchestration framework for building stateful, graph-based agent workflows. You compose agents from StateGraph nodes and edges, with full control over routing, state, and tool execution.

The deepeval integration traces LangGraph runs through a CallbackHandler that plugs into LangChain's callback system; you pass it into your graph's runtime config. Every graph run, node, model call, tool call, and nested step becomes a span you can inspect, without rewriting your LangGraph app.

langgraph_agent · deepeval

$ deepeval test run test_langgraph_agent.py

● test_langgraph_agent
└─ AGENT weather_graph · Task Completion 0.94 · 190ms
   ├─ LLM chatbot · gpt-4o-mini · G-Eval 0.42 · 72ms
   ├─ TOOL get_weather(city="Paris") · 32ms
   └─ LLM chatbot · gpt-4o-mini · Faithfulness 0.95 · 78ms

Trace score 0.77 · 2/3 metrics passed · failed
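The summary line of the sample run can be reproduced from its three span scores. The sketch below assumes the trace score is the plain mean of the metric scores (which matches 0.77 here) and that a metric passes at a threshold of 0.5. Both the aggregation rule and the threshold are assumptions made for illustration, not a description of how deepeval computes them.

```python
from statistics import mean

# Scores copied from the sample run; the 0.5 threshold is an assumption.
ASSUMED_THRESHOLD = 0.5
span_scores = {
    "Task Completion": 0.94,
    "G-Eval": 0.42,
    "Faithfulness": 0.95,
}

def summarize(scores: dict, threshold: float) -> dict:
    """Aggregate per-metric scores into a trace-level summary."""
    passed = [name for name, score in scores.items() if score >= threshold]
    return {
        "trace_score": round(mean(scores.values()), 2),
        "passed": len(passed),
        "total": len(scores),
        # The run fails as soon as any single metric falls below the threshold.
        "status": "passed" if len(passed) == len(scores) else "failed",
    }

summary = summarize(span_scores, ASSUMED_THRESHOLD)
print(summary)
```

Under these assumptions the single low G-Eval score is enough to mark the run as failed, even though the averaged trace score looks healthy.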

deepeval's LangGraph integration enables you to:

Trace any LangGraph run — pass CallbackHandler(...) through config={"callbacks": [...]} per call.
Evaluate traces or model / agent components with deepeval metrics.
Run evals from scripts or CI/CD — same callback, different surfaces.
Customize trace and span data through callback kwargs and LangChain metadata.

Getting Started

Installation

pip install -U deepeval langgraph langchain-openai

LangGraph uses LangChain's callback system, so the deepeval integration is per-call. You decide which graph runs are traced by passing CallbackHandler(...) into the graph config.

Instrument and evaluate

Wire your StateGraph (LangGraph's core abstraction), then pass CallbackHandler(...) to the invocation you want to evaluate.

langgraph_agent.py

from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def get_weather(city: str) -> str:
    """Return the weather in a city."""
    return f"It's always sunny in {city}!"

llm = init_chat_model("openai:gpt-4o-mini").bind_tools([get_weather])

def chatbot(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

graph = (
    StateGraph(MessagesState)
    .add_node(chatbot)
    .add_node("tools", ToolNode([get_weather]))
    .add_edge(START, "chatbot")
    .add_conditional_edges("chatbot", tools_condition)
    .add_edge("tools", "chatbot")
    .compile()
)

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What is the weather in Paris?")])

# The `TaskCompletionMetric` is passed into the LangGraph callback.
for golden in dataset.evals_iterator():
    graph.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Done ✅. You've run your first eval with full traceability into LangGraph via deepeval.

What gets traced

Each LangGraph run that receives a CallbackHandler produces a trace — the end-to-end unit your user observes. Inside that trace are component spans for each callback LangGraph emits through LangChain:

Graph / node spans — the compiled StateGraph invocation and each node it dispatches to.
LLM spans — chat model and completion calls inside a node.
Tool spans — tool calls executed by ToolNode (or your own).
Retriever spans — retriever calls, when your graph uses retrieval.

Trace                           ← what the user observes
└── Graph: weather_graph         ← one graph invoke(...) call
    ├── Node: chatbot           ← model picks a tool
    │   └── LLM: gpt-4o-mini
    ├── Node: tools             ← ToolNode runs the tool
    │   └── Tool: get_weather
    └── Node: chatbot           ← model writes the final answer
        └── LLM: gpt-4o-mini

The trace and its component spans are independently evaluable.
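One way to picture this structure is as a plain tree of spans. The sketch below is a hypothetical data model (the `Span` class and `walk` helper are illustrative names, not deepeval classes) that rebuilds the example trace and walks it in the order the spans were opened.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Hypothetical stand-in for a traced span; not a deepeval class."""
    kind: str  # "graph", "node", "llm", or "tool"
    name: str
    children: list = field(default_factory=list)

def walk(span: Span, depth: int = 0):
    """Yield (depth, span) pairs in the order the spans were opened."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)

# Mirror of the example trace for the weather graph.
trace = Span("graph", "weather_graph", [
    Span("node", "chatbot", [Span("llm", "gpt-4o-mini")]),
    Span("node", "tools", [Span("tool", "get_weather")]),
    Span("node", "chatbot", [Span("llm", "gpt-4o-mini")]),
])

for depth, span in walk(trace):
    print("  " * depth + f"{span.kind}: {span.name}")

# Any node in the tree can be selected on its own, e.g. every LLM span.
llm_spans = [span for _, span in walk(trace) if span.kind == "llm"]
```

Because each span is its own node, a metric can in principle target the root (the whole run) or any single component beneath it.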

Running evals

There are two surfaces for running evals against a LangGraph app. Pick by where you want results to surface — your terminal during development, or your CI pipeline as a pass/fail gate.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one LangGraph run; failing metrics fail the test, which fails the build.

test_langgraph_agent.py

import pytest
from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def get_weather(city: str) -> str:
    """Return the weather in a city."""
    return f"It's always sunny in {city}!"

llm = init_chat_model("openai:gpt-4o-mini").bind_tools([get_weather])

def chatbot(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

graph = (
    StateGraph(MessagesState)
    .add_node(chatbot)
    .add_node("tools", ToolNode([get_weather]))
    .add_edge(START, "chatbot")
    .add_conditional_edges("chatbot", tools_condition)
    .add_edge("tools", "chatbot")
    .compile()
)

dataset = EvaluationDataset(goldens=[
    Golden(input="What is the weather in Paris?"),
    Golden(input="What is the weather in London?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langgraph_agent(golden: Golden):
    graph.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Run it with:

deepeval test run test_langgraph_agent.py

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one LangGraph run; metrics score the resulting trace through the callback.

langgraph_agent.py

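from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
# `graph` is the compiled StateGraph from the earlier example.
...

dataset = EvaluationDataset(goldens=[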
    Golden(input="What is the weather in Paris?"),
    Golden(input="What is the weather in London?"),
])

for golden in dataset.evals_iterator():
    graph.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Applying metrics to components

Passing metrics=[...] to CallbackHandler evaluates the overall LangGraph run. To evaluate a component instead, attach metrics where the graph creates that component.

Agent spans (sub-agents)

Wrap the graph.invoke(...) call in a with next_agent_span(metrics=[...]): block. The CallbackHandler drains the staged metric onto the first agent span the graph emits, which is useful for scoring a sub-agent node or subgraph in isolation.

langgraph_agent.py

from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span
...

for golden in dataset.evals_iterator():
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        graph.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

The same one-shot semantic as next_llm_span applies: only the first agent span in the graph run picks up the staged metric.

LLM calls

Wrap the graph.invoke(...) call in a with next_llm_span(metrics=[...]): block. The CallbackHandler drains the staged metric onto the first LLM span the graph emits; later LLM calls on subsequent loop turns get nothing. This is the same one-shot semantic used by next_*_span in the Pydantic AI / Strands / AgentCore / Google ADK integrations.

langgraph_agent.py

from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span
...

for golden in dataset.evals_iterator():
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        graph.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

For deterministic tool calls, use tool spans for traceability, inputs, outputs, and metadata. Avoid attaching metrics directly to tool spans.

Customizing trace and span data

LangGraph is instrumented per-call through LangChain callbacks, so customization happens at the callback or span-staging boundary.

Use CallbackHandler(...) kwargs for trace-level defaults like name, tags, metadata, thread_id, and user_id.
Use next_agent_span(...) / next_llm_span(...) / next_retriever_span(...) / next_tool_span(...) to stage component-level fields (metrics, metric collections, test cases, custom span metadata) onto the next span the callback opens.
Use tool spans for deterministic traceability, inputs, outputs, and metadata.

langgraph_agent.py

from deepeval.integrations.langchain import CallbackHandler
...

callback = CallbackHandler(
    name="weather-graph",
    tags=["langgraph", "weather"],
    metadata={"team": "support"},
    user_id="user-123",
)

graph.invoke(
    {"messages": [{"role": "user", "content": "What is the weather in Paris?"}]},
    config={"callbacks": [callback]},
)
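When these values vary per request, it can help to build the keyword arguments in one place. The helper below is a hypothetical sketch: `trace_kwargs`, its parameters, and the `env` tag are assumptions for illustration, while the dictionary keys are the trace-level kwargs listed in the API reference.

```python
def trace_kwargs(user_id: str, conversation_id: str, env: str = "dev") -> dict:
    """Build per-request keyword arguments for CallbackHandler(...).

    Hypothetical helper; only the dictionary keys come from the API reference.
    """
    return {
        "name": "weather-graph",
        "tags": ["langgraph", "weather", env],
        "metadata": {"team": "support", "env": env},
        "thread_id": conversation_id,  # groups related runs into one thread
        "user_id": user_id,
    }

kwargs = trace_kwargs(user_id="user-123", conversation_id="conv-42", env="prod")
print(kwargs)

# Intended use (requires deepeval; `graph` is the compiled graph from earlier):
# callback = CallbackHandler(**kwargs)
# graph.invoke(inputs, config={"callbacks": [callback]})
```

Creating a fresh callback per call keeps per-request values such as thread_id and user_id from leaking between runs.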

Advanced patterns

The primitives above — CallbackHandler(...) and next_*_span(...) — compose around one boundary: LangGraph owns the graph execution lifecycle, and your code chooses where to stage component config for the next span the callback opens.

Evaluate subagents with `next_*_span`

next_*_span(metrics=[...]) stages a metric for the next matching span the CallbackHandler opens during the graph run. Use this when you want to evaluate a subagent node or model step instead of the full graph. Pick the helper that matches the span you want to score: next_agent_span(...) or next_llm_span(...).

langgraph_agent.py

from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span
...

llm = init_chat_model("openai:gpt-4o-mini")

def chatbot(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

graph = (
    StateGraph(MessagesState)
    .add_node(chatbot)
    .add_edge(START, "chatbot")
    .add_edge("chatbot", END)
    .compile()
)

def run_graph(prompt: str):
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        return graph.invoke(
            {"messages": [{"role": "user", "content": prompt}]},
            config={"callbacks": [CallbackHandler()]},
        )

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not required here because the TaskCompletionMetric is already staged for the agent span, so CI/CD tests and scripts only need to run the graph inside the staging block.

To run it from pytest:

test_langgraph_agent.py

import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_component_metrics(golden: Golden):
    run_graph(golden.input)
    assert_test(golden=golden)

deepeval test run test_langgraph_agent.py

Or run the same function from a script:

langgraph_agent.py

...

for golden in dataset.evals_iterator():
    run_graph(golden.input)

Wrap a LangGraph run in `@observe`

When the LangGraph call is part of a larger operation, decorate the outer function with @observe. LangGraph spans nest under your observed span when the callback runs inside it.

langgraph_agent.py

from deepeval.tracing import observe
...

@observe(name="respond_to_user")
def respond_to_user(prompt: str):
    return graph.invoke(
        {"messages": [{"role": "user", "content": prompt}]},
        config={"callbacks": [CallbackHandler()]},
    )

API reference

CallbackHandler(...) accepts the following trace-level kwargs. Each one is a default for runs that use that callback.

Kwarg	Type	Description
`name`	`str`	Default trace name.
`tags`	`list[str]`	Tags applied to traces produced by this callback.
`metadata`	`dict`	Trace metadata applied when the callback starts a trace.
`thread_id`	`str`	Groups related runs into a single trace thread.
`user_id`	`str`	Actor identifier for the trace.
`metrics`	`list`	Metrics applied to the LangGraph run.
`metric_collection`	`str`	Metric collection applied to the LangGraph run.
`test_case_id`	`str`	Optional test case identifier.
`turn_id`	`str`	Optional turn identifier for conversational traces.

For native tracing helpers (@observe, with trace(...), update_current_trace, update_current_span) see the tracing reference.

FAQs

Can I evaluate a sub-agent node inside my graph?

Yes. Wrap graph.invoke(...) in with next_agent_span(metrics=[...]) and the CallbackHandler drains the metric onto the first agent span the graph emits, so that sub-agent is scored on its own. Staging is one-shot per run: to score every loop turn, either drive the loop yourself or score end-to-end with trace-level metrics.

Can I fail CI when a LangGraph metric regresses?

Yes. Pass CallbackHandler() into the graph config inside a parametrized pytest test and assert with assert_test(...) under deepeval test run.

Where do my LangGraph traces show up beyond the console?

Run deepeval login and Confident AI visualizes the full graph trace — every node, model call, and tool call as nested spans — with their scores in a shared cloud UI. It's optional.

Can I keep evaluating a deployed LangGraph app in production?

Yes. Keep passing the CallbackHandler in production and group runs with thread_id. Once you are logged into Confident AI, those live traces power online evals on real traffic.
