LangGraph
LangGraph is an open-source framework for building agentic applications powered by large language models. It models workflows as graphs, letting you combine LLM calls, tools, and external data sources into advanced generative AI solutions.
End-to-End Evals
deepeval allows you to evaluate LangGraph applications end-to-end in under a minute.
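If you haven't set up yet, a quick install sketch — the package list assumes the OpenAI-backed example below, so adjust it to your model provider:

```shell
pip install -U deepeval langgraph langchain-openai
```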
Configure LangGraph
Create a CallbackHandler with the list of task completion metrics you wish to use, and pass it in the config of your LangGraph application's invoke method.
```python
from langgraph.prebuilt import create_react_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric

def get_weather(city: str) -> str:
    """Returns the weather in a city"""
    return f"It's always sunny in {city}!"

agent = create_react_agent(
    model="openai:gpt-4o-mini",
    tools=[get_weather],
    prompt="You are a helpful assistant",
)

task_completion_metric = TaskCompletionMetric()

result = agent.invoke(
    input={"messages": [{"role": "user", "content": "what is the weather in sf"}]},
    config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
)
print(result)
```
Run evaluations
Create an EvaluationDataset and invoke your LangGraph application for each golden within the evals_iterator() loop to run end-to-end evaluations.
```python
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric

goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]
dataset = EvaluationDataset(goldens=goldens)

for golden in dataset.evals_iterator():
    agent.invoke(
        input={"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )
```
Alternatively, to run your agent asynchronously, wrap each ainvoke call in an asyncio task and pass it to dataset.evaluate():

```python
import asyncio

from deepeval.dataset import Golden, EvaluationDataset
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric

dataset = EvaluationDataset(goldens=[
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
])

for golden in dataset.evals_iterator():
    task = asyncio.create_task(
        agent.ainvoke(
            input={"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
        )
    )
    dataset.evaluate(task)
```
✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
Component-level Evals
deepeval also lets you evaluate individual components of your LangGraph application.
LLM
Define metrics in the metadata of all the BaseLanguageModel instances in your LangGraph application.
```python
from langchain_openai import ChatOpenAI
from deepeval.metrics import AnswerRelevancyMetric
...
llm = ChatOpenAI(
    model="gpt-4o-mini",
    metadata={"metric": [AnswerRelevancyMetric()]}
).bind_tools([get_weather])
```
Tool
To pass metrics to tools, use deepeval's LangChain tool decorator in place of LangChain's.
```python
# from langchain_core.tools import tool
from deepeval.integrations.langchain import tool
from deepeval.metrics import AnswerRelevancyMetric
...
@tool(metric=[AnswerRelevancyMetric()])
def get_weather(location: str) -> str:
    """Get the current weather in a location."""
    return f"It's always sunny in {location}!"
```
Evals in Production
To run online evaluations in production, replace metrics in the CallbackHandler with a metric collection string from Confident AI, then push your LangGraph agent to production.
```python
result = agent.invoke(
    input={"messages": [{"role": "user", "content": "What is 8 multiplied by 6?"}]},
    config={"callbacks": [CallbackHandler(metric_collection="<metric-collection-name-with-task-completion>")]},
)
```