
LangChain

LangChain is an open-source framework for developing applications powered by large language models. It lets you chain LLMs together with external data sources and tools to build agents and other generative AI workflows.

End-to-End Evals

deepeval allows you to evaluate LangChain applications end-to-end in under a minute.

Configure LangChain

Create a CallbackHandler with the task completion metrics you wish to use, and pass it to your LangChain application's invoke method via the callbacks config.

main.py
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

from deepeval.integrations.langchain import CallbackHandler

from deepeval.metrics import TaskCompletionMetric

@tool
def multiply(a: int, b: int) -> int:
    """Returns the product of two numbers"""
    return a * b

llm = ChatOpenAI(model="gpt-4o-mini")

agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that can perform mathematical operations."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

agent = create_tool_calling_agent(llm, [multiply], agent_prompt)
agent_executor = AgentExecutor(agent=agent, tools=[multiply], verbose=True)

result = agent_executor.invoke(
    {"input": "What is 8 multiplied by 6?"},
    config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]}
)

print(result)
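
The TaskCompletionMetric can also be configured before it is passed to the CallbackHandler. The sketch below is illustrative and assumes TaskCompletionMetric accepts the standard deepeval metric arguments (threshold, model, include_reason):

from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric

# Assumed configuration, mirroring other deepeval metrics:
# passing threshold, judge model, and whether to generate a reason.
task_completion = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True,
)

handler = CallbackHandler(metrics=[task_completion])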

Run evaluations

Create an EvaluationDataset and invoke your LangChain application for each golden within the evals_iterator() loop to run end-to-end evaluations.

main.py
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 3 * 12?"),
    Golden(input="What is 8 * 6?")
])

for golden in dataset.evals_iterator():
    agent_executor.invoke(
        {"input": golden.input},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]}
    )

If your application is asynchronous, use ainvoke instead and pass each task to dataset.evaluate():

main.py
import asyncio

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 3 * 12?"),
    Golden(input="What is 8 * 6?")
])

for golden in dataset.evals_iterator():
    task = asyncio.create_task(
        agent_executor.ainvoke(
            {"input": golden.input},
            config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]}
        )
    )
    dataset.evaluate(task)

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.

Component-level Evals

Using deepeval, you can now evaluate individual components of your LangChain application.

LLM

Define metrics in the metadata of all the BaseLanguageModels in your LangChain application. The get_weather tool bound in this example is defined in the Tool section below.

main.py
from langchain_openai import ChatOpenAI
from deepeval.metrics import AnswerRelevancyMetric
...

llm = ChatOpenAI(
    model="gpt-4o-mini", 
    metadata={"metric": [AnswerRelevancyMetric()]}
).bind_tools([get_weather])

Tool

To pass metrics to your tools, use DeepEval's LangChain tool decorator.

main.py
# from langchain_core.tools import tool
from deepeval.integrations.langchain import tool
from deepeval.metrics import AnswerRelevancyMetric
...

@tool(metric=[AnswerRelevancyMetric()])
def get_weather(location: str) -> str:
    """Get the current weather in a location."""
    return f"It's always sunny in {location}!"
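
Putting the two together, here is a minimal sketch of running component-level evals inside the evals_iterator loop. It assumes that DeepEval's tool decorator returns a standard LangChain tool usable with AgentExecutor, and that a bare CallbackHandler() (with no metrics of its own) is enough to trace the run when metrics are attached at the component level:

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.integrations.langchain import CallbackHandler, tool
from deepeval.metrics import AnswerRelevancyMetric

@tool(metric=[AnswerRelevancyMetric()])  # component-level metric on the tool
def get_weather(location: str) -> str:
    """Get the current weather in a location."""
    return f"It's always sunny in {location}!"

# Component-level metric on the LLM, set via metadata
llm = ChatOpenAI(
    model="gpt-4o-mini",
    metadata={"metric": [AnswerRelevancyMetric()]}
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful weather assistant."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

agent = create_tool_calling_agent(llm, [get_weather], prompt)
agent_executor = AgentExecutor(agent=agent, tools=[get_weather])

dataset = EvaluationDataset(goldens=[Golden(input="What's the weather in Paris?")])

for golden in dataset.evals_iterator():
    agent_executor.invoke(
        {"input": golden.input},
        # Assumption: component-level metrics are picked up without passing
        # metrics to the CallbackHandler itself.
        config={"callbacks": [CallbackHandler()]}
    )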

Evals in Production

To run online evaluations in production, replace the metrics argument in CallbackHandler with a metric collection string from Confident AI, and push your LangChain agent to production.

result = agent_executor.invoke(
    {"input": "What is 8 multiplied by 6?"},
    config={"callbacks": [CallbackHandler(metric_collection="<metric-collection-name-with-task-completion>")]}
)
