LangChain

LangChain is an open-source framework for developing applications powered by large language models, enabling chaining of LLMs with external data sources and expressive workflows to build advanced generative AI solutions.

End-to-End Evals

deepeval allows you to evaluate LangChain applications end-to-end in under a minute.

Configure LangChain

Create a CallbackHandler with the task completion metrics you want to use, and pass it to your LangChain application's invoke method through the callbacks entry of config.

main.py

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

from deepeval.integrations.langchain import CallbackHandler

from deepeval.metrics import TaskCompletionMetric

@tool
def multiply(a: int, b: int) -> int:
    """Returns the product of two numbers"""
    return a * b

llm = ChatOpenAI(model="gpt-4o-mini")

agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that can perform mathematical operations."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

agent = create_tool_calling_agent(llm, [multiply], agent_prompt)
agent_executor = AgentExecutor(agent=agent, tools=[multiply], verbose=True)

# Example invocation (uncomment to run a single call):
# result = agent_executor.invoke(
#     {"input": "What is 8 multiplied by 6?"},
#     config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]}
# )
#
# print(result)
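Before handing a tool to an agent, it can help to confirm that the underlying function is correct on its own, so that a failed evaluation points at the agent rather than at the tool. The sketch below repeats the multiply body without the @tool decorator so that it runs without LangChain installed; the checks dictionary is a hypothetical helper, not part of deepeval or LangChain.

```python
def multiply(a: int, b: int) -> int:
    """Returns the product of two numbers"""
    return a * b


# Answers to the two multiplication questions asked in this guide.
checks = {(3, 12): 36, (8, 6): 48}

for (a, b), expected in checks.items():
    # Fail loudly if the tool logic itself is wrong.
    assert multiply(a, b) == expected, f"multiply({a}, {b}) != {expected}"
```

With the decorated version, the same check would likely go through the tool's own invoke method, for example multiply.invoke({"a": 8, "b": 6}), since the decorator wraps the function in a tool object.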

Run evaluations

Create an EvaluationDataset and invoke your LangChain application for each golden within the evals_iterator() loop to run end-to-end evaluations.

main.py

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 3 * 12?"),
    Golden(input="What is 8 * 6?")
])

for golden in dataset.evals_iterator():
    agent_executor.invoke(
        {"input": golden.input},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]}
    )

main.py

# Asynchronous variant of the loop above
import asyncio

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 3 * 12?"),
    Golden(input="What is 8 * 6?")
])

for golden in dataset.evals_iterator():
    task = asyncio.create_task(
        agent_executor.ainvoke(
            {"input": golden.input},
            config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]}
        )
    )
    dataset.evaluate(task)

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
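asyncio.create_task only works while an event loop is running, so the asynchronous loop above presumably relies on evals_iterator() and dataset.evaluate() to drive each task to completion; that is an assumption, as this page does not spell it out. The sketch below shows the same schedule-then-await pattern using only the standard library. fake_ainvoke and run_all are hypothetical stand-ins and do not call an LLM or deepeval.

```python
import asyncio


async def fake_ainvoke(payload: dict) -> dict:
    """Hypothetical stand-in for agent_executor.ainvoke; no LLM is called."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"input": payload["input"], "output": "stub answer"}


async def run_all(inputs: list[str]) -> list[dict]:
    # create_task needs a running event loop; asyncio.run provides one here.
    tasks = [asyncio.create_task(fake_ainvoke({"input": text})) for text in inputs]
    # gather awaits every task and returns the results in submission order.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all(["What is 3 * 12?", "What is 8 * 6?"]))
```

Scheduling all tasks before awaiting any of them lets the invocations overlap, which is the main reason to prefer ainvoke over invoke when a dataset has many goldens.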

LLM

To pass metrics to an LLM, set them under the "metric" key of the model's metadata.

from langchain_openai import ChatOpenAI
from deepeval.metrics import AnswerRelevancyMetric
...

llm = ChatOpenAI(
    model="gpt-4o-mini", 
    metadata={"metric": [AnswerRelevancyMetric()]}
).bind_tools([get_weather])

Tool

To pass metrics to a tool, use DeepEval's LangChain tool decorator in place of the one from langchain_core.

main.py

# from langchain_core.tools import tool
from deepeval.integrations.langchain import tool
from deepeval.metrics import AnswerRelevancyMetric
...

@tool(metric=[AnswerRelevancyMetric()])
def get_weather(location: str) -> str:
    """Get the current weather in a location."""
    return f"It's always sunny in {location}!"

Evals in Production

To run online evaluations in production, replace the metrics argument of CallbackHandler with metric_collection, set to the name of a metric collection on Confident AI, and deploy your LangChain agent.

result = agent_executor.invoke(
    {"input": "What is 8 multiplied by 6?"},
    config={"callbacks": [CallbackHandler(metric_collection="<metric-collection-name-with-task-completion>")]}
)
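One way to keep a single code path for both local and production runs is to choose the CallbackHandler keyword arguments from the environment. The following is a minimal sketch under stated assumptions: handler_kwargs and the EVAL_METRIC_COLLECTION variable are hypothetical names, not part of deepeval; only the metrics and metric_collection keywords come from the examples on this page.

```python
import os


def handler_kwargs(dev_metrics: list, collection_env: str = "EVAL_METRIC_COLLECTION") -> dict:
    """Pick CallbackHandler keyword arguments for the current environment."""
    collection = os.environ.get(collection_env)
    if collection:
        # Production: refer to a metric collection by name.
        return {"metric_collection": collection}
    # Development: run the locally defined metrics.
    return {"metrics": dev_metrics}
```

The result would then be unpacked into the handler, for example CallbackHandler(**handler_kwargs([TaskCompletionMetric()])).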