
CrewAI

CrewAI is a lean, independent Python framework for creating and orchestrating autonomous multi-agent AI systems, offering high flexibility, speed, and precise control for complex automation tasks.
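To follow along, install both packages (the standard PyPI package names, assuming a recent Python environment):

pip install crewai deepeval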

End-to-End Evals

deepeval allows you to evaluate CrewAI applications end-to-end in under a minute.

Configure CrewAI

Create a Crew and use instrument_crewai to instrument your CrewAI application.

main.py
import random

from crewai import Task, Crew, Agent
from crewai.tools import tool

from deepeval.integrations.crewai import instrument_crewai

# Patch CrewAI so deepeval can trace agent, task, and tool executions
instrument_crewai()

@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city. Returns temperature and conditions."""
    weather_data = {
        "New York": "Partly Cloudy",
        "London": "Rainy",
        "Tokyo": "Sunny",
        "Paris": "Cloudy",
        "Sydney": "Clear",
    }

    condition = weather_data.get(city, "Clear")
    temperature = f"{random.randint(45, 95)}°F"
    humidity = f"{random.randint(30, 90)}%"

    return f"Weather in {city}: {temperature}, {condition}, Humidity: {humidity}"


agent = Agent(
    role="Weather Reporter",
    goal="Provide accurate and helpful weather information to users.",
    backstory="An experienced meteorologist who loves helping people plan their day with accurate weather reports.",
    tools=[get_weather],
    verbose=True,
)

task = Task(
    description="Get the current weather for {city} and provide a helpful summary.",
    expected_output="A clear weather report including temperature, conditions, and humidity.",
    agent=agent,
)

crew = Crew(
    agents=[agent],
    tasks=[task],
)
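You can sanity-check the setup with a single invocation before adding evaluations (a quick sketch; the inputs dict keys must match the placeholders in your task description):

result = crew.kickoff(inputs={"city": "Tokyo"})
print(result)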

Run evaluations

Create an EvaluationDataset and invoke your CrewAI application for each golden within the evals_iterator() loop to run end-to-end evaluations. Pass the metrics to the trace context manager.

main.py
from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden

answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="London"),
        Golden(input="Paris"),
    ]
)

for golden in dataset.evals_iterator():
    with trace(trace_metrics=[answer_relevancy_metric]):
        crew.kickoff({"city": golden.input})
For asynchronous applications, wrap each invocation in a coroutine that calls kickoff_async, and pass the resulting task to dataset.evaluate():

main.py
import asyncio

from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden

answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="London"),
        Golden(input="Paris"),
    ]
)

async def run_crewai_e2e_async(input: str):
    with trace(trace_metrics=[answer_relevancy_metric]):
        await crew.kickoff_async({"city": input})

for golden in dataset.evals_iterator():
    task = asyncio.create_task(run_crewai_e2e_async(golden.input))
    dataset.evaluate(task)

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.

Evals in Production

To run online evaluations in production, replace your local metrics with the name of a metric collection from Confident AI, then push your CrewAI agent to production.

...
with trace(trace_metric_collection="test_collection_1"):
    result = crew.kickoff({"city": "London"})
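In production, deepeval also needs your Confident AI credentials available at runtime. A minimal sketch, assuming credentials are supplied through the CONFIDENT_API_KEY environment variable (check your Confident AI project settings for the exact setup):

export CONFIDENT_API_KEY="<your-confident-api-key>"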
