CrewAI
CrewAI is a lean, independent Python framework designed for creating and orchestrating autonomous multi-agent AI systems, offering high flexibility, speed, and precision control for complex automation tasks.
End-to-End Evals
deepeval allows you to evaluate CrewAI applications end-to-end in under a minute.
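If you haven't set up either library yet, install both and authenticate with Confident AI first (a minimal setup sketch, assuming you install via pip and log in through the deepeval CLI):

pip install deepeval crewai
deepeval login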
Configure CrewAI
Create a Crew and use instrument_crewai to instrument your CrewAI application.
import random
from crewai import Task, Crew, Agent
from crewai.tools import tool

from deepeval.integrations.crewai import instrument_crewai

instrument_crewai()

@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city. Returns temperature and conditions."""
    weather_data = {
        "New York": "Partly Cloudy",
        "London": "Rainy",
        "Tokyo": "Sunny",
        "Paris": "Cloudy",
        "Sydney": "Clear",
    }
    condition = weather_data.get(city, "Clear")
    temperature = f"{random.randint(45, 95)}°F"
    humidity = f"{random.randint(30, 90)}%"
    return f"Weather in {city}: {temperature}, {condition}, Humidity: {humidity}"

agent = Agent(
    role="Weather Reporter",
    goal="Provide accurate and helpful weather information to users.",
    backstory="An experienced meteorologist who loves helping people plan their day with accurate weather reports.",
    tools=[get_weather],
    verbose=True,
)

task = Task(
    description="Get the current weather for {city} and provide a helpful summary.",
    expected_output="A clear weather report including temperature, conditions, and humidity.",
    agent=agent,
)

crew = Crew(
    agents=[agent],
    tasks=[task],
)
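You can smoke-test the instrumented crew with a single kickoff before wiring up evaluations (an optional sanity check; "Tokyo" is just an illustrative input):

result = crew.kickoff({"city": "Tokyo"})
print(result)

Run Evaluations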
Create an EvaluationDataset and invoke your CrewAI application for each golden within the evals_iterator() loop to run end-to-end evaluations. Pass the metrics to the trace context manager.
from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden

answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="London"),
        Golden(input="Paris"),
    ]
)

for golden in dataset.evals_iterator():
    with trace(trace_metrics=[answer_relevancy_metric]):
        crew.kickoff({"city": golden.input})
If your application runs asynchronously, wrap the kickoff in an async function and pass each task to dataset.evaluate instead:

import asyncio

async def run_crewai_e2e_async(input: str):
    with trace(trace_metrics=[answer_relevancy_metric]):
        await crew.kickoff_async({"city": input})

for golden in dataset.evals_iterator():
    task = asyncio.create_task(run_crewai_e2e_async(golden.input))
    dataset.evaluate(task)

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
Evals in Production
To run online evaluations in production, replace trace_metrics with trace_metric_collection, passing the name of a metric collection from Confident AI, and push your CrewAI agent to production.
...
with trace(trace_metric_collection="test_collection_1"):
    result = crew.kickoff({"city": "London"})
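For traces and online evaluations to reach Confident AI from a deployed agent, the process also needs your Confident AI API key at runtime. A minimal sketch, assuming deepeval reads it from the CONFIDENT_API_KEY environment variable (inject it via your deployment's secret manager rather than hardcoding it):

import os

# Assumption: deepeval authenticates with Confident AI via the
# CONFIDENT_API_KEY environment variable when no CLI login is available.
os.environ["CONFIDENT_API_KEY"] = "<your-confident-api-key>"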