Task Completion
LLM-as-a-judge | Single-turn | Referenceless | Agent
The task completion metric uses LLM-as-a-judge to evaluate how effectively an LLM agent accomplishes a task. Task Completion is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.
> **info:** Task Completion analyzes your agent's full trace to determine task success, which requires setting up tracing.
Usage
To begin, set up tracing with the `@observe` decorator and supply the `TaskCompletionMetric()` to your dataset's `evals_iterator()`.
```python
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.metrics import TaskCompletionMetric


@observe()
def trip_planner_agent(input):
    destination = "Paris"
    days = 2

    @observe()
    def restaurant_finder(city):
        return ["Le Jules Verne", "Angelina Paris", "Septime"]

    @observe()
    def itinerary_generator(destination, days):
        return ["Eiffel Tower", "Louvre Museum", "Montmartre"][:days]

    itinerary = itinerary_generator(destination, days)
    restaurants = restaurant_finder(destination)
    return itinerary + restaurants


# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="This is a test query")])

# Initialize metric
task_completion = TaskCompletionMetric(threshold=0.7, model="gpt-4o")

# Loop through dataset
for golden in dataset.evals_iterator(metrics=[task_completion]):
    trip_planner_agent(golden.input)
```
There are SEVEN optional parameters when creating a `TaskCompletionMetric`:

- [Optional] `threshold`: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] `task`: a string representing the task to be completed. If no task is supplied, it is automatically inferred from the trace. Defaulted to `None`.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to `'gpt-4o'`.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
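As a concrete illustration, the sketch below constructs a fully customized metric using the parameters above. The task string is hypothetical, chosen to match the trip-planner agent from the usage example:

```python
from deepeval.metrics import TaskCompletionMetric

# All parameters below are optional; the task string is illustrative.
task_completion = TaskCompletionMetric(
    threshold=0.7,  # passing score raised from the 0.5 default
    task="Plan a 2-day Paris itinerary with restaurant recommendations",
    model="gpt-4o",
    include_reason=True,  # attach the judge's reasoning to the score
    strict_mode=False,    # keep graded 0-1 scoring instead of binary pass/fail
    verbose_mode=True,    # print intermediate extraction/judging steps
)
```

Supplying `task` explicitly is useful when the trace alone may not make the user's intent obvious; otherwise, letting the metric infer it from the trace is the simpler default.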
To learn more about how the `evals_iterator` works, click here.
How Is It Calculated?
The `TaskCompletionMetric` score is calculated according to the following equation:

$$\text{Task Completion} = \text{AlignmentScore}(\text{Task}, \text{Outcome})$$

- Task and Outcome are extracted from the trace (or test case for end-to-end evaluation) using an LLM.
- The Alignment Score measures how well the outcome aligns with the extracted (or user-provided) task, as judged by an LLM.
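Conceptually, the two steps can be outlined as below. This is an illustrative sketch, not DeepEval's actual implementation; both helper functions are hypothetical stand-ins for the underlying LLM judge calls:

```python
# Illustrative outline of the two-step calculation -- NOT DeepEval's
# internals. Both helpers are hypothetical stand-ins for LLM calls.

def extract_task_and_outcome(trace: str) -> tuple[str, str]:
    """Step 1 (hypothetical): an LLM reads the full agent trace and
    extracts the task being attempted and the outcome produced."""
    ...

def judge_alignment(task: str, outcome: str) -> float:
    """Step 2 (hypothetical): an LLM judges how well the outcome aligns
    with the task, returning a score between 0 and 1."""
    ...

def task_completion_score(trace: str, task: str | None = None) -> float:
    extracted_task, outcome = extract_task_and_outcome(trace)
    # A user-provided `task` overrides the one extracted from the trace.
    return judge_alignment(task or extracted_task, outcome)
```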