Tool Correctness

LLM-as-a-judge

Single-turn

Referenceless

Agent

Multimodal

The tool correctness metric is an agentic LLM metric that assesses your LLM agent's function/tool calling ability. It is calculated by checking whether every tool that was expected to be used was actually called and, optionally, whether the agent's selection of tools was optimal.

Required Arguments

To use the ToolCorrectnessMetric, you'll have to provide the following arguments when creating an LLMTestCase:

input
actual_output
tools_called
expected_tools

Read the How Is It Calculated section below to learn how test case parameters are used for metric calculation.

Usage

The ToolCorrectnessMetric() can be used for end-to-end evaluation of text-based and multimodal test cases:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    # Replace this with the tools that were actually used by your LLM agent
    tools_called=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="WebSearch")],
)
metric = ToolCorrectnessMetric()

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, MLLMImage, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

metric = ToolCorrectnessMetric(
    threshold=0.7,
    model="gpt-4.1",
    include_reason=True
)
test_case = LLMTestCase(
    input=f"What's in this image? {MLLMImage(...)}",
    actual_output="The image shows a pair of running shoes.",
    tools_called=[ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="ImageAnalysis")],
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])

There are EIGHT optional parameters when creating a ToolCorrectnessMetric:

[Optional] available_tools: a list of ToolCalls that give context on all the tools that were available to your LLM agent. This list is used to evaluate your agent's tool selection capability.
[Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
[Optional] evaluation_params: a list of ToolCallParams indicating the strictness of the correctness criteria. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. For example, supplying a list containing ToolCallParams.INPUT_PARAMETERS but excluding ToolCallParams.OUTPUT will deem a tool correct if the tool name and input parameters match, even if the output does not. Defaults to an empty list.
[Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
[Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
[Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
[Optional] should_consider_ordering: a boolean which when set to True, will consider the order in which the tools were called. For example, if expected_tools=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery"), ToolCall(name="WebSearch")] and tools_called=[ToolCall(name="WebSearch"), ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")], the metric will consider the tool calling to be correct when ordering is not considered, because the same tools were called and only their order differs. Applies to the tool name matching that is always performed, and defaulted to False.
[Optional] should_exact_match: a boolean which when set to True, will require the tools_called and expected_tools to be exactly the same. Applies to the tool name matching that is always performed, and additionally checks ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT when they are included in evaluation_params. Defaulted to False.
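
To make the difference between the matching modes concrete, here is a small plain-Python sketch that compares the two tool-name lists from the should_consider_ordering example. It does not use deepeval and is not its implementation: the helper names are hypothetical, and the longest common subsequence is shown only as one standard way to measure order-aware overlap, assumed here for illustration.

```python
from collections import Counter

# Tool names from the should_consider_ordering example above.
expected = ["WebSearch", "ToolQuery", "WebSearch"]
called = ["WebSearch", "WebSearch", "ToolQuery"]


def same_tools_any_order(a: list[str], b: list[str]) -> bool:
    """Order-insensitive comparison: same names with the same counts."""
    return Counter(a) == Counter(b)


def same_tools_exact(a: list[str], b: list[str]) -> bool:
    """Exact comparison: same names in the same positions."""
    return a == b


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two name lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(same_tools_any_order(called, expected))  # True
print(same_tools_exact(called, expected))      # False
print(lcs_length(called, expected))            # 2
```

Ignoring order, the two lists contain the same calls. Compared position by position they differ, and only two of the three calls line up in sequence.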

Within components

You can also run the ToolCorrectnessMetric within nested components for component-level evaluation.

from deepeval.dataset import Golden
from deepeval.tracing import observe, update_current_span
...

@observe(metrics=[metric])
def inner_component():
    # Set test case at runtime
    test_case = LLMTestCase(input="...", actual_output="...")
    update_current_span(test_case=test_case)
    return

@observe
def llm_app(input: str):
    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.
    inner_component()
    return

evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])

As a standalone

You can also run the ToolCorrectnessMetric on a single test case as a standalone, one-off execution.

...

metric.measure(test_case)
print(metric.score, metric.reason)

How Is It Calculated?

The tool correctness metric score is calculated using the following steps:

Find the deterministic score by comparing tools_called against expected_tools with the following equation:

\text{Tool Correctness} = \frac{\text{Number of Correctly Used Tools (or Correct Input Parameters/Outputs)}}{\text{Total Number of Tools Called}}

This metric assesses the accuracy of your agent's tool usage by comparing the tools_called by your LLM agent to the list of expected_tools. A score of 1 indicates that every tool used by your LLM agent was called correctly according to the list of expected_tools, should_consider_ordering, and should_exact_match, while a score of 0 signifies that none of the tools_called were called correctly.
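
The equation can be sketched in plain Python for the name-only case. This is a minimal illustration rather than deepeval's actual code: the helper name_match_score is hypothetical, its handling of an empty tools_called list is an assumption, and it ignores should_consider_ordering, should_exact_match, and evaluation_params.

```python
from collections import Counter


def name_match_score(tools_called: list[str], expected_tools: list[str]) -> float:
    """Fraction of called tools that match an expected tool by name.

    Hypothetical helper that mirrors the equation above, not deepeval's code.
    """
    if not tools_called:
        # Assumption: with nothing called, score 1.0 only if nothing was expected.
        return 1.0 if not expected_tools else 0.0

    remaining = Counter(expected_tools)  # each expected call can be matched once
    correct = 0
    for name in tools_called:
        if remaining[name] > 0:
            remaining[name] -= 1
            correct += 1
    return correct / len(tools_called)


# Tool names from the usage example: one of the two calls was expected.
score = name_match_score(["WebSearch", "ToolQuery"], ["WebSearch"])
print(score)  # 0.5
```

Under these assumptions, the extra ToolQuery call halves the score, which lands exactly on the default threshold of 0.5.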

If the available_tools are provided, the ToolCorrectnessMetric also uses an LLM to find whether the tools_called were the most optimal for the given task using the available_tools as reference. The final score is the minimum of both scores. If available_tools is not provided, the LLM-based evaluation does not take place.
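
How the two scores combine can be shown in a few lines. The sketch below is illustrative only: combine_scores is a hypothetical helper, the input scores are made-up values, and comparing the result with >= against the threshold is an assumption based on the threshold being described as the minimum passing score.

```python
from typing import Optional


def combine_scores(deterministic: float, optimality: Optional[float] = None) -> float:
    """Return the minimum of both scores when an LLM optimality score exists."""
    if optimality is None:
        # No available_tools were supplied, so no LLM-based evaluation ran.
        return deterministic
    return min(deterministic, optimality)


threshold = 0.5  # the documented default threshold

# Made-up scores: every expected tool was called, but the LLM judged
# the selection to be far from optimal.
final = combine_scores(deterministic=1.0, optimality=0.4)
print(final, final >= threshold)  # 0.4 False
```

Taking the minimum means a perfect match against expected_tools cannot mask a poor tool selection, and the reverse also holds.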

FAQs

Does Tool Correctness use an LLM?

Mostly not. The core score is deterministic — tools_called vs expected_tools, counting matches. An LLM is only used if you pass available_tools to judge optimality, and the final score is the minimum of both.

How do I make matching stricter than just tool names?

Add ToolCallParams.INPUT_PARAMETERS or ToolCallParams.OUTPUT to evaluation_params to also match inputs/outputs. Use should_consider_ordering=True for call order, or should_exact_match=True for identical lists (overrides ordering).

Tool Correctness vs Argument Correctness — what's the difference?

Tool Correctness checks the right tools were called, deterministically against expected_tools. Argument Correctness checks the arguments passed to those tools, via an LLM with no reference.

Can I evaluate whether the agent picked the best tool, not just a valid one?

Yes — pass available_tools and an LLM judges whether the tools_called were optimal. Without it, the metric only checks calls against expected_tools deterministically.

Can I use Tool Correctness with LangChain, OpenAI, or another framework?

Yes. deepeval auto-traces agents built with LangChain, OpenAI, LlamaIndex, CrewAI, and more — see all framework integrations.
