
LlamaIndex

LlamaIndex is an orchestration framework that simplifies data ingestion, indexing, and querying, allowing developers to integrate private and public data into LLM applications for retrieval-augmented generation and knowledge augmentation.

tip

We recommend logging in to Confident AI to view your LlamaIndex evaluation traces.

deepeval login

End-to-End Evals

deepeval allows you to evaluate LlamaIndex applications end-to-end in under a minute.
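If you haven't already, install deepeval along with LlamaIndex and its OpenAI LLM integration (the package names below are the standard PyPI ones; adjust to your environment):

pip install -U deepeval llama-index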

Configure LlamaIndex

Set up tracing for LlamaIndex and create an agent. Use the trace context manager to supply an AgentSpanContext (or an LlmSpanContext if you want to evaluate the LLM span instead).

main.py
import asyncio

from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.tracing.trace_context import AgentSpanContext
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import trace

instrument_llama_index(instrument.get_dispatcher())


def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b


agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
)

answer_relevancy_metric = AnswerRelevancyMetric()


async def llm_app(input: str):
    agent_span_context = AgentSpanContext(
        metrics=[answer_relevancy_metric],
    )
    with trace(agent_span_context=agent_span_context):
        return await agent.run(input)
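
To sanity-check the setup locally, you can invoke llm_app directly before wiring it into a dataset (a minimal sketch using one of the example inputs from the dataset below):

if __name__ == "__main__":
    # Runs the traced agent once; the AgentSpanContext attaches the metric to this trace.
    print(asyncio.run(llm_app("What is 3 * 12?")))
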
info

Only metrics whose required LLM test case parameters are input and output are eligible for evaluation here.
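
For example, AnswerRelevancyMetric only needs the input and actual output of an LLM interaction, so it qualifies. You can confirm this with a standalone measurement outside of tracing (a minimal sketch, independent of the agent above):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# AnswerRelevancyMetric only requires the input and actual_output
# parameters of an LLM interaction, so it can run on agent spans.
metric = AnswerRelevancyMetric()
metric.measure(LLMTestCase(input="What is 3 * 12?", actual_output="3 * 12 is 36."))
print(metric.score, metric.reason)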

Run evaluations

Create an EvaluationDataset and invoke your LlamaIndex application for each golden within the evals_iterator() loop to run end-to-end evaluations.

main.py
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(
    goldens=[Golden(input="What is 3 * 12?"), Golden(input="What is 4 * 13?")]
)

for golden in dataset.evals_iterator():
    task = asyncio.create_task(llm_app(golden.input))
    dataset.evaluate(task)

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.

View on Confident AI (optional)

note

If you need to evaluate individual components of your LlamaIndex application, set up tracing instead.
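
As a rough sketch of what component-level evaluation looks like, you can wrap a function with deepeval's observe decorator and attach metrics to that span (the generate_answer function and its body are hypothetical placeholders; see the tracing docs for the exact setup):

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical component wrapped as its own evaluated span.
@observe(metrics=[AnswerRelevancyMetric()])
def generate_answer(question: str) -> str:
    answer = "placeholder answer"  # call your LlamaIndex components here
    update_current_span(
        test_case=LLMTestCase(input=question, actual_output=answer)
    )
    return answer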

Evals in Production

To run online evaluations in production, simply replace metrics in AgentSpanContext with metric_collection, the name of a metric collection you've created on Confident AI, and push your LlamaIndex agent to production.

async def llm_app(input: str):
    agent_span_context = AgentSpanContext(
        metric_collection="test_collection_1",
    )
    with trace(agent_span_context=agent_span_context):
        return await agent.run(input)