LlamaIndex
LlamaIndex is an orchestration framework that simplifies data ingestion, indexing, and querying, allowing developers to integrate private and public data into LLM applications for retrieval-augmented generation and knowledge augmentation.
We recommend logging in to Confident AI to view your LlamaIndex evaluation traces.
deepeval login
End-to-End Evals
deepeval allows you to evaluate LlamaIndex applications end-to-end in under a minute.
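If you haven't already, install both libraries (assuming the standard PyPI distributions deepeval and llama-index):

pip install -U deepeval llama-index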
Configure LlamaIndex
Set up tracing for LlamaIndex and create an agent. Use the trace context manager to set up an AgentSpanContext (or an LlmSpanContext if you want to evaluate the LLM span instead).
import asyncio

from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.tracing.trace_context import AgentSpanContext
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import trace

# Route LlamaIndex instrumentation events to deepeval's tracer
instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
)

answer_relevancy_metric = AnswerRelevancyMetric()

async def llm_app(input: str):
    # Attach the metric to the agent span for this invocation
    agent_span_context = AgentSpanContext(
        metrics=[answer_relevancy_metric],
    )
    with trace(agent_span_context=agent_span_context):
        return await agent.run(input)
Only metrics that evaluate the input and output LLM test case parameters are eligible for evaluation.
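To evaluate the LLM span instead of the agent span, the same pattern applies with LlmSpanContext. The sketch below is illustrative only: it assumes LlmSpanContext lives alongside AgentSpanContext and that trace accepts an llm_span_context keyword mirroring agent_span_context; check the current deepeval API if these names differ.

from deepeval.tracing.trace_context import LlmSpanContext  # assumed import path

async def llm_app(input: str):
    # Attach the metric to the LLM span for this invocation (keyword name assumed)
    llm_span_context = LlmSpanContext(
        metrics=[answer_relevancy_metric],
    )
    with trace(llm_span_context=llm_span_context):
        return await agent.run(input)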
Run evaluations
Create an EvaluationDataset and invoke your LlamaIndex application for each golden within the evals_iterator() loop to run end-to-end evaluations. Because the agent runs asynchronously, schedule each invocation as an asyncio task and pass it to dataset.evaluate:
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(
    goldens=[Golden(input="What is 3 * 12?"), Golden(input="What is 4 * 13?")]
)

for golden in dataset.evals_iterator():
    # Schedule the async agent call and hand the task to the dataset for evaluation
    task = asyncio.create_task(llm_app(golden.input))
    dataset.evaluate(task)
✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
View on Confident AI (optional)
If you are logged in to Confident AI, the test run and its individual evaluation traces can be viewed in the platform once the run completes.
If you need to evaluate individual components of your LlamaIndex application, set up tracing instead.
Evals in Production
To run online evaluations in production, simply replace metrics in AgentSpanContext with metric_collection, the name of a metric collection you've created on Confident AI, and push your LlamaIndex agent to production.
async def llm_app(input: str):
    # Metrics in this collection are evaluated online on Confident AI for every trace
    agent_span_context = AgentSpanContext(
        metric_collection="test_collection_1",
    )
    with trace(agent_span_context=agent_span_context):
        return await agent.run(input)
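Once deployed, every call to the instrumented agent produces a trace that is evaluated online against the metrics in that collection. Reusing the imports and agent defined earlier, a call might look like the following; the query is only illustrative:

if __name__ == "__main__":
    # Illustrative production call; in practice the input comes from your users
    print(asyncio.run(llm_app("What is 3 * 12?")))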