The LLM Evaluation Framework
Used by some of the world's leading AI companies, DeepEval enables you to build reliable evaluation pipelines to test any AI system.
Native integration with Pytest that fits right into your CI workflow (see the sketch below).
50+ research-backed metrics, including custom G-Eval and deterministic metrics.
Covers any use case and any system architecture, including multi-turn conversations.
Evaluate text, images, and audio with built-in multi-modal test cases.
No test data? No problem. Generate synthetic data and simulate conversations (see the synthesizer sketch below).
No need to manually tweak prompts. DeepEval automatically optimizes prompts for you.
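To make the Pytest integration concrete, here is a minimal sketch. The file name, the example input/output strings, and the 0.7 threshold are illustrative; the metric uses an LLM judge by default, so an API key for your evaluation model is expected.

```python
# test_chatbot.py: minimal sketch of DeepEval's Pytest integration.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace with the actual output produced by your LLM application
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # Fails the Pytest test if the relevancy score falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Run it with `deepeval test run test_chatbot.py` (or plain `pytest`) so a failing metric fails your CI job like any other test.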
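For the no-test-data case, DeepEval's synthesizer can generate goldens from your own documents. The sketch below assumes the `Synthesizer` class and a `generate_goldens_from_docs` method; the method name, its parameters, and the file path are assumptions to verify against the docs for your installed version.

```python
# Hedged sketch: generating synthetic goldens (test data) from documents.
# Method name and parameters are assumptions; check the DeepEval docs.
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.pdf"],  # hypothetical document path
)
print(goldens)  # goldens can then be turned into test cases for evaluation
```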
Criteria-based, chain-of-thought reasoning for nuanced, subjective scoring via a form-filling paradigm, as in the G-Eval sketch below.
A tree-based, directed acyclic graph approach for objective, multi-step, conditional scoring (see the hedged DAG sketch below).
Question-Answer Generation for equation-based scoring computed from answers to closed-ended questions.
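As an illustration of the criteria-based approach, a minimal G-Eval sketch is shown below; the metric name, criteria string, and test case contents are illustrative placeholders.

```python
# Minimal G-Eval sketch: an LLM judge scores the test case against
# natural-language criteria using chain-of-thought reasoning.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    # Illustrative criteria; write criteria that match your own use case
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="When was the Eiffel Tower completed?",
    actual_output="The Eiffel Tower was completed in 1889.",
    expected_output="1889",
)
correctness.measure(test_case)
print(correctness.score, correctness.reason)
```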
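The DAG-based approach can be sketched as follows; the node class names (`TaskNode`, `BinaryJudgementNode`, `VerdictNode`) and their parameters are reconstructed from memory and should be treated as assumptions, to be checked against the current DAG metric documentation.

```python
# Hedged sketch of a deterministic DAG metric; class and parameter names
# are assumptions from memory, verify against the DeepEval docs.
from deepeval.metrics import DAGMetric
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    VerdictNode,
)
from deepeval.test_case import LLMTestCaseParams

# Leaf verdicts map each judgement to a deterministic score
has_citation = BinaryJudgementNode(
    criteria="Does the extracted answer include at least one citation?",
    children=[
        VerdictNode(verdict=True, score=10),
        VerdictNode(verdict=False, score=0),
    ],
)

# A task node performs an intermediate extraction step before judging
extract_answer = TaskNode(
    instructions="Extract the final answer from the actual output.",
    output_label="Extracted answer",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    children=[has_citation],
)

citation_format = DAGMetric(
    name="Citation Format",
    dag=DeepAcyclicGraph(root_nodes=[extract_answer]),
)
```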
Built by the authors of DeepEval, Confident AI is a cloud LLM evaluation platform that lets you use DeepEval for team-wide, collaborative AI testing.


