Using Custom LLMs for Evaluation

All of deepeval's metrics uses LLMs for evaluation, and is currently defaulted to OpenAI's GPT models. However, for users that don't wish to use OpenAI's GPT models and would instead prefer other providers such as Claude (Anthropic), Gemini (Google), Llama-3 (Meta), or Mistral, deepeval provides an easy way for anyone to use literally ANY custom LLM for evaluation.

This guide will show you how to create custom LLMs for evaluation in deepeval, and demonstrate various methods to enforce valid JSON LLM outputs that are required for evaluation with the following examples:

Llama-3 8B from Hugging Face transformers
Mistral-7B v0.3 from Hugging Face transformers
Gemini 1.5 Flash from Vertex AI
Claude-3 Opus from Anthropic

Creating A Custom LLM

Here's a quick example on a custom Llama-3 8B model being used for evaluation in deepeval:

import transformers
import torch

from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

from deepeval.models import DeepEvalBaseLLM


class CustomLlama3_8B(DeepEvalBaseLLM):
    def __init__(self):
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

        model_4bit = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct",
            device_map="auto",
            quantization_config=quantization_config,
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct"
        )

        self.model = model_4bit
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        return pipeline(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Llama-3 8B"

There are SIX rules to follow when creating a custom LLM evaluation model:

Inherit DeepEvalBaseLLM.
Implement the get_model_name() method, which simply returns a string representing your custom model name.
Implement the load_model() method, which will be responsible for returning a model object.
Implement the generate() method with one and only one parameter of type string that acts as the prompt to your custom LLM.
The generate() method should return the generated string output from your custom LLM. Note that we called pipeline(prompt) to access the model generations in this particular example, but this could be different depending on the implementation of your custom model object.
Implement the a_generate() method, with the same function signature as generate(). Note that this is an async method. In this example, we called self.generate(prompt), which simply reuses the synchronous generate() method. However, although optional, you should implement an asynchronous version (if possible) to speed up evaluation.

Then, instantiate the CustomLlama3_8B class and test the generate() (or a_generate()) method out:

...

custom_llm = CustomLlama3_8B()
print(custom_llm.generate("Write me a joke"))

Finally, supply it to a metric to run evaluations using your custom LLM:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=custom_llm)
metric.measure(...)

Congratulations 🎉! You can now evaluate using any custom LLM of your choice on all LLM evaluation metrics offered by deepeval.

More Examples

Azure OpenAI Example

Here is an example of creating a custom Azure OpenAI model through langchain's AzureChatOpenAI module for evaluation:

from langchain_openai import AzureChatOpenAI
from deepeval.models.base_model import DeepEvalBaseLLM

class AzureOpenAI(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace these with real values
custom_model = AzureChatOpenAI(
    openai_api_version=api_version,
    azure_deployment=azure_deployment,
    azure_endpoint=azure_endpoint,
    openai_api_key=openai_api_key,
)
azure_openai = AzureOpenAI(model=custom_model)
print(azure_openai.generate("Write me a joke"))

When creating a custom LLM evaluation model you should ALWAYS:

inherit DeepEvalBaseLLM.
implement the get_model_name() method, which simply returns a string representing your custom model name.
implement the load_model() method, which will be responsible for returning a model object.
implement the generate() method with one and only one parameter of type string that acts as the prompt to your custom LLM.
the generate() method should return the final output string of your custom LLM. Note that we called chat_model.invoke(prompt).content to access the model generations in this particular example, but this could be different depending on the implementation of your custom model object.
implement the a_generate() method, with the same function signature as generate(). Note that this is an async method. In this example, we called await chat_model.ainvoke(prompt), which is an asynchronous wrapper provided by LangChain's chat models.

Lastly, to use it for evaluation for an LLM-Eval:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=azure_openai)

Mistral 7B Example

Here is an example of creating a custom Mistral 7B model through Hugging Face's transformers library for evaluation:

from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda" # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        return self.tokenizer.batch_decode(generated_ids)[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)
print(mistral_7b.generate("Write me a joke"))

Note that for this particular implementation, we initialized our Mistral7B model with an additional tokenizer parameter, as this is required in the decoding step of the generate() method.

You'll notice we simply reused generate() in a_generate(), because unfortunately there's no asynchronous interface for Hugging Face's transformers library, which would make all metric executions a synchronous, blocking process.

However, you can try offloading the generation process to a separate thread instead:

import asyncio

class Mistral7B(DeepEvalBaseLLM):
    # ... (existing code) ...

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

Some additional considerations and reasons why you should be extra careful with this implementation:

Running the generation in a separate thread may not fully utilize GPU resources if the model is GPU-based.
There could be potential performance implications of frequently switching between threads.
You'd need to ensure thread safety if multiple async generations are happening concurrently and sharing resources.

Lastly, to use your custom Mistral7B model for evaluation:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=mistral_7b)

Google VertexAI Example

Here is an example of creating a custom Google's Gemini model through langchain's ChatVertexAI module for evaluation:

from langchain_google_vertexai import (
    ChatVertexAI,
    HarmBlockThreshold,
    HarmCategory
)
from deepeval.models.base_model import DeepEvalBaseLLM

class GoogleVertexAI(DeepEvalBaseLLM):
    """Class to implement Vertex AI for DeepEval"""
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Vertex AI Model"

# Initialize safety filters for vertex model
# This is important to ensure no evaluation responses are blocked
safety_settings = {
    HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
}

#TODO : Add values for project and location below
custom_model_gemini = ChatVertexAI(
    model_name="gemini-2.5-flash"
    , safety_settings=safety_settings
    , project= "<project-id>"
    , location= "<region>" #example : us-central1
)

# initialize the  wrapper class
vertexai_gemini = GoogleVertexAI(model=custom_model_gemini)
print(vertexai_gemini.generate("Write me a joke"))

To use it for evaluation for an LLM-Eval:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=vertexai_gemini)

AWS Bedrock Example

Here is an example of creating a custom AWS Bedrock model through the langchain_community.chat_models module for evaluation:

from langchain_community.chat_models import BedrockChat
from deepeval.models.base_model import DeepEvalBaseLLM

class AWSBedrock(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace these with real values
custom_model = BedrockChat(
    credentials_profile_name=<your-profile-name>, # e.g. "default"
    region_name=<your-region-name>, # e.g. "us-east-1"
    endpoint_url=<your-bedrock-endpoint>, # e.g. "https://bedrock-runtime.us-east-1.amazonaws.com"
    model_id=<your-model-id>, # e.g. "anthropic.claude-v2"
    model_kwargs={"temperature": 0.4},
)

aws_bedrock = AWSBedrock(model=custom_model)
print(aws_bedrock.generate("Write me a joke"))

Finally, supply the newly created aws_bedrock model to LLM-Evals:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=aws_bedrock)

JSON Confinement for Custom LLMs

In the previous section, we learnt how to create a custom LLM, but if you've ever used custom LLMs for evaluation in deepeval, you may have encountered the following error:

ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.

This error arises when the custom LLM used for evaluation is unable to generate valid JSONs during metric calculation, which stops the evaluation process altogether. This happens because for smaller and less powerful LLMs, prompt engineering alone is not sufficient to enforce JSON outputs, which so happens to be the method used in deepeval's metrics. As a result, it's vital to find a workaround for users not using OpenAI's GPT models for evaluation.

All of deepeval's metrics require the evaluation model to generate valid JSONs to extract properties such as: reasons, verdicts, statements, and other types of LLM-generated responses that are later used for calculating metric scores, and so when the generated JSONs required to extract these properties are invalid (eg. missing brackets, incomplete string quotations, extra trailing commas, or mismatched keys), deepeval won't be able to use the necessary information required for metric calculation. Here's an example of an invalid JSON an open-source model like mistralai/Mistral-7B-Instruct-v0.3 might output:

{
    "reaso: "The actual output does directly not address the input",
}

Rewriting the `generate()` and `a_generate()` Method Signatures

In the previous section, we saw how the generate() and a_generate() methods must accept one argument of type str and return the corresponding LLM generated str. To enforce JSON outputs generated by your custom LLM, the first step is to rewrite the generate() and a_generate() method to accept an additional argument of type BaseModel, and output a BaseModel instead of a str.

Continuing from the CustomLlama3_8B example, here is what the method signature for the new generate() and a_generate() methods should look like:

from pydantic import BaseModel

class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        pass

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

You might be wondering, how does changing the method signature help with enforcing JSON outputs?

It helps because in deepeval's metrics, when there is a schema: BaseModel argument defined for the generate() and/or a_generate() method, deepeval will inject your generate methods with the Pydantic schemas which you can leverage to enforce JSON outputs. Let's see how we can do that.

Reimplementing the `generate()` and `a_generate()` Methods

With the new method signatures, deepeval will now automatically inject your custom LLM with the required Pydantic schemas, which you can leverage to enforce JSON outputs for each LLM generation.

There are many ways to leverage Pydantic schemas to confine LLMs to generate valid JSONs, and continuing with our CustomLlama3_8B example we will be using the lm-format-enforcer library to confine JSON outputs using the provided Pydantic schema.

pip install lm-format-enforcer

import json
import transformers

from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)

from deepeval.models import DeepEvalBaseLLM


class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Same as the previous example above
        model = self.load_model()
        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Create parser required for JSON confinement using lmformatenforcer
        parser = JsonSchemaParser(schema.model_json_schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            pipeline.tokenizer, parser
        )

        # Output and load valid JSON
        output_dict = pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]["generated_text"][len(prompt) :]
        json_result = json.loads(output)

        # Return valid JSON object according to the schema DeepEval supplied
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

Now, try running metrics with the new generate() and a_generate() methods:

from deepeval.metrics import AnswerRelevancyMetric
...

custom_llm = CustomLlama3_8B()
metric = AnswerRelevancyMetric(model=custom_llm)
metric.measure(...)

Congratulations 🎉! You can now evaluate using any custom LLM of your choice on all LLM evaluation metrics offered by deepeval, without JSON errors (hopefully).

In the next section, we'll go through two JSON confinement libraries that covers a wide range of LLM interfaces.

JSON Confinement libraries

There are two JSON confinement libraries that you should know about depending on the custom LLM you're using:

lm-format-enforcer: The LM-Format-Enforcer is a versatile library designed to standardize the output formats of language models. It supports Python-based language models across various platforms, including popular frameworks such as transformers, langchain, llamaindex, llama.cpp, vLLM, Haystack, NVIDIA, TensorRT-LLM, and ExLlamaV2. For comprehensive details about the package and advanced usage instructions, please visit the LM-format-enforcer github page. The LM-Format-Enforcer combines a character-level parser with a tokenizer prefix tree. Unlike other libraries that strictly enforce output formats, this method enables LLMs to sequentially generate tokens that meet output format constraints, thereby enhancing the quality of the output.
instructor: Instructor is a user-friendly python library built on top of Pydantic. It enables straightforward confinement of your LLM's output by encapsulating your LLM client within an Instructor method. It simplifies the process of extracting structured data, such as JSON, from LLMs including GPT-3.5, GPT-4, GPT-4-Vision, and open-source models like Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python. For more information on advanced usage or integration with other models not covered here, please consult the documentation.

In the final section, we'll show several popular end-to-end examples of custom LLMs using either lm-format-enforcer or instructor for JSON confinement.

More Examples

`Mistral-7B-Instruct-v0.3` through `transformers`

Begin by installing the lm-format-enforcer package:

pip install lm-format-enforcer

Here's a full example of a JSON confined custom Mistral 7B model implemented through transformers:

import json

from pydantic import BaseModel
import torch

from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

from deepeval.models import DeepEvalBaseLLM


class CustomMistral7B(DeepEvalBaseLLM):
    def __init__(self):
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

        model_4bit = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.3",
            device_map="auto",
            quantization_config=quantization_config,
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.3"
        )

        self.model = model_4bit
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        model = self.load_model()
        pipeline = pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Create parser required for JSON confinement using lmformatenforcer
        parser = JsonSchemaParser(schema.model_json_schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            pipeline.tokenizer, parser
        )

        # Output and load valid JSON
        output_dict = pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]["generated_text"][len(prompt) :]
        json_result = json.loads(output)

        # Return valid JSON object according to the schema DeepEval supplied
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Mistral-7B v0.3"

`gemini-2.5-flash` through Vertex AI

Begin by installing the instructor package via pip:

pip install instructor

from pydantic import BaseModel
import google.generativeai as genai
import instructor

from deepeval.models import DeepEvalBaseLLM


class CustomGeminiFlash(DeepEvalBaseLLM):
    def __init__(self):
        self.model = genai.GenerativeModel(model_name="models/gemini-2.5-flash")

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_gemini(
            client=client,
            mode=instructor.Mode.GEMINI_JSON,
        )
        resp = instructor_client.messages.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Gemini 1.5 Flash"

`claude-3-opus` through Anthropic

Begin by installing the instructor package via pip:

pip install instructor

from pydantic import BaseModel
from anthropic import Anthropic

from deepeval.models import DeepEvalBaseLLM


class CustomClaudeOpus(DeepEvalBaseLLM):
    def __init__(self):
        self.model = Anthropic()

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_anthropic(client)
        resp = instructor_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Claude-3 Opus"

Others

For any additional implementations, please come and ask away in the DeepEval discord server, we'll be happy to have you.

On this page