With Literal AI, you can compare different prompt, model, and LLM application versions in A/B tests.

We are currently working on a full A/B Testing feature for Literal AI. In the near future, you will be able to use Literal AI to:

  • Create A/B tests with different prompt, model or application versions. You can define populations with either Tags or Metadata.
  • Monitor A/B test charts in a dashboard, grouped by population over a time range.

For now, you can run A/B tests either programmatically in your code (1) or by human evaluation (2).

First, create two different prompts. For example, you can give each prompt a different system message, or change the LLM settings. Then, randomly assign one of the prompt versions to each new conversation. Make sure to attach a tag for the group to the Thread, for later reference.

import os
import random
from literalai import LiteralClient

literal_client = LiteralClient(api_key=os.getenv("LITERAL_API_KEY"))

# pull the prompts from Literal AI
prompt_A = literal_client.api.get_prompt(name="Prompt A")
prompt_B = literal_client.api.get_prompt(name="Prompt B")

@literal_client.step(type="run", name="Agent Run")
def answer(user_query: str, group: str):
    # answer using prompt A or B
    if group == "A":
        return "answer to question with prompt A" # e.g. llm call with prompt A
    else:
        return "answer to question with prompt B" # e.g. llm call with prompt B

with literal_client.thread() as thread:

    question = "What are the features of Literal AI?"

    # randomly assign this conversation to group A or B
    group = "A" if random.random() < 0.5 else "B"
        
    # add tag to thread for later reference
    thread.tags = [group]

    literal_client.message(content=question, type="user_message", name="User")
    response = answer(question, group)
    literal_client.message(content=response, type="assistant_message", name="My Assistant")
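
For the actual LLM call inside answer(), you can format the pulled prompt and send it to your model provider. The sketch below is one way to do this, assuming OpenAI as the provider and that the prompt versions expose format_messages() and settings as in the Literal AI Python SDK; the answer_with_prompt helper and the fallback model name are only placeholders for illustration.

from openai import OpenAI

openai_client = OpenAI()
literal_client.instrument_openai()  # log the generations to Literal AI

def answer_with_prompt(prompt, user_query: str) -> str:
    # format_messages() renders the prompt template into chat messages
    messages = prompt.format_messages()
    messages.append({"role": "user", "content": user_query})
    completion = openai_client.chat.completions.create(
        # use the model configured on the prompt version, with a placeholder fallback
        model=(prompt.settings or {}).get("model", "gpt-4o-mini"),
        messages=messages,
    )
    return completion.choices[0].message.content

# inside answer(): return answer_with_prompt(prompt_A, user_query) for group A,
# and answer_with_prompt(prompt_B, user_query) for group B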

Now, you can either:

  1. Pull the Threads from Literal AI into your code, filtered by Tags (example), run your evaluation on both groups, and compare the results.
  2. Pull the Threads from Literal AI into your code, filtered by Tags, then compare the two groups on the scores given by human feedback from your users.

An example of option 1 is available in this tutorial.
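
For option 2, a minimal sketch of pulling the tagged Threads and comparing human-feedback scores could look like the following. The filter format (a "tags" field with an "in" operator) and the assumption that feedback is stored as scores on the Thread's steps may need to be adapted to your setup.

def average_feedback(group: str) -> float:
    # fetch up to 100 Threads tagged with the group label
    threads = literal_client.api.get_threads(
        first=100,
        filters=[{"field": "tags", "operator": "in", "value": [group]}],
    )
    # collect the human-feedback scores attached to the steps of each Thread
    values = []
    for thread in threads.data:
        for step in thread.steps or []:
            for score in step.scores or []:
                values.append(score.value)
    return sum(values) / len(values) if values else 0.0

print("Average feedback for group A:", average_feedback("A"))
print("Average feedback for group B:", average_feedback("B"))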