With Literal AI, you can compare different prompt, model, and LLM application versions in A/B tests.

We are currently working on a full A/B Testing feature for Literal AI. In the near future, you will be able to use Literal AI to:

  • Create A/B tests with different prompt, model or application versions. You can define populations with either Tags or Metadata.
  • Monitor A/B test charts in a dashboard, grouped by population over a time range.

For now, you can run A/B tests yourself and evaluate the results either in your code (option 1) or through human evaluation (option 2).

First, create two prompt versions. For example, you can use different system messages or change the LLM settings. Then, randomly assign one version to each new conversation. Make sure to attach a tag identifying the group to the conversation's Thread, so you can filter by it later.
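The random assignment step can be sketched as follows. This is a minimal illustration, not Literal AI's API: the variant names (`prompt-a`, `prompt-b`), the settings, and the `assign_variant` helper are all hypothetical, and attaching the tag to the Thread is left to your Literal AI client.

```python
import random

# Hypothetical prompt variants; names and settings are illustrative only.
PROMPT_VARIANTS = {
    "prompt-a": {"system": "You are a concise assistant.", "temperature": 0.2},
    "prompt-b": {"system": "You are a friendly, detailed assistant.", "temperature": 0.7},
}


def assign_variant(rng=random):
    """Pick one variant uniformly at random and return (tag, settings)."""
    tag = rng.choice(sorted(PROMPT_VARIANTS))
    return tag, PROMPT_VARIANTS[tag]


# Call this once per new conversation.
tag, settings = assign_variant()

# Use `settings` to build the LLM call, and attach `tag` to the Thread
# through your Literal AI client (that call is not shown here).
print(tag, settings["temperature"])
```

Assigning the variant once per conversation, rather than once per message, keeps every step of a Thread in the same group.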

Now, you can either:

  1. Pull the Threads from Literal AI into your code, filtered by Tags, run your evaluation on both groups, and compare the results.
  2. Pull the Threads from Literal AI into your code, filtered by Tags, then view the scores given through human feedback (by users) and compare the two groups.

An example of option 1 is available in this tutorial.
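Both options end with the same comparison: group the scores by tag and compare the groups. A minimal sketch of that last step is shown here, assuming the Threads have already been pulled and reduced to plain dictionaries. The `tags` and `score` keys, the sample data, and the `summarize_by_tag` helper are assumptions made for illustration; they do not reflect the shape of the objects returned by Literal AI.

```python
from statistics import mean


def summarize_by_tag(threads, tags):
    """Return the number of scored threads and the mean score per tag."""
    groups = {tag: [] for tag in tags}
    for thread in threads:
        score = thread.get("score")
        if score is None:
            continue  # skip threads with no evaluation or feedback score
        for tag in tags:
            if tag in thread.get("tags", []):
                groups[tag].append(score)
    return {
        tag: {"n": len(scores), "mean": mean(scores) if scores else None}
        for tag, scores in groups.items()
    }


# Hypothetical data standing in for Threads pulled from Literal AI.
threads = [
    {"tags": ["prompt-a"], "score": 1.0},
    {"tags": ["prompt-a"], "score": 0.0},
    {"tags": ["prompt-a"], "score": None},
    {"tags": ["prompt-b"], "score": 1.0},
    {"tags": ["prompt-b"], "score": 1.0},
]

summary = summarize_by_tag(threads, ["prompt-a", "prompt-b"])
for tag, stats in summary.items():
    print(tag, stats)
```

With small groups, a difference in means can easily be noise, so it is worth checking the group sizes (`n`) before drawing conclusions.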