Introduction

Evaluation is key to enabling continuous deployment of LLM-based applications: it is what guarantees that newer versions perform better than previous ones. To best capture the user experience, you must understand the multiple steps that make up the application, and as AI applications grow in complexity, they tend to chain more and more of these steps.

Literal AI lets you log & monitor the various steps of your LLM application. By doing so, you can continuously improve the performance of your LLM system, building the most relevant metrics at each level (a minimal logging sketch follows this list):

  • Thread — user satisfaction, etc.
  • Agent Run — task completion, etc.
  • LLM Generation — hallucination, etc.
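
As a minimal sketch, here is how such step-level logging can look with the Literal AI Python SDK. It assumes the SDK's `thread` context manager, `step` decorator, and a `flush_and_stop` call to send pending logs; the function bodies are placeholders for your own application logic.

import os
from literalai import LiteralClient

literal_client = LiteralClient(api_key=os.getenv("LITERAL_API_KEY"))

# Each decorated function is logged as a step of the given type.
@literal_client.step(type="run")
def agent_run(question: str) -> str:
    return generate_answer(question)

@literal_client.step(type="llm")
def generate_answer(question: str) -> str:
    # Call your LLM provider here; the generation is logged as an LLM step.
    return "answer"

# Group the steps of a single conversation under one Thread.
with literal_client.thread(name="User session"):
    agent_run("What is Literal AI?")

literal_client.flush_and_stop()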

A typical example is vanilla Retrieval Augmented Generation (RAG), which augments Large Language Models (LLMs) with domain-specific data. Metrics you can score a RAG pipeline against include (see the sketch after this list):

  • context relevancy
  • faithfulness
  • answer relevancy
  • etc.
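
As an illustration, here is a hedged sketch of naive versions of two of these metrics, built on embedding similarity. The `embed` function is a placeholder for whichever embedding model you use; frameworks such as Ragas ship more robust, LLM-based implementations of these metrics.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: call your embedding model of choice here.
    raise NotImplementedError

def answer_relevancy(question: str, answer: str) -> float:
    # Naive answer relevancy: cosine similarity between question and answer embeddings.
    q, a = embed(question), embed(answer)
    return float(np.dot(q, a) / (np.linalg.norm(q) * np.linalg.norm(a)))

def context_relevancy(question: str, contexts: list[str]) -> float:
    # Naive context relevancy: average similarity of each retrieved chunk to the question.
    q = embed(question)
    sims = []
    for chunk in contexts:
        c = embed(chunk)
        sims.append(float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c))))
    return sum(sims) / len(sims) if sims else 0.0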

Compare Experiments

If you have run two Experiments (for example with a different prompt or LLM settings) on the same dataset, you can compare the results on the Literal AI platform. You can view the scores in charts and go over each data item to see the difference.

Comparison of Experiments

Literal AI offers a compelling interface for both users and developers to build and interact with datasets. Through Dataset objects, you can iterate on a parameter of the LLM system, such as the prompt template, LLM provider, or temperature, and view its impact via Experiments.

A developer can also iterate on their prompt template and check the impact of their modifications against production data, as in the sketch below.
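
For example, a framework-agnostic sketch of such an iteration: two prompt template variants run over the same inputs, each of which would typically be logged as its own Experiment on Literal AI. The `run_llm` helper is hypothetical.

# Hypothetical helper that calls your LLM provider and returns the completion text.
def run_llm(prompt: str) -> str:
    ...

TEMPLATE_V1 = "Answer the question: {question}"
TEMPLATE_V2 = "You are a helpful assistant. Answer concisely: {question}"

questions = ["What is Literal AI?", "How do Experiments work?"]

# Run each template variant on the same inputs; each variant maps to one Experiment.
results = {
    name: [run_llm(template.format(question=q)) for q in questions]
    for name, template in {"v1": TEMPLATE_V1, "v2": TEMPLATE_V2}.items()
}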

Evaluations can either take place:

  • on a Dataset containing a number of steps' inputs and outputs to score: these are offline evaluations
  • on live production data: these are continuous, or online, evaluations

Offline evaluations

Offline evaluations test your LLM application against specific datasets. For offline evaluations, developers can leverage their favorite framework, such as promptfoo, Ragas, OpenAI Evals, or LangChain Evaluators.

Two examples of offline evaluations with Literal AI:

  • Prompt Iteration with Promptfoo: our cookbook shows how a developer can leverage the promptfoo library in TypeScript to compute similarity scores and persist experiment results directly on Literal AI.
  • Context Relevancy with Ragas: our notebook runs an example RAG application and scores its context relevancy.

For more examples on evaluations, please browse our Literal AI Cookbooks.

Experiment API

To log the results of an Experiment, you can use the Literal AI clients. First, create an Experiment. An Experiment is always tied to a Dataset.

import os
from literalai import LiteralClient

literal_client = LiteralClient(api_key=os.getenv("LITERAL_API_KEY"))

experiment = literal_client.api.create_experiment(
    name="Foo",
    dataset_id="<DATASET_ID>",
    prompt_id="<PROMPT_ID>",  # optional: prompt to link the experiment to
    params=[{"foo": "bar"}],  # optional: experiment parameters
)

Then, log items on this experiment:

scores = [{
    "name": "context_relevancy",
    "type": "AI",
    "value": 0.6
}]

experiment_item = {
    "datasetItemId": "<DATASET_ITEM_ID>",  # ID of the dataset item being scored
    "scores": scores,  # scores in the format above
    "input": { "question": "question" },
    "output": { "content": "answer" }  # output of the LLM system, to be compared with the expected output stored in the dataset
}

experiment.log(experiment_item)
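
Putting it together, here is a hedged sketch of a full experiment run over a dataset. It assumes the dataset can be fetched with `literal_client.api.get_dataset` and that its items expose `id`, `input`, and `expected_output` attributes; `answer_question` and `score_item` stand in for your own application and scoring logic.

dataset = literal_client.api.get_dataset(id="<DATASET_ID>")

for item in dataset.items:
    # Run your application on the dataset item's input (hypothetical function).
    output = answer_question(item.input)

    # Compute scores for this item, e.g. with Ragas or a custom scorer (hypothetical function).
    scores = [{
        "name": "context_relevancy",
        "type": "AI",
        "value": score_item(item.input, output, item.expected_output)
    }]

    experiment.log({
        "datasetItemId": item.id,
        "scores": scores,
        "input": item.input,
        "output": output
    })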

Online (continuous) evaluations

Contact us here to learn more about how we perform continuous evaluations of LLM applications.