Building and evaluating a RAG chatbot

In this guide you will learn how to build, monitor, evaluate and improve a conversational application. The chatbot uses Retrieval Augmented Generation (RAG) to retrieve contexts for answer generation.

In combination with Literal AI we will use the following tools :

OpenAI’s embeddings and models
Chainlit as the basis for our chat application
Pinecone as the vector database where we will store the data for our RAG

1. Build a RAG chatbot with Chainlit

First, we pick Pinecone as vector database within our RAG. Another vector store could also be used. Make sure to store your API keys of Literal AI, OpenAI and Pinecone in a .env file as follows. Make sure you have data documents stored in this vector database.

LITERAL_API_KEY=
OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_CLOUD=
PINECONE_REGION=

The whole code for the chatbot lives in a single file app.py, which you can browse at Documentation - RAG application.

Beyond the initial imports, here is how the Python code for this chatbot is structured.

First, we create a Pinecone vector index. This vector database is used to store the documents and the embeddings of the data that you want your chatbot to use. For example, you can embed and store pages of a technical product documentation.

Check out the Embed the documentation section to populate your Pinecone index with your own documents.

pinecone_client = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
pinecone_spec = ServerlessSpec(
    cloud=os.environ.get("PINECONE_CLOUD"),
    region=os.environ.get("PINECONE_REGION"),
)

def create_pinecone_index(name, client, spec):
    """
    Create a Pinecone index if it does not already exist.
    """
    if name not in client.list_indexes().names():
        client.create_index(name, dimension=1536, metric="cosine", spec=spec)
    return pinecone_client.Index(name)

pinecone_index = create_pinecone_index(
    "literal-rag-index", pinecone_client, pinecone_spec
)

We then use a prompt defined from a file - or fetched from Literal AI. The prompt definition includes the name, template_messages, tools and settings. Here is what the prompt definition looks like:

{
  "name": "RAG prompt - Tooled",
  "template_messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant identifying questions related to Literal AI documentation. Common concepts in the Literal AI doc include: Observability, Prompt, Monitoring, Thread, Step, Run, Generation, Tag, Score, User, Evaluation, Dataset, Experiment, Projects and Roles. Keep it short, and if available prefer responding with code."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_relevant_documentation_chunks",
        "description": "Gets the relevant documentation chunks for a given question or query.",
        "parameters": {
          "type": "object",
          "properties": {
            "question": {
              "type": "string",
              "description": "The question or query to answer on the documentation."
            },
            "top_k": {
              "type": "integer",
              "description": "The number of documentation chunks to retrieve, e.g. 5."
            }
          },
          "required": ["question"]
        }
      }
    }
  ],
  "settings": {
    "provider": "openai",
    "stop": [],
    "model": "gpt-4-turbo-preview",
    "top_p": 1,
    "max_tokens": 512,
    "temperature": 0,
    "presence_penalty": 0,
    "frequency_penalty": 0
  }
}

We initialize a user session with the prompt. When a user question comes in, we trigger our RAG agent logic, which we will study in the next step.

prompt_path = os.path.join(os.getcwd(), "app/prompts/rag.json")

# We load the RAG prompt in Literal AI to track prompt iteration and
# enable LLM replays from Literal AI.
with open(prompt_path, "r") as f:
    rag_prompt = json.load(f)

    prompt = client.api.get_or_create_prompt(
        name=rag_prompt["name"],
        template_messages=rag_prompt["template_messages"],
        settings=rag_prompt["settings"],
        tools=rag_prompt["tools"],
    )

@cl.on_chat_start
async def on_chat_start():
    """
    Send a welcome message and set up the initial user session on chat start.
    """
    await cl.Message(
        content="Welcome, please ask me anything about the Literal AI documentation!",
        disable_feedback=True
    ).send()

    cl.user_session.set("messages", prompt.format_messages())
    cl.user_session.set("settings", prompt.settings)
    cl.user_session.set("tools", prompt.tools)

@cl.on_message
async def main(message: cl.Message):
    """
    Main message handler for incoming user messages.
    """
    await rag_agent(message.content)

Vanilla RAG pipelines usually take in a user’s query and immediately use it to retrieve the relevant contexts from their vector database.

To highlight the typical flow of a more general LLM-based application (agent), we chose to instead perform a first LLM call with mention of the retrieval tool. We thus let the “planner” decide whether to make use of the retrieval: if the question is a general purpose, we don’t need to query the vector database.

async def llm_tool(question):
    """
    Generate a response from the LLM based on the user's question.
    """
    messages = cl.user_session.get("messages", []) or []
    messages.append({"role": "user", "content": question})

    settings = cl.user_session.get("settings", {}) or {}
    settings["tools"] = cl.user_session.get("tools")
    settings["tool_choice"] = "auto"

    response = await openai_client.chat.completions.create(
        messages=messages,
        **settings,
    )

    response_message = response.choices[0].message
    messages.append(response_message)
    return response_message

The relevant documents from the vector store are then retrieved and used as context to answer the user query.

@cl.step(name="Embed", type="embedding")
async def embed(question, model="text-embedding-ada-002"):
    """
    Embed a question using the specified model and return the embedding.
    """
    embedding = await openai_client.embeddings.create(input=question, model=model)

    return embedding.data[0].embedding


@cl.step(name="Retrieve", type="retrieval")
async def retrieve(embedding, top_k):
    """
    Retrieve top_k closest vectors from the Pinecone index using the provided embedding.
    """
    if pinecone_index == None:
        raise Exception("Pinecone index not initialized")
    response = pinecone_index.query(
        vector=embedding, top_k=top_k, include_metadata=True
    )
    return response.to_dict()


@cl.step(name="Retrieval", type="tool")
async def get_relevant_documentation_chunks(question, top_k=5):
    """
    Retrieve relevant documentation chunks based on the question embedding.
    """
    embedding = await embed(question)

    retrieved_chunks = await retrieve(embedding, top_k)

    return [match["metadata"]["text"] for match in retrieved_chunks["matches"]]

Based on the tool results - the relevant contexts retrieved from Pinecone - we let an LLM generate an answer to the user query.

async def llm_answer(tool_results):
    """
    Generate an answer from the LLM based on the results of tool calls.
    """
    messages = cl.user_session.get("messages", []) or []
    messages.extend(tool_results)

    settings = cl.user_session.get("settings", {}) or {}
    settings["stream"] = True

    stream = await openai_client.chat.completions.create(
        messages=messages,
        **settings,
    )

    curr_step = cl.context.current_step
    if not curr_step:
        return ""

    async for part in stream:
        if token := part.choices[0].delta.content or "":
            await curr_step.stream_token(token)
    await curr_step.update()

    messages.append({"role": "assistant", "content": curr_step.output})
    return curr_step.output

Note how the different steps of the RAG pipeline are using Literal AI step decorators: @cl.step(name="", type=""). This is done in order to follow the steps in a thread in Literal AI.

2. Log the chat conversations

If you are using Chainlit (as in the example above), logging of threads is automatically done. If you prefer using another method, like FastAPI, Flask or TypeScript, you can use the Literal AI SDK. For Python, use:

with client.thread() as thread:
    # do something

The threads or conversations will become visible in the Literal AI UI.

3. Run a few manual sample iterations and feedback

For initial testing purposes, you can manually run a few questions to your chatbot, in order to get familiar with the process and results. The results and steps that the chatbot took in order to come to an answer are visible in the Literal AI UI, under “Threads”. For a real application, you want an extensive test, but it can be useful to get a first glance of the model.

Users can give feedback in the chatbot application to indicate how happy they are with the result. This gives you an indication on how well the model behaves.

As administrator, you can also give feedback on the chatbot’s answers in the threads that already happened.

4. Create a Dataset

In order to test the chatbot’s model, you need to create a dataset. First, decide on the evaluation criteria that you want to test. RAGAS defines various metrics that can be tested in isolation, such as faithfulness, answer relevancy, context recall and answer correctness. In this tutorial, we are going to calculate faithfulness ans answer correctness. Faithfullness measures the factual consistency of the answer given the context. Faithfullness is measured by dividing number of truthful claims that can be inferred from the given context by the total number of claims in the chatbot’s generated answer. Answer Correctness measures the accuracy of the generated answer compared to the ground truth.

The dataset for this experiment will consist of examples that we add by hand. The examples have an input and an expected_output. The input is the user query, and the expected_output is the expected answer from the chatbot.

# Create Dataset in Literal AI
dataset = literal_client.api.create_dataset(
    name="TestDataset", description="A dataset to store samples.", metadata={ "isDemo": True }
)

# The items with ground truth to add to the dataset
items = [
  {
    "input": "What is Literal AI?", 
    "expected_output": "Literal AI is a platform for LLM observability, evaluation and prompt collaboration.",
  }, {
    "input": "What does Literal AI offer?", 
    "expected_output": "The platform has a UI and a Python and TypeScript client.",
  }
]

# Add items to Dataset in Literal AI
for item in items:
    dataset.create_item(
      input={ "content": item["input"] },
      expected_output={ "content": item["expected_output"] }
    )

Note that for a serious evaluation you would want more than two examples to evaluate.

5. Run chatbot on dataset items

In order to evaluate the chatbot, we need to run the test cases that we saved in the dataset using the LLM model of the chatbot. We will add the chatbot’s answers and contexts to the item list. Run the input questions that you put in the test dataset on the chatbot. In this example, Chainlit is used. You can open the Chainlit chatbot and ask your questions there, in two separate chats. These threads will be visible in the Literal AI UI, under “Threads”.

Test Threads

Test Thread

Next, you want to add the chatbot’s answers and contexts to the list of test items that we created in the previous step. In the next step we can use this list to evaluate using Ragas. Make sure to use the Step ID from the LLM step in the Thread, and Step ID from the Retrieve step in the Thread.

import ast 
# Replace with your step IDs
llm_steps = ["b5f69f7e-a9d2-4a6f-8b1a-61e8a2c0a374", "b165f8c2-ee08-44ab-9b3a-ae05a669ccaa"]
context_steps = ["3c848131-bb52-4f38-85c1-afdc1422e748", "2d96877d-4beb-4337-978b-f8c19daf3527"]

# Add answers to the item list
for idx, llm_step in enumerate(llm_steps):
    llm_step_data = literal_client.api.get_step(llm_step)
    items[idx]["answer"] = llm_step_data.output["content"]

    contexts = []
    contexts_data = literal_client.api.get_step(context_steps[idx])
    contexts_data_json = ast.literal_eval(contexts_data.output["content"])["matches"]
    for context in contexts_data_json:
        contexts.append(context["id"])
    items[idx]["context"] = contexts

6. Evaluate items

Next, we can evaluate the two generated answers on faithfulness and answer correctness. For evaluation, we use Ragas, a Python framework to evaluation RAG pipelines. Note that this evaluation is done outside of Literal AI (offline).

from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness
from datasets import Dataset

test_dataset = {
    'question': [item["input"] for item in items],
    'answer': [item["answer"] for item in items],
    'contexts': [item["context"] for item in items],
    'ground_truth': [item["expected_output"] for item in items]
}

result = evaluate(
    Dataset.from_dict(test_dataset),
    metrics=[
        faithfulness, 
        answer_correctness],
)

df = result.to_pandas()
print(df.head())

5. Analyze results

Finally, you can analyze the results of the evaluation. The results will be shown in the terminal.

Evaluation result using RAGAS

You can see that both answers have the highest score for faithfulness, meaning that the chatbot’s answers are factually consistent given the context. The answer correctness is also high for the first results, but low for the second result. This corresponds to human evaluation if we look at the examples we created. Based on this result, you can decide to improve the chatbot’s model or prompt.

You can also see the experiment results in Literal AI, if you upload the experiment.

experiment = dataset.create_experiment(
    name="Test Experiment",
    prompt_id= literal_client.api.get_prompt(name="RAG prompt").id,
    params=[{ "type": faithfulness.name, "top_k": 1 }, { "type": answer_correctness.name, "top_k": 1 }]
)

# Log each experiment result.
for index, row in df.iterrows():
    scores = [{ 
        "name": faithfulness.name,
        "type": "AI",
        "value": row[faithfulness.name]
    }, { 
        "name": answer_correctness.name,
        "type": "AI",
        "value": row[answer_correctness.name]
    }]

    experiment_item = {
        "datasetItemId": dataset.items[index].id,
        "scores": scores,
        "input": { "question": row["question"] },
        "output": { "contexts": row["contexts"].tolist() }
    }
    
    experiment.log(experiment_item)

Evaluation result using RAGAS in Literal AI

Overview

Python Tutorials

Typescript Tutorials

1. Build a RAG chatbot with Chainlit

2. Log the chat conversations

3. Run a few manual sample iterations and feedback

4. Create a Dataset

5. Run chatbot on dataset items

6. Evaluate items

5. Analyze results

Overview

Python Tutorials

Typescript Tutorials

​1. Build a RAG chatbot with Chainlit

​2. Log the chat conversations

​3. Run a few manual sample iterations and feedback

​4. Create a Dataset

​5. Run chatbot on dataset items

​6. Evaluate items

​5. Analyze results

1. Build a RAG chatbot with Chainlit

2. Log the chat conversations

3. Run a few manual sample iterations and feedback

4. Create a Dataset

5. Run chatbot on dataset items

6. Evaluate items

5. Analyze results