In this guide you will learn how to build, monitor, evaluate and improve a conversational application. The chatbot uses Retrieval Augmented Generation (RAG) to retrieve contexts for answer generation.

In combination with Literal AI we will use the following tools :

  • OpenAI’s embeddings and models
  • Chainlit as the basis for our chat application
  • Pinecone as the vector database where we will store the data for our RAG

1. Build a RAG chatbot with Chainlit

First, we pick Pinecone as vector database within our RAG. Another vector store could also be used. Make sure to store your API keys of Literal AI, OpenAI and Pinecone in a .env file as follows. Make sure you have data documents stored in this vector database.

The whole code for the chatbot lives in a single file, which you can browse at Documentation - RAG application.

Beyond the initial imports, here is how the Python code for this chatbot is structured.

  1. First, we create a Pinecone vector index. This vector database is used to store the documents and the embeddings of the data that you want your chatbot to use. For example, you can embed and store pages of a technical product documentation.

    Check out the Embed the documentation section to populate your Pinecone index with your own documents.
  1. We then use a prompt defined from a file - or fetched from Literal AI. The prompt definition includes the name, template_messages, tools and settings. Here is what the prompt definition looks like:

We initialize a user session with the prompt. When a user question comes in, we trigger our RAG agent logic, which we will study in the next step.

  1. Vanilla RAG pipelines usually take in a user’s query and immediately use it to retrieve the relevant contexts from their vector database.

    To highlight the typical flow of a more general LLM-based application (agent), we chose to instead perform a first LLM call with mention of the retrieval tool. We thus let the “planner” decide whether to make use of the retrieval: if the question is a general purpose, we don’t need to query the vector database.
  1. The relevant documents from the vector store are then retrieved and used as context to answer the user query.
  1. Based on the tool results - the relevant contexts retrieved from Pinecone - we let an LLM generate an answer to the user query.

Note how the different steps of the RAG pipeline are using Literal AI step decorators: @cl.step(name="", type=""). This is done in order to follow the steps in a thread in Literal AI.

2. Log the chat conversations

If you are using Chainlit (as in the example above), logging of threads is automatically done. If you prefer using another method, like FastAPI, Flask or TypeScript, you can use the Literal AI SDK. For Python, use:

The threads or conversations will become visible in the Literal AI UI.

3. Run a few manual sample iterations and feedback

For initial testing purposes, you can manually run a few questions to your chatbot, in order to get familiar with the process and results. The results and steps that the chatbot took in order to come to an answer are visible in the Literal AI UI, under “Threads”. For a real application, you want an extensive test, but it can be useful to get a first glance of the model.

Users can give feedback in the chatbot application to indicate how happy they are with the result. This gives you an indication on how well the model behaves.

As administrator, you can also give feedback on the chatbot’s answers in the threads that already happened.

4. Create a Dataset

In order to test the chatbot’s model, you need to create a dataset. First, decide on the evaluation criteria that you want to test. RAGAS defines various metrics that can be tested in isolation, such as faithfulness, answer relevancy, context recall and answer correctness. In this tutorial, we are going to calculate faithfulness ans answer correctness. Faithfullness measures the factual consistency of the answer given the context. Faithfullness is measured by dividing number of truthful claims that can be inferred from the given context by the total number of claims in the chatbot’s generated answer. Answer Correctness measures the accuracy of the generated answer compared to the ground truth.

The dataset for this experiment will consist of examples that we add by hand. The examples have an input and an expected_output. The input is the user query, and the expected_output is the expected answer from the chatbot.

Note that for a serious evaluation you would want more than two examples to evaluate.

5. Run chatbot on dataset items

In order to evaluate the chatbot, we need to run the test cases that we saved in the dataset using the LLM model of the chatbot. We will add the chatbot’s answers and contexts to the item list. Run the input questions that you put in the test dataset on the chatbot. In this example, Chainlit is used. You can open the Chainlit chatbot and ask your questions there, in two separate chats. These threads will be visible in the Literal AI UI, under “Threads”.

Test Threads

Test Threads

Test Thread

Test Thread

Next, you want to add the chatbot’s answers and contexts to the list of test items that we created in the previous step. In the next step we can use this list to evaluate using Ragas. Make sure to use the Step ID from the LLM step in the Thread, and Step ID from the Retrieve step in the Thread.

6. Evaluate items

Next, we can evaluate the two generated answers on faithfulness and answer correctness. For evaluation, we use Ragas, a Python framework to evaluation RAG pipelines. Note that this evaluation is done outside of Literal AI (offline).

5. Analyze results

Finally, you can analyze the results of the evaluation. The results will be shown in the terminal.

Evaluation result using RAGAS

Evaluation result using RAGAS

You can see that both answers have the highest score for faithfulness, meaning that the chatbot’s answers are factually consistent given the context. The answer correctness is also high for the first results, but low for the second result. This corresponds to human evaluation if we look at the examples we created. Based on this result, you can decide to improve the chatbot’s model or prompt.

You can also see the experiment results in Literal AI, if you upload the experiment.

Evaluation result using RAGAS in Literal AI

Evaluation result using RAGAS in Literal AI