
Concepts


Ragtime

Ragtime is an LLMOps framework which allows you to automatically:

  1. evaluate a Retrieval Augmented Generation (RAG) system

  2. compare different RAGs / LLMs

  3. generate Facts for automatic evaluation

Ragtime lets you evaluate long, free-form answers, not only multiple-choice questions or the number of words an answer shares with a baseline.

In Ragtime, a RAG is made of an optional Retriever and a mandatory Large Language Model (LLM).

  • The Retriever takes a question as input and returns one or several chunks, or paragraphs, retrieved from a document knowledge base using the question
  • An LLM is a text-to-text generator that takes as input a prompt, made of a question and optional chunks, and returns an Answer (see the sketch below)
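
To make this structure concrete, here is a minimal Python sketch of a RAG built from an optional Retriever and a mandatory LLM. The class and callable names are illustrative only, not Ragtime's actual API.

```python
# Hypothetical sketch (not Ragtime's API): a RAG is an optional retriever
# plus an LLM; the prompt is the question plus any retrieved chunks.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RAG:
    llm: Callable[[str], str]                               # prompt -> answer
    retriever: Optional[Callable[[str], List[str]]] = None  # question -> chunks

    def answer(self, question: str) -> str:
        chunks = self.retriever(question) if self.retriever else []
        if chunks:
            prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}"
        else:
            prompt = question
        return self.llm(prompt)
```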

Evaluating an LLM is not an easy task, since it can generate an almost unlimited number of answers to a given prompt or question, and many of those answers can be correct while being formulated differently. For example:

 Question: "Why is it important to drink water?"

Here are three correct answers:

Answer 1: Drinking water is essential because our bodies need it to function properly. Water helps regulate body temperature, carry nutrients to cells, and remove waste products. It also keeps our skin healthy and helps us feel energized.

Answer 2: Water is crucial for survival. Our bodies are made up mostly of water, and we lose it constantly through breathing, sweating, and going to the bathroom. Drinking water replaces what we lose, preventing dehydration and keeping our organs functioning properly.

Answer 3: Staying hydrated by drinking water is vital for good health. Water aids digestion, maintains blood volume, lubricates joints, and helps flush out toxins from our system. Without enough water, we can feel tired, get headaches, and our overall health can suffer.
 

The main idea in Ragtime is to evaluate answers returned by a RAG based on Facts. Indeed, it is very difficult to evaluate RAGs and/or LLMs because you cannot define a single "good" answer. An LLM can return many equivalent answers expressed in different ways, so a simple string comparison cannot determine whether an answer is right or wrong. Many proxies have been created, but counting the number of common words, as ROUGE does for instance, is not very precise (see HuggingFace LightEval).

LLM-as-a-Judge is also commonly used (e.g. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" or "Using LLM-as-a-judge for an automated and versatile evaluation"), where an LLM compares the answer returned by another LLM with a ground truth. These approaches, though efficient in some cases, are difficult to verify: when the Judge is asked to return a score, it is not at all clear why this score should be a 3 rather than a 4, for instance.

In Ragtime, the ground truth is not a golden answer but a set of facts which have to be found in the answer to make it correct.

Facts are key ideas defining a correct answer. These facts must contain all the essential elements for an answer to be considered correct.
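
As an illustration, the facts for the example question above could be stored as a simple question/facts pair. The structure below is a hypothetical sketch, not a class from the Ragtime package.

```python
# Illustrative only: a question paired with the facts that any correct
# answer must contain.
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuestionWithFacts:
    question: str
    facts: List[str] = field(default_factory=list)


item = QuestionWithFacts(
    question="Why is it important to drink water?",
    facts=[
        "Water helps regulate body temperature.",
        "Water carries nutrients to cells.",
        "Water helps remove waste products from the body.",
        "Without enough water, the body becomes dehydrated.",
    ],
)
```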

[Figure: ragtime]

Using Ragtime

We begin by writing the questions and their corresponding answers. The answers should be provided by an expert or someone knowledgeable in the subject to ensure that they contain all the necessary elements of a correct response.

Next, we take the database of Q&A pairs and transform it into a database of Q&F pairs (questions and their facts). This new dataset is referred to as the validation set.
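
The sketch below shows one way the Q&A-to-Q&F transformation can be performed with an LLM; the prompt wording and the `llm` callable are assumptions, not Ragtime's actual implementation.

```python
# Assumed flow: ask an LLM to break an expert answer into short,
# self-contained facts, one per line.
from typing import Callable, List


def generate_facts(question: str, expert_answer: str, llm: Callable[[str], str]) -> List[str]:
    prompt = (
        "Based on the expert answer, list the distinct facts that any correct "
        "answer to the question must contain, one per line.\n\n"
        f"Question: {question}\n"
        f"Expert answer: {expert_answer}\n"
    )
    raw = llm(prompt)
    # Each non-empty line of the LLM output is kept as one fact.
    return [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]
```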

The validation set allows us to evaluate models by asking the same questions and using a parent model (typically GPT-4 or Mistral-large) to compare the similarities between each fact and the model's answer.

Each fact is evaluated with one of four possible outcomes: “OK”, “NOT FOUND”, “HALLU” (hallucination), or “EXTRA” (additional information).

These evaluations are used to calculate the model's performance on the validation set, providing insights into the model’s quality on the specific subject represented by the dataset.
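
A minimal sketch of this fact-level evaluation is given below, assuming a `judge` callable that wraps the parent model; the prompt and the scoring (share of facts labelled OK) are simplifying assumptions, not Ragtime's exact prompts or metrics.

```python
# Assumed flow: the judge labels each fact against the candidate answer with
# one of the four outcomes, and the share of OK facts gives a simple score.
from typing import Callable, Dict, List

OUTCOMES = {"OK", "NOT FOUND", "HALLU", "EXTRA"}


def evaluate_facts(facts: List[str], answer: str, judge: Callable[[str], str]) -> Dict:
    labels: List[str] = []
    for fact in facts:
        prompt = (
            "Reply with exactly one of: OK, NOT FOUND, HALLU, EXTRA.\n"
            "OK = the fact is present in the answer, NOT FOUND = it is missing,\n"
            "HALLU = the answer contradicts it, EXTRA = the answer only adds "
            "unrelated information.\n"
            f"Fact: {fact}\n"
            f"Answer: {answer}"
        )
        label = judge(prompt).strip().upper()
        labels.append(label if label in OUTCOMES else "NOT FOUND")
    score = labels.count("OK") / len(facts) if facts else 0.0
    return {"labels": labels, "score": score}
```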

[Figure: scheme]

There is a second evaluation step called "Chunks Evaluation," where we check the presence of the facts in the context provided by the RAG system to the LLM. This step helps identify the source of errors in the answers and provides insights into which part of the system needs improvement.
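
The sketch below illustrates the idea of Chunks Evaluation under the same assumptions: checking each fact against the retrieved chunks separates retrieval errors from generation errors. The `judge` callable and prompt are hypothetical.

```python
# Assumed flow: ask the judge whether each fact appears in the retrieved
# chunks. A fact absent from the chunks points at the Retriever; a fact
# present in the chunks but missing from the answer points at the LLM.
from typing import Callable, List


def evaluate_chunks(facts: List[str], chunks: List[str], judge: Callable[[str], str]) -> List[bool]:
    context = "\n\n".join(chunks)
    results: List[bool] = []
    for fact in facts:
        prompt = (
            "Reply YES or NO. Is the following fact stated or implied in the context?\n"
            f"Fact: {fact}\n"
            f"Context:\n{context}"
        )
        results.append(judge(prompt).strip().upper().startswith("YES"))
    return results
```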

In addition to basic performance statistics, Ragtime offers a detailed evaluation, going beyond the four standard outcomes, by explaining the differences between the facts and the ideas presented in the answers.
