Retrieval-Augmented Generation: RAG systems and data in depth.

December 30, 2024

Systematically Improving Your RAG

This article explains how to make Retrieval-Augmented Generation (RAG) systems better. It's based on research and hands-on experience, and builds on other articles I've written about RAG.

Overview

By the end of this post, you'll understand my step-by-step approach to making RAG applications better. Key areas include:

  1. Making synthetic questions to test system performance.
  2. Combining full-text search and vector search for optimal results.
  3. Setting up feedback mechanisms.
  4. Grouping and analyzing problematic queries.
  5. Building targeted systems to improve capabilities.
  6. Iterative testing with real-world data.

Let’s dive in!

Start with Synthetic Data

The biggest mistake when improving RAG systems is spending too much time on answer generation without first verifying retrieval accuracy.

Steps:

  1. Generate synthetic questions for each text chunk in your database.
  2. Use these questions to test retrieval.
  3. Calculate precision and recall scores to establish a baseline.
  4. Identify areas for improvement based on baseline scores.
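
The evaluation loop above can be sketched in a few lines. This is a minimal sketch, not a prescribed implementation: `toy_retrieve` is a stand-in for your real search function, and the assumption is that each chunk has one synthetic question whose "correct" answer is the chunk it was generated from.

```python
# Hypothetical sketch: each chunk gets one synthetic question, and we check
# whether retrieval returns the source chunk in the top-k results.

def recall_at_k(questions: dict[str, str], retrieve, k: int = 5) -> float:
    """questions maps chunk_id -> synthetic question for that chunk."""
    hits = 0
    for chunk_id, question in questions.items():
        results = retrieve(question)[:k]  # ranked chunk ids
        hits += chunk_id in results
    return hits / len(questions)

# Toy retriever: returns candidate chunk ids in a fixed order.
def toy_retrieve(question: str) -> list[str]:
    return ["c1", "c2", "c3"]

baseline = recall_at_k({"c1": "What is X?", "c3": "When was Y?"}, toy_retrieve, k=2)
print(baseline)  # c1 is in the top 2, c3 is not -> 0.5
```

Run this over every chunk in your database and the resulting number is your baseline; every later experiment gets compared against it.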

Key Insight

On synthetic data, you should be able to achieve roughly 97% recall and precision. For example:

  • Full-text search performs as well as embeddings for essays but is ~10x faster.
  • For repository issues, recall rates:
    • Full-text: ~55%
    • Embeddings: ~65%

Use these benchmarks to refine experiments and identify challenges.

Utilize Metadata

Metadata improves search accuracy by enhancing query understanding.

Steps:

  1. Extract relevant metadata (e.g., dates, file names, ownership).
  2. Include metadata in your search indexes.
  3. Expand user queries with metadata for better results.
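
The query-expansion step can be sketched as below. This is an illustrative toy, not a production parser: the keyword rule and the "one week" interpretation of "latest" are assumptions you would tune for your own domain.

```python
import re
from datetime import date, timedelta

def expand_query(query: str, today: date) -> dict:
    """Turn a free-text query into text plus structured metadata filters."""
    filters = {}
    if re.search(r"\blatest\b|\brecent\b", query, re.IGNORECASE):
        # Assumption: "latest" means the last 7 days.
        filters["date_from"] = (today - timedelta(days=7)).isoformat()
        filters["date_to"] = today.isoformat()
    return {"text": query, "filters": filters}

expanded = expand_query("What is the latest pricing doc?", date(2024, 12, 30))
print(expanded["filters"])  # {'date_from': '2024-12-23', 'date_to': '2024-12-30'}
```

In practice an LLM would do this extraction, but the output shape is the same: query text plus filters your search index can actually apply.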

Example

A query like "What is the latest X?" requires query understanding to extract date ranges. Without metadata, neither full-text nor semantic search can provide accurate answers.

Combine Full-Text Search and Vector Search

Combining both methods offers the best retrieval results.

Steps:

  1. Implement full-text search and vector search.
  2. Test performance on your use case.
  3. Use a single database system to avoid synchronization issues.

Insights

  • Full-text search: Faster but less recall.
  • Vector search: Higher recall but slower.
  • Use tools supporting both in a single object for seamless integration.
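
One common way to merge the two rankings is reciprocal rank fusion; this is an assumption on my part, not something the steps above mandate, but it is a reasonable default when you have two ranked lists and no tuned weights.

```python
# Reciprocal rank fusion: a document's fused score is the sum of
# 1 / (k + rank) across every ranking it appears in.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

full_text = ["d1", "d2", "d3"]
vector = ["d3", "d1", "d4"]
print(rrf([full_text, vector]))  # ['d1', 'd3', 'd2', 'd4']
```

Documents ranked highly by both methods float to the top, which is exactly the behavior you want from a hybrid index.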

Implement Clear User Feedback Mechanisms

Feedback is crucial for identifying and prioritizing improvements.

Steps:

  1. Add feedback mechanisms (e.g., thumbs up/down).
  2. Ensure feedback prompts are clear (e.g., "Did we answer your question correctly?").
  3. Use feedback data to build evaluation datasets and prioritize fixes.
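
A minimal sketch of what a feedback event could look like; the field names here are illustrative, not a fixed schema. The key design choice is logging the exact prompt the user saw, so later analysis knows what the thumbs up/down actually measured.

```python
import json
from datetime import datetime, timezone

def feedback_event(query: str, answer_id: str, thumbs_up: bool) -> str:
    """Serialize one feedback event as a JSON log line."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "answer_id": answer_id,
        "label": "positive" if thumbs_up else "negative",
        # Record the exact prompt shown, to avoid confounding later.
        "prompt": "Did we answer your question correctly?",
    }
    return json.dumps(event)

line = feedback_event("reset my password", "ans_123", thumbs_up=False)
print(line)
```

Each line becomes a labeled example: negative events feed your evaluation datasets, positive ones can later become few-shot examples or training data.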

Key Takeaway

Explicit prompts reduce confounding variables and improve data quality.

Cluster and Model Topics

Analyze queries and feedback to identify patterns and prioritize improvements.

Steps:

  1. Group queries into topic clusters and identify capability gaps.
  2. Use clusters to prioritize updates and feature development.
  3. Tag incoming data with topics and capabilities for better analysis.
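
The tagging step can start much simpler than embedding-based clustering. The sketch below uses naive keyword rules with made-up topic names, just to show the shape of the analysis: tag each query, count the clusters, invest where the counts are largest.

```python
from collections import Counter

# Illustrative topic rules; in practice these would come from clustering
# or an LLM classifier, not a hand-written dictionary.
TOPIC_KEYWORDS = {
    "troubleshooting": ["error", "fail", "broken", "fix"],
    "features": ["feature", "release", "new", "latest"],
    "billing": ["invoice", "price", "refund"],
}

def tag_query(query: str) -> str:
    q = query.lower()
    for topic, words in TOPIC_KEYWORDS.items():
        if any(w in q for w in words):  # naive substring match
            return topic
    return "other"

queries = [
    "why does export fail with error 500",
    "what's in the latest release",
    "refund for duplicate invoice",
    "how do I fix a broken sync",
]
counts = Counter(tag_query(q) for q in queries)
print(counts.most_common())  # troubleshooting is the biggest cluster
```

Even a crude tagger like this surfaces which capability gaps dominate your traffic before you invest in anything fancier.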

Example

  • Topic Clusters: Users frequently ask about updated product features.
  • Capability Gaps: Users want troubleshooting steps or error explanations.

Continuously Monitor and Experiment

Regular monitoring and experimentation are essential for ongoing improvement.

Steps:

  1. Set up logging to track performance trends.
  2. Design experiments to test potential changes.
  3. Measure impact on precision, recall, and other metrics.
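
Measuring precision and recall per query is the core of step 3; a minimal sketch:

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
print(p, r)  # 0.5 0.666...
```

Log these per query and per experiment arm, and trends (or regressions) show up as soon as they happen rather than when a customer complains.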

Example

When onboarding new clients, monitor shifts in query distributions and prioritize new capabilities as needed.

Balance Latency and Performance

Make informed trade-offs based on use case requirements.

Steps:

  1. Measure latency vs. recall impact.
  2. Prioritize based on user needs (e.g., medical diagnostics vs. general search).
  3. Adjust configurations accordingly.
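
To make the trade-off concrete, benchmark each configuration on both axes at once. This is a toy harness (the lambda retriever is obviously fake); the point is the output shape: one recall number and one latency number per configuration, so they can be compared side by side.

```python
import time

def benchmark(retrieve, queries, relevant_by_query, k=5):
    """Return recall and average latency for one retrieval configuration."""
    start = time.perf_counter()
    hits = 0
    for q in queries:
        results = retrieve(q)[:k]
        hits += any(doc in relevant_by_query[q] for doc in results)
    elapsed_ms = (time.perf_counter() - start) * 1000 / len(queries)
    return {"recall": hits / len(queries), "avg_latency_ms": elapsed_ms}

stats = benchmark(lambda q: ["d1", "d2"], ["q1"], {"q1": {"d2"}})
print(stats["recall"])  # 1.0
```

With both numbers on one table, "is 1% more recall worth 300 ms?" becomes a product decision instead of a guess.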

Key Insight

A 1% recall improvement may be worthwhile for high-stakes applications but not for general search.

Wrapping Up

This guide provides actionable steps to improve RAG systems incrementally. For more details or specific implementation advice, leave a comment or reach out!

Explore different levels of RAG complexity, from basic techniques to advanced methods. Topics include:

  • Basic text processing and embeddings.
  • Efficient data storage and retrieval.
  • Advanced search algorithms.
  • Observability and evaluation strategies.

For a detailed breakdown, check out our article on RAG flywheels and their role in continuous system improvement.

How to build a terrible RAG system

If you've followed my work on RAG systems, you'll know I emphasize treating them as recommendation systems at their core. In this post, we'll explore the concept of inverted thinking to tackle the challenge of building an exceptional RAG system.

What is inverted thinking?

Inverted thinking is a problem-solving approach that flips the perspective. Instead of asking, "How can I build a great RAG system?", we ask, "How could I create the worst possible RAG system?" By identifying potential pitfalls, we can more effectively avoid them and build towards excellence.

This approach aligns with our broader discussion on RAG systems, which you can explore further in our RAG flywheel article and our comprehensive guide on Levels of Complexity in RAG Applications.

If you want to learn more about how I systematically improve RAG applications, check out my free six-email RAG crash course.

Inventory

You'll often see me use the term inventory to refer to the set of documents that we're searching over. I picked it up from the e-commerce world, and I like it because it's more general than corpus and more specific than collection. It can refer to the documents we're searching over, the products we're selling, or the items we're recommending.

Don't worry about latency

There must be a reason that ChatGPT streams text out, but we should ignore it. Instead, we should only show the results once the entire response is completed. Many e-commerce websites have found that a 100 ms improvement in latency can increase revenue by 1%. Check out How One Second Could Cost Amazon $1.6 Billion In Sales.

Don't show intermediate results

Users love staring at a blank screen. It's a great way to build anticipation. If we communicated intermediate steps like the ones listed below, we'd just be giving away the secret sauce, and users prefer to be left in the dark about what's going on.

  • Understanding your question
  • Searching with "..."
  • Finding the answer
  • Generating response

Don't Show Them the Source Document

Never show the source documents, and never highlight the origin of the text used to generate the response. Users should never have to fact-check our sources or verify the accuracy of the response. We should assume that they trust us and that there is no risk of false statements.

We Should Not Worry About Churn

We are not building a platform; we are just developing a machine learning system to gather metrics. Instead of focusing on churn, we should concentrate on the local metrics of our machine learning system like AUC and focus on benchmarks on HuggingFace.

We Should Use a Generic Search Index

Rather than asking users or trying to understand the types of queries they make, we should stick with a generic search and not allow users to generate more specific queries. There is no reason for Amazon to enable filtering by stars, price, or brand. It would be a waste of time! Google should not separate queries into web, images, maps, shopping, news, videos, books, and flights. There should be a single search bar, and we should assume that users will find what they're looking for.

We Should Not Develop Custom UI

It doesn't make sense to build a specific weather widget when the user asks for weather information. Instead, we should display the most relevant information. Semantic search is flawless and can effectively handle location or time-based queries. It can also re-rank the results to ensure relevance.

We Should Not Fine-Tune Our Embeddings

A company like Netflix should have a generic movie embedding that can be used to recommend movies to people. There's no need to rely on individual preferences (likes or dislikes) to improve the user or movie embeddings. Generic embeddings that perform well on benchmarks are sufficient for building a product.

We Should Train an LLM

Running inference on a large language model locally, which scales well, is cost-effective and efficient. There's no reason to depend on OpenAI for this task. Instead, we should consider hiring someone and paying them $250k a year to figure out scaling and running inference on a large language model. OpenAI does not offer any additional convenience or ease of use. By doing this, we can save money on labor costs.

We Should Not Manually Curate Our Inventory

There's no need for manual curation of our inventory. Instead, we can use a generic search index and assume that the documents we have are relevant to the user's query. Netflix should not have to manually curate the movies they offer or add additional metadata like actors and actresses to determine which thumbnails to show for improving click rates. The content ingested on day one is sufficient to create a great recommendation system.

We Should Not Analyze Inbound Queries

Analyzing the best and worst performing queries over time or understanding how different user cohorts ask questions will not provide any valuable insights. Looking at the data itself will not help us generate new ideas to improve specific segments of our recommendation system. Instead, we should focus on improving the recommendation system as a whole and avoid specialization.

Imagine if Netflix observed that people were searching for "movies with Will Smith" and decided to add a feature that allows users to search for movies with Will Smith. That would be a waste of time. There's no need to analyze the data and make system improvements based on such observations.

Machine Learning Engineers Should Not Be Involved in Ingestion

Machine Learning Engineers (MLEs) do not gain valuable insights by examining the data source or consulting domain experts. Their role should be limited to working with the given features. There's no way that MLEs who love music would do a better job at Spotify, or that an MLE who loves movies would do a better job at Netflix. Their only job is to take in data and make predictions.

We Should Use a Knowledge Graph

Our problem is so unique that it cannot be handled by a search index and a relational database. It is unnecessary to perform 1-2 left joins to answer a single question. Instead, considering the trending popularity of knowledge graphs on Twitter, it might be worth exploring the use of a knowledge graph for our specific case.

We should treat all inbound inventory the same

There's no need to understand the different types of documents that we're ingesting. How different could marketing content, construction documents, and energy bills be? Just because some have images, some have tables, and some have text doesn't mean we should treat them differently. It's all text, and so an LLM should just be able to handle it.

We should not have to build special ingestion pipelines

GPT-4 has solved all of data processing, so if it can handle a photo album, a PDF, and a Word doc, it should be able to handle any type of document. There's no need to build special ingestion pipelines for different types of documents. We should just assume that the LLM will be able to handle it. I shouldn't even have to think about what kinds of questions I need to answer; I should just be able to ask it anything and it should be able to answer it.

We should never have to ask the data provider for clean data

If Universal Studios gave Netflix a bunch of MOV files with no metadata, Netflix should not have to ask Universal Studios to provide additional movie metadata. Universal might not know the runtime or the cast list, and it's Netflix's job to figure that out. Universal should not have to provide any additional information about the movies they're providing.

We should never have to cluster our inventory

There's only one kind of inventory and one kind of question. We should just assume that the LLM will be able to handle it, and I shouldn't even have to think about what kinds of questions I need to answer. Topic clustering would only show us how uniform our inventory is and how little variation there is in the types of questions that users ask.

We should focus on local evals and not A/B tests

Once we run our GPT-4 self-critique evaluations, we'll know how well our system is doing and it'll make us more money. We should spend most of our time writing evaluation prompts, measuring precision and recall, and just launching the best one. A/B tests are a waste of time, and we should just assume that the best-performing prompt will produce the best business outcome.

10 Ways to Be Data Illiterate and How to Avoid Them

Data literacy is an essential skill in today's data-driven world. As AI engineers, understanding how to properly handle, analyze, and interpret data can make the difference between success and failure in our projects. In this post, we will explore ten common pitfalls that lead to data illiteracy and provide actionable strategies to avoid them. By becoming aware of these mistakes and learning how to address them, you can enhance your data literacy and ensure your work is both accurate and impactful. Let's dive in and discover how to navigate the complexities of data with confidence and competence.

Ignoring Data Quality

Data quality is the foundation upon which all analyses and models are built. Failing to assess and address issues like missing values, outliers, and inconsistencies can lead to unreliable insights and poor model performance. Data literate AI engineers must prioritize data quality to ensure their work is accurate and trustworthy. Inversion: Assess and address data quality issues before analyzing data or building models.

  • Conduct exploratory data analysis (EDA) to identify potential quality issues
  • Develop and implement data cleaning and preprocessing pipelines
  • Establish data quality metrics and monitor them regularly

Not Visualizing the Data

Not visualizing your data can lead to missed insights, poor understanding of patterns and relationships, and poor communication of findings to others. AI engineers must learn the basics of visualizing data to better understand it, grok it, and communicate it. Inversion: Learn how to visualize data to explore, understand, and communicate the data.

  • Start with basic visualizations, such as histograms and box plots to understand distributions
  • Then, consider advanced techniques such as PCA or t-SNE to discover complex patterns
  • Don't let the visual hang on its own—provide a logical narrative to guide the reader through it.

Only Relying on Aggregate Statistics

Aggregate statistics such as mean and median can obscure important patterns, outliers, and subgroup differences within the data. AI engineers should understand the limitations of summary statistics lest they fall to Simpson's paradox. Inversion: Dive deeper into the data by examining distributions, subgroups, and individual observations, in addition to aggregate statistics.

  • Consider statistics such as standard deviation, median vs. mean, and quantiles to get a sense of the data
  • Use histograms and density plots to identify skewness, multimodality, and potential outliers
  • Combine insights from aggregate statistics, distributions, and subgroups to develop an understanding of the data
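
A tiny worked example of the mean/median gap, using made-up skewed data: one outlier drags the mean far from the typical value, which a single aggregate number would hide.

```python
import statistics

# Toy data: mostly small values plus one large outlier.
data = [1, 2, 2, 3, 3, 3, 4, 100]

mean = statistics.mean(data)      # pulled way up by the outlier
median = statistics.median(data)  # closer to the typical value
quartiles = statistics.quantiles(data, n=4)

print(mean, median, quartiles)  # 14.75 3.0 [...]
```

A mean of 14.75 next to a median of 3.0 is exactly the kind of discrepancy that tells you to go look at the distribution, not the summary.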

Lack of Domain Understanding

Analyzing data without sufficient context can result in misinterpretations and irrelevant or impractical insights. AI engineers must develop a deep understanding of the domain they are working in to ensure their analyses and models are meaningful and applicable to real-world problems. Inversion: Develop a strong understanding of the domain and stakeholders before working with data.

  • Engage with domain experts and stakeholders to learn about their challenges and goals
  • Read relevant literature and attend industry conferences to stay up-to-date on domain trends
  • Participate in domain-specific projects and initiatives to gain hands-on experience

Improper Testing Splits

Inappropriately splitting data can lead to biased or overly optimistic evaluations of model performance. Data literate AI engineers must use appropriate techniques like stratification and cross-validation to ensure their models are properly evaluated and generalizable. Inversion: Use appropriate data splitting techniques to ensure unbiased and reliable model evaluations.

  • Use stratified sampling to ensure balanced representation of key variables in train/test splits
  • Employ cross-validation techniques to assess model performance across multiple subsets of data
  • Consider time-based splitting for time-series data to avoid leakage and ensure temporal validity
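
The time-based split from the last bullet can be sketched as follows; the row schema is made up, but the invariant is the point: the model never trains on data from after the cutoff.

```python
from datetime import date

def time_split(rows: list[dict], cutoff: date):
    """Everything before the cutoff trains; the rest tests."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

rows = [
    {"id": 1, "date": date(2024, 1, 5)},
    {"id": 2, "date": date(2024, 3, 1)},
    {"id": 3, "date": date(2024, 6, 10)},
]
train, test = time_split(rows, cutoff=date(2024, 4, 1))
print([r["id"] for r in train], [r["id"] for r in test])  # [1, 2] [3]
```

A random shuffle on the same rows would happily leak June data into the training set, which is precisely the overly optimistic evaluation the section warns about.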

Disregarding Data Drift

Ignoring changes in data distribution over time can cause models to become less accurate and relevant. AI engineers must be aware of the potential for data drift and take steps to monitor and address it, such as regularly evaluating model performance on new data and updating models as needed. Inversion: Monitor and address data drift to maintain model accuracy and relevance over time.

  • Implement data drift detection methods, such as statistical tests or model-based approaches
  • Establish a schedule for regularly evaluating model performance on new data
  • Develop strategies for updating models, such as retraining or incremental learning, when drift is detected
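
A crude drift check, as a sketch rather than a substitute for proper statistical tests: flag drift when the new batch's mean moves more than a threshold number of reference standard deviations.

```python
import statistics

def drifted(reference: list[float], new: list[float], threshold: float = 3.0) -> bool:
    """True if the new batch's mean is more than `threshold` reference
    standard deviations away from the reference mean."""
    ref_mean = statistics.mean(reference)
    ref_sd = statistics.stdev(reference)
    shift = abs(statistics.mean(new) - ref_mean) / ref_sd
    return shift > threshold

reference = [10.0, 11.0, 9.0, 10.5, 9.5]
print(drifted(reference, [10.2, 9.8, 10.1]))   # False: within normal range
print(drifted(reference, [25.0, 26.0, 24.0]))  # True: distribution has moved
```

Real deployments would use per-feature tests (e.g. Kolmogorov-Smirnov or population stability index), but even this mean-shift heuristic run on a schedule catches gross drift.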

Confusing Correlation with Causation

Mistaking correlations for causal relationships can lead to incorrect conclusions and poor decision-making. Data literate AI engineers must understand the limitations of correlational analyses and use appropriate techniques like experimentation and causal inference to establish causal relationships. Inversion: Understand the difference between correlation and causation, and use appropriate techniques to establish causal relationships.

  • Use directed acyclic graphs (DAGs) to represent and reason about causal relationships
  • Employ techniques like randomized controlled trials (RCTs) or natural experiments to establish causality
  • Apply causal inference methods, such as propensity score matching or instrumental variables, when RCTs are not feasible

Neglecting Data Privacy and Security

Mishandling sensitive data can breach trust, violate regulations, and harm individuals. AI engineers must prioritize data privacy and security, following best practices and regulations to protect sensitive information and maintain trust with stakeholders. Inversion: Prioritize data privacy and security, following best practices and regulations.

  • Familiarize yourself with relevant data privacy regulations, such as GDPR or HIPAA
  • Implement secure data storage and access controls, such as encryption and role-based access
  • Conduct regular privacy impact assessments and security audits to identify and address vulnerabilities

Overfitting Models

Building overly complex models that memorize noise instead of learning generalizable patterns can limit a model's ability to perform well on new data. Data literate AI engineers must use techniques like regularization, cross-validation, and model simplification to prevent overfitting and ensure their models are robust and generalizable. Inversion: Use techniques to prevent overfitting and ensure models are robust and generalizable.

  • Apply regularization techniques, such as L1/L2 regularization or dropout, to constrain model complexity
  • Use cross-validation to assess model performance on unseen data and detect overfitting
  • Consider model simplification techniques, such as feature selection or model compression, to reduce complexity

Unfamiliarity with Evaluation Metrics

Misunderstanding or misusing evaluation metrics can lead to suboptimal model selection and performance. AI engineers must have a deep understanding of various evaluation metrics and their appropriate use cases to ensure they are selecting the best models for their specific problems.

Inversion: Develop a strong understanding of evaluation metrics and their appropriate use cases.

  • Learn about common evaluation metrics, such as accuracy, precision, recall, and F1-score, and their trade-offs.
  • Understand the implications of class imbalance and how it affects metric interpretation.
  • Select evaluation metrics that align with the specific goals and constraints of your problem domain.
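
Computing the core metrics from raw confusion counts makes the class-imbalance point concrete. The counts below are made up: with 990 true negatives added, accuracy would be about 99%, even though recall is only 0.4.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 10 actual positives: we found 4, raised 2 false alarms, missed 6.
p, r, f1 = prf1(tp=4, fp=2, fn=6)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.4 0.5
```

This is why metric choice matters: the same model looks excellent by accuracy and mediocre by recall, and only the problem domain tells you which one to trust.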

Ignoring Sampling Bias

Failing to account for sampling bias can lead to models that perform poorly on underrepresented groups and perpetuate inequalities. Data-literate AI engineers must be aware of potential sampling biases and use techniques like stratified sampling and oversampling to ensure their models are fair and inclusive.

Inversion: Be aware of sampling bias and use techniques to ensure models are fair and inclusive.

  • Analyze the representativeness of your data and identify potential sampling biases.
  • Use stratified sampling to ensure balanced representation of key demographic variables.

Disregarding Interpretability and Explainability

Focusing solely on performance without considering the ability to understand and explain model decisions can limit trust and accountability. AI engineers must prioritize interpretability and explainability, using techniques like feature importance analysis and model-agnostic explanations to ensure their models are transparent and understandable. Inversion: Prioritize interpretability and explainability to ensure models are transparent and understandable.

  • Use interpretable models, such as decision trees or linear models, when appropriate
  • Apply feature importance analysis to understand the key drivers of model predictions
  • Employ model-agnostic explanation techniques, such as SHAP or LIME, to provide insights into individual predictions

By avoiding these ten common pitfalls and embracing their inversions, AI engineers can develop strong data literacy skills and create reliable, effective, and responsible AI systems. Data literacy is an essential competency for AI engineers, enabling them to navigate the complex landscape of data-driven decision-making and model development with confidence and integrity.

Levels of Complexity: RAG Applications

This guide explores different levels of complexity in Retrieval-Augmented Generation (RAG) applications. We'll cover everything from basic ideas to advanced methods, making it useful for beginners and experienced developers alike.

We'll start with the basics, like breaking text into chunks, creating embeddings, and storing data. Then, we'll move on to more complex topics such as improved search methods, creating structured responses, and making systems work better. By the end, you'll know how to build strong RAG systems that can answer tricky questions accurately.

As we explore these topics, we'll use ideas from other resources, like our articles on data flywheels and improving tool retrieval in RAG systems. These ideas will help you understand how to create systems that keep improving themselves, making your product better and keeping users more engaged.

Key topics we'll explore include:

  1. Basic text processing and embedding techniques
  2. Efficient data storage and retrieval methods
  3. Advanced search and ranking algorithms
  4. Asynchronous programming for improved performance
  5. Observability and logging for system monitoring
  6. Evaluation strategies using synthetic and real-world data
  7. Query enhancement and summarization techniques

This guide aligns with the insights from our RAG flywheel article, which emphasizes the importance of continuous improvement in RAG systems through data-driven iterations and user feedback integration.

Level 1: The Basics

Welcome to the foundational level of RAG applications! Here, we'll start with the basics, laying the groundwork for your journey into the realm of Retrieval-Augmented Generation. This level is designed to introduce you to the core concepts and techniques essential for working with RAG models. By the end of this section, you'll have a solid understanding of how to traverse file systems for text generation, chunk and batch text for processing, and interact with embedding APIs. Let's dive in and explore the exciting capabilities of RAG applications together!

  1. Recursively traverse the file system to generate text.
  2. Utilize a generator for text chunking.
  3. Employ a generator to batch requests and asynchronously send them to an embedding API.
  4. Store data in LanceDB.
  5. Implement a CLI for querying, embedding questions, yielding text chunks, and generating responses.
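
Steps 1 through 3 are all generators chained together; a minimal sketch (the paths, chunk size, and batch size are placeholders, and the real embedding call is omitted):

```python
from pathlib import Path
from typing import Iterator

def iter_text_files(root: str) -> Iterator[str]:
    """Recursively yield file contents from the tree under `root`."""
    for path in Path(root).rglob("*.txt"):
        yield path.read_text()

def chunk(text: str, size: int = 200) -> Iterator[str]:
    """Yield fixed-size character chunks (naive, but fine for Level 1)."""
    for i in range(0, len(text), size):
        yield text[i : i + size]

def batched(items, batch_size: int = 8) -> Iterator[list]:
    """Group chunks into batches for the embedding API."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

chunks = list(chunk("a" * 450, size=200))  # 200 + 200 + 50 chars
batches = list(batched(chunks, batch_size=2))
print(len(chunks), len(batches))  # 3 2
```

Because everything is a generator, nothing is held in memory until a batch is actually sent to the embedding API, which is the whole point of this pipeline shape.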

Level 2: More Structured Processing

Here we delve deeper into the world of Retrieval-Augmented Generation (RAG) applications. This level is designed for those who have grasped the basics and are ready to explore more advanced techniques and optimizations. Here, we focus on enhancing the efficiency and effectiveness of our RAG applications through better asynchronous programming, improved chunking strategies, and robust retry mechanisms in processing pipelines.

In the search pipeline, we introduce sophisticated methods such as better ranking algorithms, query expansion and rewriting, and executing parallel queries to elevate the quality and relevance of search results.

Furthermore, the answering pipeline is refined to provide more structured and informative responses, including citing specific text chunks and employing a streaming response model for better interaction.

Processing

  1. Better Asyncio
  2. Better Chunking
  3. Better Retries

Search

  1. Better Ranking (Cohere)
  2. Query Expansion / Rewriting
  3. Parallel Queries

Answering

  1. Citing specific text chunks
  2. Streaming response model for better structure

Level 3: Observability

At Level 3, the focus shifts towards the critical practice of observability. This stage emphasizes the importance of implementing comprehensive logging mechanisms to monitor and measure the multifaceted performance of your application. Establishing robust observability allows you to swiftly pinpoint and address any bottlenecks or issues, ensuring optimal functionality. Below, we outline several key types of logs that are instrumental in achieving this goal.

Expanding on Wide Event Tracking

Wide event tracking means logging one rich, wide record per request rather than many narrow ones. Two kinds of logs are especially valuable:

Log how the queries are being rewritten

When addressing a complaint, we should be able to quickly check whether the query was rewritten correctly. For example, once we found that for queries containing "latest" the date it was selecting was literally the current date, we were able to quickly fix the issue by including few-shot examples that consider "latest" to mean one week or more. We can also use all the positive examples to train a model that does query expansion better.

Log the citations

By logging the citations, we can quickly understand whether the model is citing the correct information and which text chunks are popular, and potentially build a model in the future that can learn which text chunks are more important.
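
A minimal sketch of what one such wide event might look like; the field names are illustrative, not a fixed schema. The design choice is that one record carries the query, its rewrite, the citations, and the outcome, so a single log line answers most debugging questions.

```python
import json

def wide_event(query: str, rewritten: str, citations: list[str],
               latency_ms: float, feedback=None) -> str:
    """Serialize one wide per-request event as a JSON log line."""
    return json.dumps({
        "query": query,
        "rewritten_query": rewritten,
        "citations": citations,
        "latency_ms": latency_ms,
        "feedback": feedback,  # filled in later if the user rates the answer
    })

line = wide_event(
    query="latest pricing changes",
    rewritten="pricing changes from 2024-12-23 to 2024-12-30",
    citations=["chunk_42", "chunk_7"],
    latency_ms=212.5,
)
print(line)
```

Querying these records answers both questions from this section at once: how "latest" got rewritten, and which chunks are being cited most.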