A quarter decade of learnings from scaling RAG to millions of users

May 12, 2025
Jakob Pörschmann
AI · LLM · Technology · RAG

Start here if you are building a RAG system. These are the learnings from a quarter decade of building and scaling RAG architectures, distilled into a reusable design decision blueprint.

I still remember the spark in the audience’s eyes when I demoed my first RAG Q&A app at a developer meetup in early 2023. In my role at Google, I’ve seen RAG architectures develop significantly since then. Over time, I have designed 50+ RAG applications rolled out to approximately 5M users. Of course, every RAG app is unique. Still, a handful of design patterns and decisions come up every time. This article pours them into a reusable RAG design decision blueprint. Stealing is encouraged!

Let’s start with the why

If you are a RAG veteran, feel free to skip this section.

Retrieval Augmented Generation (RAG) is the main method for providing context to language models. RAG increases factuality and allows the model to use knowledge beyond its pre-training. RAG is often automatically equated with vector databases and text embeddings. While embedding-based vector search is one popular method of retrieving context, many other retrieval methods involve no text embeddings at all. Simply put, RAG is any LLM prompt that is augmented with some type of retrieved context. You’re using RAG whenever you ask a language model to answer a question after dynamically pre-selecting the relevant context from a larger knowledge base.

Below is an example of a prototype RAG prompt. That’s it; the core concept of RAG really is that simple!

SYSTEM: You are an intelligent assistant helping to answer questions related to a given knowledge base and provided images. 
Question: {question}

Strictly use ONLY the following pieces of context or the provided image input to answer the question at the end.
Think step-by-step and then answer.
Be specific in your answer and provide examples from the context.

=============
{context}
=============

Do not try to make up an answer:
 - If the answer to the question cannot be determined from the context or the provided image alone, say
 "I cannot determine the answer to that." and explain what information is missing to answer the question.
 - If the context is empty and there is no provided image, just say "I do not know the answer to that."

Question: {question}
Helpful & specific Answer:
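
To make this concrete, here is a minimal, self-contained sketch of the retrieve-augment-generate flow in Python. The toy keyword retriever and the stubbed call_llm function are illustrative placeholders, standing in for a real retrieval system and whichever model client you use:

PROMPT_TEMPLATE = """SYSTEM: You are an intelligent assistant helping to answer
questions related to a given knowledge base.
Question: {question}

Strictly use ONLY the following pieces of context to answer the question at the end.

=============
{context}
=============

If the answer cannot be determined from the context, say
"I cannot determine the answer to that."

Question: {question}
Helpful & specific Answer:"""

def retrieve(question: str, docs: list[str], top_k: int = 3) -> list[str]:
    # Toy retriever: rank documents by word overlap with the question.
    # A real system would use vector search, keyword search, or an API call.
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    # Placeholder for your model client of choice (Gemini, a local model, ...).
    raise NotImplementedError

def answer(question: str, docs: list[str]) -> str:
    context = "\n\n".join(retrieve(question, docs))
    return call_llm(PROMPT_TEMPLATE.format(question=question, context=context))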

Design Decision I: Building a knowledge base

Our first design decision concerns the data populating our knowledge base. When deciding which data sources to include in your knowledge base, the most important questions to consider are:

  • Which knowledge sources do we want to include (can be one or multiple)?
  • Do we manage the knowledge bases ourselves, or are they managed externally and available via e.g. an API?
  • Is the data structured, unstructured, or semi-structured?
  • What is the smallest unit into which we can break the data without losing context (chunking strategy)?

The starting point for every RAG architecture is a list of the knowledge bases that need to be connected. Make this list as exhaustive as possible from the start. Often, we begin by implementing the simplest sources, the ones under our own control. Nevertheless, we should plan ahead and consider which additional data sources could be connected in the future.

Our level of control over a knowledge base often determines its implementation complexity. Take, for example, an external API as a context retrieval source. As it is externally managed, the implementation should be straightforward, but the level of control is low. That limits us, for example, when it comes to improving retrieval performance. If we manage the knowledge base ourselves, we fully control data pre-processing, indexing, and retrieval. On the other hand, the maintenance burden increases significantly.

Next, we need to understand the data format contained in the knowledge base to decide on a pre-processing strategy. The data format (especially structured vs. unstructured or semi-structured) significantly impacts the recommended indexing methods. For example, structured data usually comes with the (user) requirement of asking analytical questions. Answering these questions reliably usually requires keeping the data in a structured format; text embeddings are usually useless here. Unstructured data, by contrast, often comes in long documents. We cannot simply import the raw documents into our knowledge base: when querying our RAG system, we want to trace the retrieved information back to its source as specifically as possible (down to the paragraph or sentence). To accomplish this, we split the original documents into smaller chunks, allowing us to search for relevant chunks instead of whole documents. Depending on the data type, different chunking strategies support effective knowledge transfer.

Popular chunking strategies include (see the sketch after this list):

  • Based on markdown or HTML tags
  • Recursive character splits
  • Fixed token length
  • Row-based (structured data)
  • Frame-by-frame (video data)
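
As an illustration, here is a minimal recursive character splitter in plain Python. It is a sketch of the idea, not a production implementation; real pipelines typically reach for a library splitter (LangChain and LlamaIndex both ship one):

SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarsest to finest

def recursive_split(text: str, max_chars: int = 500, seps=SEPARATORS) -> list[str]:
    # Keep pieces intact as long as they fit; otherwise split on the coarsest
    # separator first and recurse into finer ones, so chunks keep local context.
    if len(text) <= max_chars:
        return [text]
    if not seps:
        # No separators left: hard cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, *rest = seps
    chunks, current = [], ""
    for part in text.split(head):
        candidate = (current + head + part) if current else part
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            chunks.extend(recursive_split(part, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks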

[Image: RAG data pre-processing methods. Overview of the different data types that need varying pre-processing steps and storage systems.]

Please STOP using text embeddings as the universal solution to every problem. Every data format needs specific handling. Text embeddings can be a powerful tool, but applied indiscriminately they act like a wood chipper, shredding any potential insight together with your search result quality.

Design Decision II: Retrieving the right content

After preparing our data, we need to store it. Let’s focus on choosing the right data storage. RAG knowledge bases are often stored in Vector DBs or Graph DBs for unstructured data, and relational DBs, DWHs, or document DBs for structured/semi-structured data. This can extend to any other type of information store you can think of. The indexing method (from step 1) and data store are, of course, highly interdependent.

Your database(s) also determine the retrieval methods available to your RAG system. For example, a Vector DB is exceptional at finding the right chunk from a given knowledge base to answer a specific factual question (via nearest neighbour search). However, for rather abstract or analytical questions that require reasoning across datapoints and documents, you are well advised to build at least parts of your query system on a Knowledge Graph or relational data structure.
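
For intuition, here is what that nearest neighbour lookup boils down to, in a brute-force numpy sketch. The embed stub stands in for whichever embedding model you use; it is an assumption, not a specific API:

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: a real system calls an embedding model here.
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), 768))

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    chunk_vecs = embed(chunks)      # shape: (n_chunks, dim)
    query_vec = embed([query])[0]   # shape: (dim,)
    # Cosine similarity = dot product of L2-normalized vectors.
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = chunk_vecs @ query_vec
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

A real vector DB replaces this brute-force scan with an approximate nearest neighbour index (e.g. HNSW or ScaNN) so the lookup stays fast across millions of chunks.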

Unsurprisingly, these interdependencies make the datastore the most crucial design decision in our planning process. The datastore decision and retrieval method should be based on the following questions:

  • Is our data structured, unstructured, or semi-structured?
  • Which types of user queries do we expect to receive?
  • How many queries per second do we expect to receive?
  • How critical is retrieval latency for our application?
  • What volume of knowledge are we storing per database?
  • Do we include external data, managed by 3rd parties (e.g. via API call)?

We differentiate user questions between extractive and aggregate/analytical questions. Extractive questions ask for a specific piece of knowledge likely contained in the knowledge base. Aggregate and analytical questions require the application to aggregate across documents or datapoints. For example, an extractive question such as “Who won the Nobel Peace Prize in 2025?” is easily answered with text-embedding-based retrieval over unstructured data. However, an aggregate question such as “Which Nobel Peace Prize winners do you know about?” is much harder for such a system to answer correctly. This is mostly because the knowledge base does not contain the answer as a single retrievable fact; instead, the retrieval step needs to reason beyond extracted facts. Similarly, an analytical question such as “Which Nobel Peace Prize winner, still alive, is the oldest?” posed against structured data will be tough to answer using row-based text embeddings. Instead, you will need to connect a Text2SQL system.
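
One pragmatic way to handle this split is a small router that classifies each incoming query and dispatches it to the matching retrieval path. In the sketch below, call_llm, vector_search, and text2sql_answer are hypothetical stand-ins for your own components, and the classification prompt is purely illustrative:

ROUTER_PROMPT = """Classify the user question as EXTRACTIVE (asks for a specific
fact likely stated verbatim in the documents) or ANALYTICAL (requires
aggregating or comparing across many documents or rows).
Question: {question}
Answer with exactly one word: EXTRACTIVE or ANALYTICAL."""

def route_and_answer(question: str) -> str:
    # call_llm, vector_search, text2sql_answer: placeholders for your stack.
    label = call_llm(ROUTER_PROMPT.format(question=question)).strip().upper()
    if label == "ANALYTICAL":
        # e.g. "Which Nobel Peace Prize winner, still alive, is the oldest?"
        return text2sql_answer(question)  # generate SQL, execute it, summarize
    # e.g. "Who won the Nobel Peace Prize in 2014?"
    context = "\n\n".join(vector_search(question, top_k=5))
    return call_llm(f"Answer from this context only:\n{context}\n\nQ: {question}")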

Here is an overview of the data and query types with their respective indexing method:

[Image: Relationship of knowledge data types and query types with their recommended indexing and retrieval methods.]

Moving on to the operational side of building the retrieval system. Like any other database, our weapon of choice must support our RAG application's scale and latency requirements. Depending on the database type, there are plenty of options available.

I’m slightly GCP opinionated, so here is a (non-exhaustive) overview of the GCP vector DB universe:

For structured source data, many DWHs and relational databases offer vector index & hybrid search integrations (see the query sketch after this list):

  • BigQuery: Powerful serverless DWH with text embedding generation & indexing integrated
  • AlloyDB: Highly efficient combination of relational & vector DB
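
To give a flavor of the hybrid relational-plus-vector pattern, here is a sketch against AlloyDB via the pgvector extension it supports. The connection details, table schema, and query vector are illustrative assumptions:

import psycopg2  # AlloyDB speaks the PostgreSQL wire protocol

conn = psycopg2.connect(host="10.0.0.1", dbname="kb", user="rag", password="...")
query_vec = "[0.1, -0.3, 0.7]"  # in practice: the embedded user question

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT doc_id, chunk_text
        FROM chunks
        WHERE published_at > %s            -- plain relational filter ...
        ORDER BY embedding <=> %s::vector  -- ... combined with cosine distance
        LIMIT 5;
        """,
        ("2024-01-01", query_vec),
    )
    for doc_id, chunk_text in cur.fetchall():
        print(doc_id, chunk_text[:80])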

For Knowledge Graph Storage:

In many cases, I’ve seen RAG applications tap into self-managed and public knowledge bases. For example, I’m running a GCP architecture RAG solution that is supposed to access GCP documentation and GitHub repositories for sample code. Major challenge: The GCP documentation is highly dynamic, with numerous daily changes. Continuously indexing all pages to keep my knowledge base up to date would be a significant manual effort. The Google Programmable Search Engine API is a fantastic RAG source for dynamic, public data. A simple API call allows me to run a Google search, even on only a given set of indexed pages (I can define these). The query results can then be included in my RAG summary prompt. Any external API could serve as the knowledge source for your RAG system.
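
Here is a minimal sketch of that retrieval step, using the Custom Search JSON API that backs the Programmable Search Engine. You bring your own API key and search engine ID (cx); the snippet formatting is illustrative:

import requests

API_KEY = "YOUR_API_KEY"      # from your Google Cloud project
ENGINE_ID = "YOUR_ENGINE_ID"  # the cx of your Programmable Search Engine

def search_context(query: str, num: int = 5) -> str:
    # Run a Google search restricted to the pages configured in the engine,
    # then format the results as context for the RAG summary prompt.
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return "\n\n".join(f"{it['title']} ({it['link']})\n{it['snippet']}" for it in items)

print(search_context("Cloud Run GPU support"))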

Knowledge bases need to be built around your knowledge. Not the other way around.

Design Decision III: Generating a meaningful response

After retrieving the right content for a given question, we can focus on answering the user’s query. This step seems simple (if understood as a pure summary prompt). However, as the only user-facing step, it is at least as important as the previous ones. You can have the best retrieval system in the world, but if you fail to generate an answer that creates value for your users, nobody will care. Questions to consider before building the response generation are the following:

  • What type of user interface do you plan to offer (Q&A, conversational, GUI …)?
  • How complex is the average user query?
  • Is our RAG pipeline part of a wider system?

The simplest interface you can give your RAG application is a Q&A search bar. It’s straightforward to display the relevant documents or chunks combined with a summary that answers the user’s question. However, many applications rely on conversational or even graphical user interfaces (GUIs). In both cases, the RAG step might be only one component of a larger processing pipeline or multi-agent system. In many of these cases, user queries are more complex than your RAG design anticipated: questions might be multi-part, requiring research across multiple knowledge bases and more complex output formats.

These complex queries will likely show you the limits of a standard RAG pipeline. Depending on the query complexity and user interface, you must level up your RAG system. For example, a Deep Research (DR) pipeline breaks down the initial user query into smaller, more digestible parts that your RAG pipeline can then research individually. Based on the individual research reports, DR merges the insights into a report customized to your user requirements. Here you can read up on the Deep Research design pattern and how to implement it.
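
In pseudocode-flavored Python, the pattern might look like the following; call_llm and rag_answer are hypothetical stand-ins for your model client and the RAG pipeline described above:

def deep_research(user_query: str) -> str:
    # 1. Break the query into smaller, individually researchable questions.
    plan = call_llm(
        "Break this request into 3-5 self-contained research questions, "
        f"one per line:\n{user_query}"
    )
    sub_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 2. Run the standard RAG pipeline once per sub-question.
    findings = [f"Q: {q}\nA: {rag_answer(q)}" for q in sub_questions]

    # 3. Merge the individual findings into one report for the user.
    return call_llm(
        "Write a report answering the original request, based only on these "
        f"findings.\nRequest: {user_query}\nFindings:\n" + "\n\n".join(findings)
    )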

An agentic loop is another popular option to improve your RAG and/or DR pipeline. Agentic reasoning especially shines when working with many potential knowledge sources, where not every source is relevant for each research question. The agentic loop can iterate over a given question and dynamically decide which knowledge bases to tap into to answer it. The agent could work as a “while loop” with a given exit condition for the research to be considered complete.
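
Sketched as exactly that bounded “while loop”; the source registry, exit condition, and helper functions (reusing the hypothetical components from the earlier sketches) are illustrative assumptions:

SOURCES = {
    "vector_db": lambda q: "\n".join(vector_search(q, top_k=5)),  # internal docs
    "web_search": search_context,                                 # public pages
    "sql": text2sql_answer,                                       # structured data
}

def agentic_answer(question: str, max_steps: int = 5) -> str:
    notes: list[str] = []
    for _ in range(max_steps):  # bounded "while loop"
        decision = call_llm(
            f"Question: {question}\nNotes so far: {notes}\n"
            f"Reply 'DONE: <answer>' if the notes suffice; otherwise reply "
            f"'<source>: <query>' with source one of {list(SOURCES)}."
        )
        if decision.startswith("DONE:"):  # exit condition: research complete
            return decision.removeprefix("DONE:").strip()
        source, _, query = decision.partition(":")
        notes.append(SOURCES[source.strip()](query.strip()))
    # Step budget exhausted: answer with whatever was gathered.
    return call_llm(f"Best-effort answer to '{question}' using notes: {notes}")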

Putting things together

Every RAG system is unique, but following these guidelines will maximize your chances of success in building something useful.

There are no limits to the data you could connect, but starting simple often wins. And for RAG, too, “garbage in, garbage out” is the golden rule to remember when selecting the data to connect.

RAG systems need to be user-centric from the earliest conceptualization phase on. If you don’t think about the types of questions your users are going to ask, you will likely end up building a half-baked product that is quickly forgotten.

Finally, system context matters. A simple RAG pipeline takes you a good part of the way, but once queries and knowledge bases get a little more complex, you will quickly hit its ceiling. Think about how to make questions easy for your system to answer. Deep Research and agentic loops can be helpful here.

Good luck!