Large language models (LLMs) like ChatGPT and Llama excel at answering questions but are limited to the knowledge they were trained on. They can’t access private data or learn beyond their training cut-off. So, the main question is: how can we extend their knowledge?
The answer lies in retrieval-augmented generation (RAG). Today we will explore the RAG pipeline and demonstrate how to build one using LlamaIndex.
Let’s get started!
Retrieval Augmented Generation: The Basics
LLMs are the most advanced NLP models today, excelling in translation, writing, and general Q&A. However, they struggle with domain-specific queries, often generating hallucinations.
In such cases, only a few documents may contain relevant context per query. To address this, we need a streamlined system that efficiently retrieves and integrates relevant information before generating responses — this is the essence of RAG.
Pre-trained LLMs acquire knowledge through three main approaches, each with limitations:
- Training: Building an LLM from scratch requires training massive neural networks on trillions of tokens, costing hundreds of millions of dollars—making it infeasible for most
- Fine-tuning: This adapts a pre-trained model to new data but is costly in time and resources. Unless there’s a specific need, it’s not always practical
- Prompting: The most accessible approach, prompting inserts new information into an LLM’s context window, enabling it to answer queries based on the provided data. However, since document sizes often exceed context limits, this method alone isn’t enough
RAG overcomes these limitations by efficiently processing, storing, and retrieving relevant document segments at query time. This ensures that LLMs generate more accurate, context-aware responses without requiring expensive retraining or fine-tuning.
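To make that concrete, here is a minimal, library-agnostic sketch of what happens at query time. The embed_fn, vector_store, and llm objects are hypothetical placeholders for the components we will build with LlamaIndex below.
# Minimal sketch of a RAG query loop (hypothetical helpers, for illustration only)
def answer_with_rag(question, embed_fn, vector_store, llm, top_k=3):
    # 1. Embed the user question into the same vector space as the document chunks
    query_vector = embed_fn(question)
    # 2. Retrieve the top-k most similar chunks from the vector store
    chunks = vector_store.search(query_vector, top_k=top_k)
    # 3. Stuff the retrieved chunks into the prompt as context
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    # 4. Let the LLM generate a grounded answer
    return llm(prompt)
The rest of this article maps each of these steps onto concrete LlamaIndex components.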
Key Components of a RAG Pipeline
A RAG system consists of several essential components:
Image by Author
- Text Splitter: Breaks down large documents into smaller chunks that fit within an LLM’s context window
- Embedding Model: Converts text into vector representations, enabling efficient similarity searches
- Vector Store: A specialized database that stores and retrieves document embeddings along with metadata
- LLM: The core language model that generates answers based on retrieved information
- Utility Functions: Includes tools like web retrievers and document parsers to preprocess and enhance data retrieval
Each of these components plays a crucial role in making RAG systems accurate and efficient.
What is LlamaIndex?
LlamaIndex (formerly GPTIndex) is a Python framework designed for building LLM-powered applications. It acts as a bridge between custom data sources and large language models, streamlining data ingestion, indexing, and querying.
With built-in support for various data sources, vector databases, and query interfaces, LlamaIndex serves as an all-in-one solution for RAG applications. It also integrates seamlessly with tools like LangChain, Flask, and Docker, making it highly flexible for real-world implementations.
Explore LlamaIndex’s official GitHub repository at https://github.com/run-llama/llama_index.
Implementing a Simple RAG System with LlamaIndex
Step 1: Set Up the Environment
Before diving into the implementation, we need to set up our Python environment and install the necessary dependencies. Using a virtual environment helps manage dependencies efficiently:
python -m venv rag_env
source rag_env/bin/activate # On Windows, use: rag_env\Scripts\activate
Now you can install the required libraries. LlamaIndex, OpenAI, and FAISS are essential for building our RAG system:
pip install llama-index openai faiss-cpu
To enable LlamaIndex to query an OpenAI model, don’t forget to configure your OpenAI API key:
import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
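Hardcoding the key in source code works for a quick demo, but a safer pattern, assuming you have already exported OPENAI_API_KEY in your shell, is to read it from the environment and fail fast if it is missing:
import os
# Read the key from the environment instead of embedding it in the script
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")
Both LlamaIndex and the OpenAI client pick up this environment variable automatically, so no further configuration is needed once it is exported.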
Step 2: Load Documents
For retrieval to work, we first need to load documents into the system. LlamaIndex provides the SimpleDirectoryReader to handle this process efficiently. In my case, I will use the "Attention Is All You Need" paper to extend my LLM’s knowledge.
from llama_index import SimpleDirectoryReader
# Load text files from a directory
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")
Step 3: Create Text Splits
LLMs have a context window limitation, so we can’t pass entire documents at once. Instead, we split them into smaller, structured chunks for efficient retrieval.
from llama_index.text_splitter import SentenceSplitter
# Define a sentence-based text splitter
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
# Apply text splitting to the documents, producing chunked nodes
nodes = text_splitter.get_nodes_from_documents(documents)
print(f"Split into {len(nodes)} chunks")
Step 4: Index Documents with Embeddings
To perform semantic search, we must convert our document chunks into vector embeddings and store them in an index.
from llama_index import VectorStoreIndex
# Create an index
index = VectorStoreIndex(nodes)
# Persist the index (optional)
index.storage_context.persist(persist_dir="./storage")
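Persisting is only useful if you can restore the index later without re-embedding everything. With the same imports used above, reloading it looks roughly like this:
from llama_index import StorageContext, load_index_from_storage
# Rebuild the index from disk instead of recomputing the embeddings
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)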
Step 5: Query the Index with RAG
This is where RAG (finally) comes into play. We will query the indexed documents to retrieve relevant information and generate an LLM-powered response.
from llama_index.query_engine import RetrieverQueryEngine
# Build a query engine that retrieves relevant chunks and passes them to the LLM
query_engine = RetrieverQueryEngine.from_args(index.as_retriever())
# Ask a question about the indexed paper
response = query_engine.query("What is attention?")
print(response)
If we execute the above, we obtain the following:
Attention is a mechanism used in deep learning models to focus on relevant parts of the input sequence while processing data. In the paper 'Attention Is All You Need,' Vaswani et al. introduced the Transformer architecture, which relies entirely on self-attention mechanisms instead of recurrence or convolution. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence relative to each other, enabling better parallelization and long-range dependencies.
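Two useful follow-ups once this works: inspect which chunks the answer was grounded on, and tune how many chunks the retriever returns. Both snippets reuse the index and query engine from above:
# Inspect the retrieved chunks (and their similarity scores) behind the answer
for source in response.source_nodes:
    print(source.score, source.node.get_content()[:100])
# Retrieve more (or fewer) chunks per query by adjusting the retriever
query_engine = RetrieverQueryEngine.from_args(index.as_retriever(similarity_top_k=5))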
We did it!
Final Thoughts
Building a RAG system with LlamaIndex opens up exciting possibilities for leveraging LLMs beyond their training data. By integrating document retrieval, embedding-based indexing, and real-time querying, RAG enhances accuracy and reduces hallucinations, making it a powerful solution for domain-specific applications.
With the step-by-step implementation in this guide, you now have a functional RAG pipeline that can be expanded in several ways, which we leave as exercises for the reader:
- Customizing embeddings with models like OpenAI, Cohere, or Hugging Face (see the sketch after this list)
- Integrating vector databases such as Pinecone, Weaviate, or ChromaDB for scalable retrieval
- Deploying the system via APIs using Flask, FastAPI, or a chatbot interface
- Optimizing text chunking strategies to improve retrieval quality
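As a starting point for the first item, here is a sketch of swapping in a local Hugging Face embedding model. It assumes a pre-0.10 LlamaIndex release with the ServiceContext API and a sentence-transformers installation; the model name is just an example:
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding
# Embed chunks locally instead of calling the OpenAI embeddings API
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(embed_model=embed_model)
index = VectorStoreIndex(nodes, service_context=service_context)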
Now it’s your turn — experiment, iterate, and push the boundaries of what’s possible with LlamaIndex!
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering applications of the ongoing explosion in the field.