How To Build an Advanced RAG System with LangChain?

by Manika Nagpal | ProjectPro | March 2025



This tutorial will guide you through building an AI-powered retrieval-based question-answering (QA) system using LangChain, Groq API, LlamaParse, and Qdrant. The system will process a financial report (Meta’s First Quarter 2024 Results) and allow users to query relevant information efficiently. Here is a quick overview of what you will learn:

  • Parse documents using LlamaParse for structured text extraction.
  • Generate embeddings with FastEmbed for semantic search.
  • Store and retrieve data efficiently using Qdrant.
  • Enhance responses with similarity search, contextual compression, and Flashrank reranking.
  • Leverage ChatGroq for an improved QA experience.

By the end, you’ll have a smart system answering finance-related queries with precision.

Before diving into the implementation, ensure you have the necessary dependencies installed. The following commands install the required libraries:

!pip -qqq install pip --progress-bar off
!pip -qqq install langchain-groq==0.1.3 --progress-bar off
!pip -qqq install langchain==0.1.17 --progress-bar off
!pip -qqq install llama-parse==0.1.3 --progress-bar off
!pip -qqq install qdrant-client==1.9.1 --progress-bar off
!pip -qqq install "unstructured[md]"==0.13.6 --progress-bar off
!pip -qqq install fastembed==0.2.7 --progress-bar off
!pip -qqq install flashrank==0.2.4 --progress-bar off

Here is what each of these libraries does:

  • LangChain: Build and manage AI agent workflows.
  • LlamaParse: Parse and extract data from PDFs.
  • Qdrant: Store and search vectorized document embeddings.
  • Flashrank: Improve retrieval with ranking models.
  • FastEmbed: Generate document embeddings efficiently.

We will explore these in detail when we import them in Step 2.

We need to set up the Groq API key for interacting with the language model.

import os

os.environ["GROQ_API_KEY"] = input("Enter your GROQ API Key: ")

We import necessary modules for text processing and retrieval-based Q&A.

import textwrap
from pathlib import Path
from google.colab import userdata
from IPython.display import Markdown
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from llama_parse import LlamaParse

These libraries help in:

1) Document Processing & Retrieval

  • UnstructuredMarkdownLoader loads and processes markdown documents.
  • RecursiveCharacterTextSplitter splits documents into smaller chunks for better retrieval.
  • Qdrant stores and retrieves document embeddings efficiently.

2) Text Embedding & Compression

  • FastEmbedEmbeddings converts text into vector representations for retrieval.
  • FlashrankRerank reranks retrieved documents based on relevance.

3) Q&A System & Language Model

  • RetrievalQA creates a question-answering system using retrieved documents.
  • ChatGroq integrates Groq’s LLM for generating answers.

4) Utilities

  • Markdown displays formatted responses.
  • Path handles file paths conveniently.

This step ensures we have all the necessary tools for processing, storing, retrieving, and interacting with documents in an AI-powered Q&A system.

In this step, we download Meta’s Q1 2024 earnings report and process it using LlamaParse to extract relevant financial data.

  1. We first create a data directory and download the earnings report using gdown:
!mkdir data
!gdown 1ee-BhQiH-S9a2IkHiFbJz9eX_SfcZ5m9 -O "data/meta-earnings.pdf"
  2. To ensure we extract only relevant financial data, we provide a custom parsing instruction:
instruction = """The provided document is Meta First Quarter 2024 Results.
This form provides detailed financial information about the company's performance for a specific quarter.
It includes unaudited financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
It contains many tables.
Try to be precise while answering the questions"""

This instruction guides LlamaParse to focus on financial statements, management analysis, and key disclosures, while ignoring unnecessary content.

Now, we initialize LlamaParse with an API key and use it to parse the PDF:

parser = LlamaParse(
    api_key="LLAMA API KEY",  # replace with your LlamaParse API key
    result_type="markdown",
    parsing_instruction=instruction,
    max_timeout=5000,
)
llama_parse_documents = await parser.aload_data("./data/meta-earnings.pdf")

Key Parameters

  • result_type="markdown" → parses the content into Markdown format.
  • max_timeout=5000 → allows more time for processing large documents.
  • parsing_instruction=instruction → ensures only relevant content is extracted.

After parsing, we extract the text and save it as a Markdown file:

parsed_doc = llama_parse_documents[0]
# Optional preview: Markdown(parsed_doc.text[:4096])
document_path = Path("data/parsed_document.md")
with document_path.open("a") as f:  # "a" appends; re-running the cell will append duplicates
    f.write(parsed_doc.text)

This step ensures we have the document’s structured financial information saved for further processing.

Once we have extracted the Meta Q1 2024 earnings report, we need to split the document into smaller, manageable chunks for efficient retrieval and embedding.

We use UnstructuredMarkdownLoader to load the extracted text from the Markdown file:

import nltk

# NLTK resources used by unstructured for sentence tokenization and tagging
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

loader = UnstructuredMarkdownLoader(document_path)
loaded_documents = loader.load()

We do this because the extracted financial report is too large to process efficiently in a single chunk. Additionally, loading it as a structured document ensures it can be split properly.

We next use RecursiveCharacterTextSplitter to divide the document into smaller sections:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=128)
docs = text_splitter.split_documents(loaded_documents)

Key Parameters

  • chunk_size=2048 → each chunk holds at most 2048 characters, small enough to embed and retrieve efficiently.
  • chunk_overlap=128 → adjacent chunks share 128 characters so context is preserved across chunk boundaries (see the quick check after the next snippet).

We then check the total number of chunks created and preview the first one:

len(docs)
print(docs[0].page_content)
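
To see what chunk_overlap=128 actually does, you can compare the tail of one chunk with the head of the next; the shared text (up to 128 characters, depending on where the splitter found a natural boundary) is what carries context across the split. A small inspection, not part of the original walkthrough:

# Compare the end of the first chunk with the start of the second;
# the overlapping characters preserve context across the split point.
print("end of chunk 0:\n", docs[0].page_content[-128:])
print("start of chunk 1:\n", docs[1].page_content[:128])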

Next, to enable semantic search and retrieval, we generate vector embeddings using FastEmbedEmbeddings:

embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

  • Embeddings convert text into numerical vectors so that semantically similar passages end up close together in vector space (a quick sanity check follows below).
  • BAAI/bge-base-en-v1.5 is a strong embedding model optimized for English text retrieval.
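
As a quick sanity check, you can embed a short string and inspect the resulting vector; the 768-dimension figure below comes from the bge-base-en-v1.5 model card, not from the tutorial itself:

# Embed a sample query and inspect the vector FastEmbed returns
sample_vector = embeddings.embed_query("Meta Q1 2024 revenue")
print(len(sample_vector))   # 768 dimensions for bge-base-en-v1.5
print(sample_vector[:5])    # first few components of the embedding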

We now store the embedded chunks in Qdrant, a high-performance vector database for fast retrieval:

qdrant = Qdrant.from_documents(
    docs,
    embeddings,
    # location=":memory:",  # alternative: keep the index in memory only
    path="./db",
    collection_name="document_embeddings",
)

Key Parameters

  • path="./db" → saves the database locally so the index persists and does not need to be rebuilt (see the sketch below).
  • collection_name="document_embeddings" → organizes stored vectors for easy retrieval.
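
Because the index is persisted on disk, a later session can reopen the same collection without re-parsing or re-embedding the report. A minimal sketch, assuming the same embedding model and collection name as above:

from qdrant_client import QdrantClient

# Reconnect to the on-disk Qdrant database created above
client = QdrantClient(path="./db")
qdrant = Qdrant(
    client=client,
    collection_name="document_embeddings",
    embeddings=embeddings,
)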

Now that we’ve stored document embeddings in Qdrant, we can query the database to find relevant information.

We use similarity search to retrieve the most relevant sections from the document based on the query.

query = "What is the most important innovation from Meta?"
similar_docs = qdrant.similarity_search_with_score(query)
  • The query is converted into an embedding using the same model we used earlier (BAAI/bge-base-en-v1.5).
  • Qdrant searches for vectors that are closest to the query in high-dimensional space.
  • The retrieved documents are returned with similarity scores.

To analyze the results, we loop through the retrieved documents and print their text snippet and similarity score:

for doc, score in similar_docs:
    print(f"text: {doc.page_content[:256]}\n")
    print(f"score: {score}")
    print("-" * 80)
    print()

  • A higher score means a better match.
  • A low score may indicate the query is not well covered in the document.

Next, we create a retriever that fetches the top 5 most relevant results:

retriever = qdrant.as_retriever(search_kwargs={"k": 5})
retrieved_docs = retriever.invoke(query)

We use a retriever because it simply returns the top 5 most relevant chunks and plugs directly into downstream question-answering chains, with no manual score handling. If you do want to filter by score instead of taking a fixed number of results, see the variant below.
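
LangChain's vector-store retrievers also support a similarity-score-threshold mode; a sketch of that alternative (the 0.5 threshold is an arbitrary example value, not something used later in this tutorial):

# Illustrative variant: only return chunks whose similarity score
# clears a threshold, up to a maximum of 5 results.
threshold_retriever = qdrant.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5, "k": 5},
)
filtered_docs = threshold_retriever.invoke(query)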

Now, using the code below, we print the metadata of each retrieved document (e.g., an _id to track its source):

for doc in retrieved_docs:
    print(f"id: {doc.metadata['_id']}\n")
    print(f"text: {doc.page_content[:256]}\n")
    print("-" * 80)
    print()

This helps trace back to the original document source and is useful for debugging retrieval performance.

Once we retrieve relevant documents, we further rerank and refine them using Flashrank to improve accuracy.

What is Contextual Compression?

  • Problem: The retrieved documents may still contain irrelevant or less useful information.
  • Solution: Flashrank improves results by reranking documents based on relevance to the query.

We first initialize the Flashrank Reranker and use it with the ms-marco-MiniLM-L-12-v2 model:

compressor = FlashrankRerank(model="ms-marco-MiniLM-L-12-v2")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

  • Base Retriever fetches top results (Step 5 output).
  • Flashrank re-evaluates and reranks them based on query relevance.
  • Higher semantic similarity = Higher ranking.

We then retrieve and rerank the documents and fetch the refined results:

reranked_docs = compression_retriever.invoke(query)
len(reranked_docs)
  • Compression retriever processes retrieved docs.
  • Ranks them based on query relevance score.
  • Returns a more focused list of results.

Finally, we display the reranked document IDs, content snippets, and relevance scores as it helps analyze ranking quality.

for doc in reranked_docs:
    print(f"id: {doc.metadata['_id']}\n")
    print(f"text: {doc.page_content[:256]}\n")
    print(f"score: {doc.metadata['relevance_score']}")
    print("-" * 80)
    print()

  • Higher score = Better contextual match.

Now that we have retrieved, ranked, and compressed the most relevant documents, we can build a Q&A system using Groq's Llama3-70B model.

We use ChatGroq with a temperature=0 to ensure deterministic (fact-based) responses:

llm = ChatGroq(temperature=0, model_name="llama3-70b-8192")

  • Llama3-70B is a large-scale model well suited to factual question answering.
  • The 8,192-token context window handles long-form documents and prompts.
  • temperature=0 removes sampling randomness, giving consistent, repeatable answers (it reduces, but does not eliminate, hallucinations).
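
Before wiring the model into the retrieval chain, you can confirm the Groq connection works with a direct call (an optional quick check; ChatGroq follows LangChain's standard chat-model interface):

# Quick connectivity check: invoke() accepts a plain string
# and returns a message object with a .content attribute.
print(llm.invoke("Reply with OK if you can read this.").content)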

We define a prompt template that structures how the LLM processes retrieved documents:

prompt_template = """
Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Answer the question and provide additional helpful information,
based on the pieces of information, if applicable. Be succinct.
Responses should be properly formatted to be easily read.
"""
prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

We connect the Groq LLM with our retriever to answer user questions:

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt, "verbose": True},
)

  • Retrieves top documents (from Step 6).
  • Passes them into the LLM along with the user’s question.
  • Generates a structured answer while referencing the retrieved documents (see the inspection sketch below).
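
Because return_source_documents=True, each response is a dictionary that carries both the answer and the chunks it was grounded in. A small sketch for inspecting them, using RetrievalQA's default output keys "result" and "source_documents":

# Peek at the chunks behind an answer to verify it is grounded in the report
response = qa.invoke("What was Meta's revenue in the first quarter of 2024?")
print(response["result"])
for src in response["source_documents"]:
    print(src.metadata.get("_id"), "-", src.page_content[:120])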

We next define a function to format the responses for better readability:

def print_response(response):
    response_txt = response["result"]
    for chunk in response_txt.split("\n"):
        if not chunk:
            print()
            continue
        print("\n".join(textwrap.wrap(chunk, 100, break_long_words=False)))

The function improves readability by wrapping text at 100 characters per line while avoiding breaking words mid-word.

We now query the system for key insights:

1) What is Meta’s most significant innovation?

response = qa.invoke("What is the most significant innovation from Meta?")
print_response(response)

2) What is the revenue for 2024 and % change?

response = qa.invoke("What is the revenue for 2024 and % change?")
Markdown(response["result"])

3) What is the revenue for 2023?

response = qa.invoke("What is the revenue for 2023?")
print_response(response)

4) Revenue minus costs & expenses (Profit Calculation)

response = qa.invoke("How much is the revenue minus the costs and expenses for 2024? Calculate the answer")
print_response(response)
response = qa.invoke("How much is the revenue minus the costs and expenses for 2023? Calculate the answer")
print_response(response)

5) Expected revenue for Q2 2024

response = qa.invoke("What is the expected revenue for the second quarter of 2024?")
print_response(response)

6) Overall Q1 2024 Outlook

response = qa.invoke("What is the overall outlook of Q1 2024?")
print_response(response)

Note: Special thanks to Venelin Valkov for his GitHub repo, AI-Bootcamp: Advanced RAG with Llama-3 in LangChain.

Congratulations! You have successfully built an AI-powered retrieval-based question-answering system using LangChain, Qdrant, and LlamaParse. You can now ask questions about Meta’s financial report and get precise answers from the document.

To further enhance your AI and data science skills, explore ProjectPro, a platform offering real-world, end-to-end projects to help you gain hands-on experience.

Start building more projects today to gain hands-on experience and accelerate your AI journey! 🚀
