Image by Editor | Ideogram
Editor’s note: This is the second part of this tutorial. You can find the first part here.
Code Walkthrough
In this section, we will explore the process of building an agent-based RAG application with LangChain. To follow along with each step outlined in this guide, make sure the following prerequisites are met:
- Python 3: For this implementation, you will need Python version 3 or higher.
- OpenAI API keys: These keys let the application communicate with OpenAI’s infrastructure and access its language models. Sign up and grab your API keys here.
- LangChain: A framework designed to simplify the integration of LLMs and retrieval systems.
- Pinecone: A managed, cloud-native vector database with a streamlined API and no infrastructure hassles; it provides long-term memory for high-performance AI applications.
Import Packages
Install and import the required packages.
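If you are starting from a clean environment, the command below covers the packages imported in this walkthrough. Treat it as a sketch: exact package names and versions can drift between releases, so adjust it to your setup (in a notebook, prefix the line with !).
# Install the dependencies used in this walkthrough (names may vary slightly between versions)
pip install langchain langchain-community langchain-openai langchain-pinecone langchain-groq pinecone-client tiktoken python-dotenv pandas numpy tqdm tavily-python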
# GLOBAL
import os
import pandas as pd
import numpy as np
import tiktoken
from uuid import uuid4
# from tqdm import tqdm
from dotenv import load_dotenv
from tqdm.autonotebook import tqdm
# LANGCHAIN
import langchain
from langchain.llms import OpenAI
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq
from langchain_pinecone import PineconeVectorStore
from langchain_core.prompts import PromptTemplate
# VECTOR STORE
import pinecone
from pinecone import Pinecone, ServerlessSpec
# AGENTS
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain.agents import AgentExecutor, Tool, AgentType
from langchain.agents.react.agent import create_react_agent
from langchain import hub
Load Environment Variables
To keep our API keys private, we will load them as environment variables from a .env file.
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
Load Documents
LangChain provides several Document Loaders based on the type of file you need to use. The most common ones include loaders for CSV, HTML, JSON, Markdown, File Directory, and Microsoft Office formats. You can find the full list here.
Additionally, you can load documents directly from services like Google Cloud, Notion, YouTube, and many others.
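To show how interchangeable these loaders are, here is a small sketch (not part of this project) that reads plain-text files from a hypothetical ./docs folder with DirectoryLoader and TextLoader; every loader returns the same list of Document objects, so the rest of the pipeline stays unchanged.
# Example only: loading a folder of .txt files instead of a CSV
from langchain_community.document_loaders import DirectoryLoader, TextLoader
txt_loader = DirectoryLoader(
  "./docs",               # hypothetical folder of plain-text files
  glob="**/*.txt",        # match all .txt files recursively
  loader_cls=TextLoader   # each file becomes one Document
)
txt_docs = txt_loader.load()
print(len(txt_docs), txt_docs[0].metadata)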
For this example, we will use a CSV file and the CSVLoader. Here’s how to load the file, with the following arguments:
- File path: The path to your CSV file.
- Source column: The column in the CSV file that contains the main data of interest, in this case, the transcript.
- Metadata columns: A list of column names that contain extra information (metadata) about each entry in the transcript.
# Load Documents
loader = CSVLoader(
  file_path="./tedx_document.csv",
  encoding='utf-8',
  source_column="transcript",
  metadata_columns=["main_speaker", "name", "speaker_occupation", "title", "url", "description"]
)
data = loader.load()
The CSVLoader allows us to upload a CSV file, with options to enhance the pipeline using metadata.
Indexing
The Vector Store Index converts your documents into vector representations. When you search, your query is also turned into a vector. The Vector Store Index then compares the query vector to all the document vectors, ranking them by how similar they are to your query.
This method lets you search your document collection based on meaning, rather than just exact keyword matches. To understand how vector search works, we will look at the concepts of tokenization, embedding, and similarity.
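To make the idea concrete before we bring in real embedding models, here is a minimal sketch with made-up three-dimensional vectors: each document and the query become a vector, and we rank the documents by cosine similarity to the query. Real embeddings have hundreds or thousands of dimensions, but the ranking logic is the same.
# Toy example: rank documents by cosine similarity to a query (made-up 3-D vectors)
import numpy as np
doc_vectors = np.array([
  [0.9, 0.1, 0.0],   # "document 0"
  [0.1, 0.8, 0.1],   # "document 1"
  [0.7, 0.2, 0.1],   # "document 2"
])
query_vector = np.array([0.8, 0.2, 0.0])
# Cosine similarity between the query and every document vector
sims = doc_vectors @ query_vector / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector))
ranking = np.argsort(-sims)   # document indices, most similar first
print(ranking, sims[ranking])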
Tokenizer
A token is a basic unit of meaning in a sentence or piece of text. Tokens can be words, punctuation marks, or even sub-words. These tokens are then converted into numerical vector representations, which LLMs can process.
Here’s an example using the tiktoken library, which employs the BPE (Byte Pair Encoding) algorithm to turn text into tokens. This library is used for models like GPT-3.5 and GPT-4. For a good explanation of the BPE algorithm, check out this resource from Hugging Face.
Source: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
# Tokenization
# Count the number of tokens in a given string
def num_tokens_from_string(question, encoding_name):
  encoding = tiktoken.get_encoding(encoding_name)
  num_tokens = encoding.encode(question)
  return encoding, num_tokens
question = "How many TEDx talks are on the list?"
encoding, num_tokens = num_tokens_from_string(question, "cl100k_base")
print(f'Number of Words: {len(question.split())}')
print(f'Number of Characters: {len(question)}')
print(f'List of Tokens: {num_tokens}')
print(f'Nr of Tokens: {len(num_tokens)}')
The cl100k_base encoding, with a vocabulary of roughly 100k tokens, is the most common and efficient choice for recent OpenAI models.
# Decoding tokenizer
encoding.decode([4438, 1690, 84296, 87, 13739, 527, 389, 279, 1160, 30])
Embedding
Embeddings are a method to represent complex data, like words, in a simpler, lower-dimensional form while keeping the meaningful similarities between the original data points.
Source: https://openai.com/index/new-embedding-models-and-api-updates
Similarity
The most common metric used for similarity search is cosine similarity. It is often used in semantic search and document classification because it compares the direction of vectors, which helps in understanding the overall content of documents. By comparing the vector representations of the query and the documents, cosine similarity can find and return the most similar and relevant documents in the search results.
Source: https://www.pinecone.io/learn/vector-similarity/
Cosine similarity measures how similar two non-zero vectors are. It calculates the cosine of the angle between the two vectors, giving a value between -1 (pointing in opposite directions) and 1 (identical).
# Define cosine similarity function
def cosine_similarity(query_emb, document_emb):
  # Calculate the dot product of the query and document embeddings
  dot_product = np.dot(query_emb, document_emb)
  # Calculate the L2 norms (magnitudes) of the query and document embeddings
  query_norm = np.linalg.norm(query_emb)
  document_norm = np.linalg.norm(document_emb)
  # Calculate the cosine similarity
  cosine_sim = dot_product / (query_norm * document_norm)
  return cosine_sim
# Using text-embedding-3-large model
question = "What is the topic of the TEDx talk from Al Gore?"
document = "Averting the climate crisis"
embedding = OpenAIEmbeddings(model="text-embedding-3-large", openai_api_key=OPENAI_API_KEY)
query_emb = embedding.embed_query(question)
document_emb = embedding.embed_query(document)
cosine_sim = cosine_similarity(query_emb, document_emb)
# print(f'Query Vector: {query_emb}')
# print(f'Document Vector: {document_emb}')
print(f'Query Dimensions: {len(query_emb)}')
print(f'Document Dimensions: {len(document_emb)}')
print("Cosine Similarity:", cosine_sim)
Text Splitters
One notable limitation of LLMs is the context window, which determines the maximum amount of text or tokens a model can handle at once to generate a response. Hence, it becomes necessary to divide our documents into smaller chunks that fit within the model’s context window.
The RecursiveCharacterTextSplitter is a great tool for breaking down text. It works by dividing the text into smaller parts based on a set chunk size, using specific characters as separators.
In LangChain, it uses default separators like paragraphs, sentences, and words. This helps keep related text parts together, like paragraphs first, then sentences and words, which usually have strong connections in the text.
To use this tool effectively, we can combine RecursiveCharacterTextSplitter with the tiktoken library. This ensures that each split doesn’t go over the maximum token chunk size allowed by the language model. If a split is too big, it gets divided recursively until it fits.
Here’s how our text splitter looks:
- Model: gpt-3.5-turbo-0125 with a context window of 16,385 tokens.
- Chunk size: number of tokens in one chunk.
- Chunk overlap: number of tokens that overlap between two consecutive chunks.
- Separators: the order in which separators are applied.
# Splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
  model_name="gpt-3.5-turbo-0125",
  chunk_size=512,
  chunk_overlap=20,
  separators= ["\n\n", "\n", " ", ""])
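As a quick sanity check (optional, and assuming the data loaded earlier), you can split a few of the loaded documents and inspect the first chunk to confirm the chunk size and separators behave as expected:
# Preview the splitter on a handful of documents
splits_preview = text_splitter.split_documents(data[:5])
print(f"Number of chunks: {len(splits_preview)}")
print(splits_preview[0].page_content[:200])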
Vector Store
A vector store is a specialized database designed for storing and managing high-dimensional vector data. Instead of typical data formats, it stores data as vector embeddings. These embeddings are then used by LLMs to comprehend the context and meaning of the data, resulting in more accurate responses.
Pinecone is a serverless vector store known for its exceptional performance in fast vector search and retrieval processes.
To begin using Pinecone, the first step is to create an Index where our embeddings will be stored. This involves considering several parameters:
- Index name
- Dimension: should match the dimensions of the embedding model
- Metric: should align with the metric used to train the embedding model for optimal results
- Serverless specifications
# Pinecone Initialization
index_name = "langchain-pinecone-test"
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key = PINECONE_API_KEY)
# Create Index
pc.create_index(
  name=index_name,
  dimension=1536,
  metric="cosine",
  spec=ServerlessSpec(
    cloud="aws",
    region="us-east-1"))  # a region is required for a serverless index; pick the one that suits your project
index = pc.Index(index_name)
# List Indexes
pc.list_indexes()
# Describe Index
index = pc.Index(index_name)
index.describe_index_stats()
Namespaces
Namespaces in Pinecone let you organize your data into different sections within an index. This helps you send queries to a specific section. For instance, you could divide your data based on content, language, or any other category that fits your needs.
Let’s start by uploading 100 data records to one namespace. Then, we’ll split it into two sections, each containing 50 records. Altogether, we’ll have three namespaces.
# Create Main Namespace
splits = text_splitter.split_documents(data[:100])
embed = OpenAIEmbeddings(model="text-embedding-ada-002")
db = PineconeVectorStore.from_documents(documents=splits,
                    embedding=embed,
                    index_name=index_name,
                    namespace="main"
                    )
# Create Vectorstore of Main index
vectorstore = PineconeVectorStore(index_name=index_name,
                 namespace="main",
                 embedding=embed)
# Search for similarity
query = "Who is Al Gore"
similarity = vectorstore.similarity_search(query, k=4)
for i in range(len(similarity)):
  print(f"-------Result Nr. {i}-------")
  print(f"Main Speaker: {similarity[i].metadata['main_speaker']}")
  print(" ")
# Search for similarity with score
query = "Who is Al Gore"
similarity_with_score = vectorstore.similarity_search_with_score(query, k=4)
for i in range(len(similarity_with_score)):
  print(f"-------Result Nr. {i}-------")
  print(f"Title: {similarity_with_score[i][0].metadata['title']}")
  print(f"Main Speaker: {similarity_with_score[i][0].metadata['main_speaker']}")
  print(f"Score: {similarity_with_score[i][1]}")
  print(" ")
Next, we’ll generate two additional namespaces, each containing 50 records. To accomplish this, we’ll utilize the upsert function along with metadata to insert data into our index, but this time, into distinct namespaces. Initially, we’ll create the chunks.
# Create Chunked Metadata
def chunked_metadata_embeddings(documents, embed):
  chunked_metadata = []
  chunked_text = text_splitter.split_documents(documents)
  for index, text in enumerate(tqdm(chunked_text)):
    payload = {
      "metadata": {
        "source": text.metadata['source'],
        "row": text.metadata['row'],
        "chunk_num": index,
        "main_speaker": text.metadata['main_speaker'],
        "name": text.metadata['name'],
        "speaker_occupation": text.metadata['speaker_occupation'],
        "title": text.metadata['title'],
        "url": text.metadata['url'],
        "description": text.metadata['description'],
      },
      "id": str(uuid4()),
      "values": embed.embed_documents([text.page_content])[0]  # embed is the embedding model passed in as a parameter
    }
    chunked_metadata.append(payload)
  return chunked_metadata
# Create the first split
split_one = chunked_metadata_embeddings(data[:50], embed)
len(split_one)
# Create a second split
split_two = chunked_metadata_embeddings(data[50:100], embed)
len(split_two)
# Upsert the document
def batch_upsert(split, index, namespace, batch_size):
  print(f"Split Length: {len(split)}")
  for i in range(0, len(split), batch_size):
    batch = split[i:i + batch_size]
    index.upsert(vectors=batch, namespace=namespace)
batch_upsert(split_one, index, "first_split", 10)
The function below helps to find a specific chunk based on the main speaker. It gives back the title and the chunk ID, which you can use to locate it in the Pinecone cloud.
# Function to find item with main_speaker
def find_item_with_row(metadata_list, main_speaker):
  for item in metadata_list:
    if item['metadata']['main_speaker'] == main_speaker:
      return item
# Call the function to find item with main_speaker = Al Gore
result_item = find_item_with_row(split_one, "Al Gore")
# Print the result
print(f'Chunk Nr: {result_item["metadata"]["chunk_num"]}')
print(f'Chunk ID: {result_item["id"]}')
print(f'Chunk Title: {result_item["metadata"]["title"]}')
Now we can observe that our index has two sections using the following function.
index.describe_index_stats()
We can make the namespace for the second split and confirm that everything’s set up correctly.
batch_upsert(split_two, index, "last_split", 20)
Next, we’ll test our namespaces by setting up two users, each of whom will send their query to a different namespace.
# Define Users
query_one = "Who is Al Gore?"
query_two = "Who is Rick Warren?"
# Users dictionary
users = [
  {
    'name': 'John',
    'namespace': 'first_split',
    'query': query_one
  },
  {
    'name': 'Jane',
    'namespace': 'last_split',
    'query': query_two
  }
]
def vectorize_query(embed, query):
  return embed.embed_query(query)
# Create our vectors for each of our queries:
query_vector_one = vectorize_query(embed, query_one)
query_vector_two = vectorize_query(embed, query_two)
len(query_vector_one), len(query_vector_two)
# Define a list of new key-value pairs
new_key_value_pairs = [
  {'vector_query': query_vector_one},
  {'vector_query': query_vector_two},
]
# Loop through the list of users and the list of new key-value pairs
for user, new_pair in zip(users, new_key_value_pairs):
  user.update(new_pair)
users[0]["name"], users[1]["name"]
print(f"Name: users[0]['name']")
print(f"Namespace: users[0]['namespace']")
print(f"Query: users[0]['query']")
print(f"Vector Query: users[0]['vector_query'][:3]")
If we send the query to the namespace, we’ll receive the top_k vectors related to that query.
# Query the namespace
john = [t for t in users if t.get('name') == 'John'][0]
john_query_vector = john['vector_query']
john_namespace = john['namespace']
index.query(vector=john_query_vector, top_k=2, include_metadata=True, namespace=john_namespace)
Now that our namespaces are set up, we can prepare our RAG pipeline using agents.
Retrieval
# Create vectorstore
embed = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = PineconeVectorStore(index_name=index_name,
                 namespace="main",
                 embedding=embed)
In this retrieval step, you can choose any LLM provider, but for the sake of this article we will stick to OpenAI. We will also add some memory to keep track of the conversation.
# Retrieval
llm = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo", max_tokens=512)
# Conversational memory
conversational_memory = ConversationBufferWindowMemory(
            memory_key='chat_history',
            k=5,
            return_messages=True)
# Retrieval qa chain
qa_db = RetrievalQA.from_chain_type(
                  llm=llm,
                  chain_type="stuff",
                  retriever=vectorstore.as_retriever())
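Before wiring this chain into an agent, it can be worth a quick standalone test of the retrieval chain; a minimal sketch (the question is just an example):
# Sanity check: query the retrieval chain directly
response = qa_db.invoke({"query": "Who is Al Gore?"})
print(response["result"])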
Augmented
We’ll be using a slightly changed prompt template. First, we’ll download the ReAct template, a popular prompt for agents that reason about which tools to use. Then, we’ll add an instruction on which tool to check first.
A collection of templates can be found in the LangChain Hub.
prompt = hub.pull("hwchase17/react")
print(prompt.template)
We will get this output:
Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
Now we will replace this line:
“Action: the action to take, should be one of [{tool_names}]”
With this line:
“Action: the action to take, should be one of [{tool_names}]. Always look first in Pinecone Document Store”
# Set prompt template
template = '''
     Answer the following questions as best you can. You have access to the following tools:
     {tools}
     Use the following format:
     Question: the input question you must answer
     Thought: you should always think about what to do
     Action: the action to take, should be one of [{tool_names}]. Always look first in Pinecone Document Store
     Action Input: the input to the action
     Observation: the result of the action
     ... (this Thought/Action/Action Input/Observation can repeat 2 times)
     Thought: I now know the final answer
     Final Answer: the final answer to the original input question
     Begin!
     Question: {input}
     Thought:{agent_scratchpad}
     '''
prompt = PromptTemplate.from_template(template)
Generation with Agent
Finally, we will generate answers with an agent. Before doing that, we must make sure two things are ready: the vector store, which will be the first stop for finding information, and a search API (the Tavily Search API), which searches sources like Bing or Google and returns the most relevant content.
# Set up tools and agent
import os
TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")
tavily = TavilySearchResults(max_results=10, tavily_api_key=TAVILY_API_KEY)
tools = [
  Tool(
    name = "Pinecone Document Store",
    func = qa_db.run,
    description = "Use it to lookup information from the Pinecone Document Store"
  ),
  Tool(
    name="Tavily",
    func=tavily.run,
    description="Use this to lookup information from Tavily",
  )
]
agent = create_react_agent(llm,
             tools,
             prompt)
agent_executor = AgentExecutor(tools=tools,
            agent=agent,
            handle_parsing_errors=True,
            verbose=True,
            memory=conversational_memory)
Once everything is ready, we can begin asking questions and see how the agents prioritize, the quality of their search, and the answers they provide.
agent_executor.invoke("input":"Can you give me one title of a TED talk of Al Gore as main speaker?. \ Please look in the pinecone document store metadata as it has the titlebased on the transcripts")
Output:
{'input': 'Can you give me one title of a TED talk of Al Gore as main speaker?. Please look in the pinecone document store metadata as it has the title based on the transcripts',
 'chat_history': [],
 'output': 'The title of a TED talk by Al Gore as the main speaker is "The case for optimism on climate change". Al Gore is a former Vice President of the United States known for his work on environmental issues, particularly climate change.'}
agent_executor.invoke("input": "What is the main topic of Dan Gilbert TEDx talks?")
Output:
{'input': 'What is the main topic of Dan Gilbert TEDx talks?',
 'chat_history': [HumanMessage(content='Can you give me one title of a TED talk of Al Gore as main speaker?. Please look in the pinecone document store metadata as it has the title based on the transcripts'),
  AIMessage(content='The title of a TED talk by Al Gore as the main speaker is "The case for optimism on climate change". Al Gore is a former Vice President of the United States known for his work on environmental issues, particularly climate change.')],
 'output': "The main topic of Dan Gilbert's TEDx talks is the surprising science of happiness."}
We can have a look at the conversation history using load_memory_variables().
conversational_memory.load_memory_variables({})
Output:
{'chat_history': [HumanMessage(content='Can you give me one title of a TED talk of Al Gore as main speaker?. Please look in the pinecone document store metadata as it has the title based on the transcripts'),
  AIMessage(content='The title of a TED talk by Al Gore as the main speaker is "The case for optimism on climate change". Al Gore is a former Vice President of the United States known for his work on environmental issues, particularly climate change.'),
  HumanMessage(content='Is Dan Gilbert a main speaker of TEDx talks? If yes, give me the source of your answer'),
  AIMessage(content='Dan Gilbert is a main speaker of TEDx talks. The source of this information can be found on premierespeakers.com.'),
  HumanMessage(content='What is the main topic of Dan Gilbert TEDx talks?'),
  AIMessage(content="The main topic of Dan Gilbert's TEDx talks is the surprising science of happiness.")]}
You can also clear the memory (if you want to).
agent_executor.memory.clear()
Conclusion
We covered a lot in this article. We talked about RAG, naive RAG, and the benefits of Agentic RAG. We then dug deeper into building an application that uses agents for generation, covering every step along the way: loading documents, indexing, text splitting, vector stores, retrieval, augmentation, and finally generation with an agent.
Here is the repository for the complete code. If you have any questions or encounter any issues while exploring this article, please do not hesitate to reach out to us. For further exploration and detailed information about Agentic RAG, you can refer to the following online resources:
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.