It can be overwhelming to start studying LLMs with all the content available over the internet, and new things come up each day. I’ve read guides from Google, OpenAI, and Anthropic and noticed how each focuses on different aspects of agents and LLMs. So, I decided to consolidate these concepts here and add other important ideas that I think are essential if you’re starting to study this field.
This post covers key concepts with code examples to make things concrete. I’ve prepared a Google Colab notebook with all the examples so you can apply the code while reading the article. To use it, you’ll need an API key — check section 5 of my previous article if you don’t know how to get one.
While this guide gives you the essentials, I recommend reading the full articles from these companies to deepen your understanding.
I hope this helps you to build a solid foundation as you start your journey with LLMs!
In this MindMap, you can check a summary of this article’s content.
What is an agent?
“Agent” can be defined in several ways. Each company whose guide I’ve read defines agents differently. Let’s examine these definitions and compare them:
“Agents are systems that independently accomplish tasks on your behalf.” (OpenAI)
“In its most fundamental form, a Generative AI agent can be defined as an application that attempts to achieve a goal by observing the world and acting upon it using the tools that it has at its disposal. Agents are autonomous and can act independently of human intervention, especially when provided with proper goals or objectives they are meant to achieve. Agents can also be proactive in their approach to reaching their goals. Even in the absence of explicit instruction sets from a human, an agent can reason about what it should do next to achieve its ultimate goal.” (Google)
“Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:
– Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
– Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.” (Anthropic)
The three definitions emphasize different aspects of an agent. However, all of them agree that agents:
- Operate autonomously to perform tasks
- Make decisions about what to do next
- Use tools to achieve goals
An agent is composed of 3 main components:
- Model
- Instructions/Orchestration
- Tools

First, I’ll define each component in a straightforward phrase so you can have an overview. Then, in the following section, we’ll dive into each component.
- Model: a language model that generates the output.
- Instructions/Orchestration: explicit guidelines defining how the agent behaves.
- Tools: allow the agent to interact with external data and services.
Model
Model refers to the language model (LM). In simple terms, it predicts the next word or sequence of words based on the words it has already seen.
If you want to understand how these models work behind the black box, here is a video from 3Blue1Brown that explains it.
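For example, here is a minimal sketch of asking a model to continue a piece of text, using the google-genai SDK that the rest of this post’s examples rely on (it assumes your GEMINI_API_KEY environment variable is set with the API key mentioned earlier):
import os
from google import genai

# Create the client reused throughout the examples in this post
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Complete the sentence: The capital of France is",
)
print(response.text)  # the model predicts the most likely continuation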
Agents vs models
Agents and models are not the same. The model is one component of an agent, and the agent uses it to generate outputs. While models are limited to predicting a response based on their training data, agents extend this functionality by acting independently to achieve specific goals.
Here is a summary of the main differences between Models and Agents from Google’s paper.

Large Language Models
The extra L in LLM stands for “Large”, which mainly refers to the number of parameters the model has. These models can have hundreds of billions or even trillions of parameters. They are trained on huge amounts of data and require heavy computational power to train.
Examples of LLMs are GPT-4o, Gemini 2.0 Flash, Gemini 2.5 Pro, and Claude 3.7 Sonnet.
Small Language Models
We also have Small Language Models (SLMs). They are used for simpler tasks, need less data and fewer parameters, are lighter to run, and are easier to control.
SLMs have fewer parameters (typically under 10 billion), dramatically reducing the computational costs and energy usage. They focus on specific tasks and are trained on smaller datasets. This maintains a balance between performance and resource efficiency.
Examples of SLMs are Llama 3.1 8B (Meta), Gemma 2 9B (Google), and Mistral 7B (Mistral AI).
Open Source vs Closed Source
These models can be open source or closed source. Being open source means that the code — and sometimes the model weights and training data, too — is publicly available for anyone to use freely, understand how it works internally, and adjust for specific tasks.
Closed source means that the code isn’t publicly available. Only the company that developed the model controls its use, and users can only access it through APIs or paid services. Sometimes there is a free tier, as Gemini has.
Here, you can check some open source models on Hugging Face.

Models marked with * in the size column don’t have publicly disclosed parameter counts, but there are rumors of hundreds of billions or even trillions of parameters.
Instructions/Orchestration
Instructions are explicit guidelines and guardrails defining how the agent behaves. In its most fundamental form, an agent would consist of just “Instructions” for this component, as defined in OpenAI’s guide. However, an agent could have more than just “Instructions” to handle more complex scenarios. In Google’s paper, this component is called “Orchestration” instead, and it involves three layers:
- Instructions
- Memory
- Model-based Reasoning/Planning
Orchestration follows a cyclical pattern. The agent gathers information, processes it internally, and then uses those insights to determine its next move.

Instructions
The instructions could be the model’s goals, profile, roles, rules, and information you think is important to enhance its behavior.
Here is an example:
system_prompt = """
You are a friendly programming tutor.
Always explain concepts in a simple and clear way, using examples when possible.
If the user asks something unrelated to programming, politely bring the conversation back to programming topics.
"""
In this example, we defined the LLM’s role, the expected behavior, how we wanted the output — simple and with examples when possible — and set limits on what it is allowed to talk about.
Model-based Reasoning/Planning
Some reasoning techniques, such as ReAct and Chain-of-Thought, give the orchestration layer a structured way to take in information, perform internal reasoning, and produce informed decisions.
Chain-of-Thought (CoT) is a prompt engineering technique that enables reasoning through intermediate steps. It is a way of prompting a language model to generate a step-by-step explanation or reasoning process before arriving at a final answer. This helps the model break the problem down and avoid skipping intermediate steps, reducing reasoning failures.
Prompting example:
system_prompt = """
You are the assistant for a tiny candle shop.
Step 1: Check whether the user mentions either of our candles:
• Forest Breeze (woodsy scent, 40 h burn, $18)
• Vanilla Glow (warm vanilla, 35 h burn, $16)
Step 2: List any assumptions the user makes
(e.g. "Vanilla Glow lasts 50 h" or "Forest Breeze is unscented").
Step 3: If an assumption is wrong, correct it politely.
Then answer the question in a friendly tone.
Mention only the two candles above. We don't sell anything else.
Use exactly this output format:
Step 1: <your reasoning>
Step 2: <your reasoning>
Step 3: <your reasoning>
Response to user: <final answer>
"""
Here is an example of the model output for the user query: “Hi! I’d like to buy the Vanilla Glow. Is it $10?”. You can see the model following our guidelines from each step to build the final answer.
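To reproduce this yourself, a minimal call could look like the following. This is just a sketch reusing the client created in the Model section; concatenating the system prompt and the user message is one simple way to pass both to the model:
user_query = "Hi! I'd like to buy the Vanilla Glow. Is it $10?"

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=f"{system_prompt}\n\nUser message: {user_query}",
)
print(response.text)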

ReAct is another prompt engineering technique that combines reasoning and acting. It gives the language model a thought-process strategy to reason about a user query and take actions, continuing in a loop until the task is accomplished. This technique overcomes weaknesses of reasoning-only methods like CoT, such as hallucination, because the model reasons over external information obtained through its actions.
Prompting example:
system_prompt = """You are an agent that can call two tools:
1. CurrencyAPI:
• input: {base_currency (3-letter code), quote_currency (3-letter code)}
• returns: exchange rate (float)
2. Calculator:
• input: {arithmetic_expression}
• returns: result (float)
Follow **strictly** this response format:
Thought: <your reasoning>
Action: <ToolName>[<arguments>]
Observation: <tool result>
… (repeat Thought/Action/Observation as needed)
Answer: <final answer for the user>
Never output anything else. If no tool is needed, skip directly to Answer.
"""
Here, I haven’t implemented the functions (the model hallucinates the exchange rate), so this is just an example of the reasoning trace:

These techniques are useful when you need transparency and control over what the agent is answering or doing and why. They help you debug your system, and analyzing the traces can provide signals for improving your prompts.
If you want to read more, these techniques were proposed by Google researchers in the papers Chain-of-Thought Prompting Elicits Reasoning in Large Language Models and ReAct: Synergizing Reasoning and Acting in Language Models.
Memory
LLMs don’t have memory built in. This “Memory” is some content you pass inside your prompt to give the model context. We can refer to two types of memory: short-term and long-term.
- Short-term memory refers to the immediate context the model has access to during an interaction. This could be the latest message, the last N messages, or a summary of previous messages. The amount could vary based on the model’s context limitations — once you hit that limit, you could drop older messages to give space to new ones.
- Long-term memory involves storing important information beyond the model’s context window for future use. To work around this, you could summarize past conversations or get key information and save them externally, typically in a vector database. When needed, the relevant information is retrieved using Retrieval-Augmented Generation (RAG) techniques to refresh the model’s understanding. We’ll talk about RAG in the following section.
Here is just a simple example of managing short-term memory manually. You can check the Google Colab notebook for this code execution and a more detailed explanation.
# System prompt
system_prompt = """
You are the assistant for a tiny candle shop.
Step 1: Check whether the user mentions either of our candles:
• Forest Breeze (woodsy scent, 40 h burn, $18)
• Vanilla Glow (warm vanilla, 35 h burn, $16)
Step 2: List any assumptions the user makes
(e.g. "Vanilla Glow lasts 50 h" or "Forest Breeze is unscented").
Step 3: If an assumption is wrong, correct it politely.
Then answer the question in a friendly tone.
Mention only the two candles above. We don't sell anything else.
Use exactly this output format:
Step 1: <your reasoning>
Step 2: <your reasoning>
Step 3: <your reasoning>
Response to user: <final answer>
"""

# Start a chat history
chat_history = []

# First message
user_input = "I would like to buy 1 Forest Breeze. Can I pay $10?"
full_content = f"System instructions: {system_prompt}\n\nChat history: {chat_history}\n\nUser message: {user_input}"
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=full_content
)

# Append to chat history
chat_history.append({"role": "user", "content": user_input})
chat_history.append({"role": "assistant", "content": response.text})

# Second message
user_input = "What did I say I wanted to buy?"
full_content = f"System instructions: {system_prompt}\n\nChat history: {chat_history}\n\nUser message: {user_input}"
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=full_content
)

# Append to chat history
chat_history.append({"role": "user", "content": user_input})
chat_history.append({"role": "assistant", "content": response.text})

print(response.text)
We pass the variable full_content to the model, composed of the system_prompt (containing instructions and reasoning guidelines), the memory (chat_history), and the new user_input.
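The example above keeps the entire history. As mentioned, once you approach the model’s context limit you can drop older messages. Here is a minimal sketch of that trimming, assuming we simply keep the last N messages (a real application might summarize the dropped ones instead):
MAX_MESSAGES = 10  # assumption: keep only the 10 most recent messages

def trim_history(history: list, max_messages: int = MAX_MESSAGES) -> list:
    """Keep only the most recent messages so the prompt stays within the context window."""
    return history[-max_messages:]

chat_history = trim_history(chat_history)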

In summary, you can combine instructions, reasoning guidelines, and memory in your prompt to get better results. All of this combined forms one of an agent’s components: Orchestration.
Tools
Models are really good at processing information; however, they are limited to what they have learned from their training data. With access to tools, models can interact with external systems and access knowledge beyond their training data.

Functions and Function Calling
Functions are self-contained modules of code that accomplish a specific task and can be reused over and over again.
When implementing function calling, you connect a model with functions. You provide a set of predefined functions, and the model determines when to use each function and which arguments are required based on the function’s specifications.
The model does not execute the function itself. Based on the user query, it indicates which function should be called and with which parameters (inputs), and you have to write the code that actually executes it. However, if we build an agent, we can program its workflow to execute the function and answer based on the result, or we can use LangChain, which abstracts this code away: you just pass the functions to a pre-built agent. Remember that an agent is a composition of model + instructions + tools.
In this way, you extend your agent’s capabilities to use external tools, such as calculators, and take actions, such as interacting with external systems using APIs.
Here, I’ll first show you an LLM and a basic function call so you can understand what is happening. It is great to use LangChain because it simplifies your code, but you should understand what is happening underneath the abstraction. At the end of the post, we’ll build an agent using LangChain.
The process of creating a function call:
- Define the function and a function declaration, which describes the function’s name, parameters, and purpose to the model.
- Call the LLM with the function declarations. You can pass multiple functions and define whether the model can choose any of the functions you specified, is forced to call exactly one specific function, or can’t use them at all (a configuration sketch follows this list).
- Execute Function Code.
- Answer the user.
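Before the full example, here is a hedged sketch of the tool-choice options mentioned above, based on my reading of the google-genai SDK (the ToolConfig/FunctionCallingConfig types and mode names are assumptions to verify against the documentation):
from google.genai import types

# Assumption: tool choice is controlled through a ToolConfig:
#   AUTO (default) -> the model decides whether to call a function
#   ANY            -> the model must call one of the allowed functions
#   NONE           -> the model cannot call any function
forced_config = types.GenerateContentConfig(
    tools=[tools],  # the types.Tool object built in the full example below
    tool_config=types.ToolConfig(
        function_calling_config=types.FunctionCallingConfig(
            mode="ANY",
            allowed_function_names=["add_shopping_items"],
        )
    ),
)
Now, here is the full function-calling example, which uses the default AUTO behavior: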
# Imports (requires the google-genai package; see the Colab notebook for full setup)
import os
from typing import List

from google import genai
from google.genai import types

# Shopping list
shopping_list: List[str] = []

# Functions
def add_shopping_items(items: List[str]):
    """Add multiple items to the shopping list."""
    for item in items:
        shopping_list.append(item)
    return {"status": "ok", "added": items}

def list_shopping_items():
    """Return all items currently in the shopping list."""
    return {"shopping_list": shopping_list}

# Function declarations
add_shopping_items_declaration = {
    "name": "add_shopping_items",
    "description": "Add one or more items to the shopping list",
    "parameters": {
        "type": "object",
        "properties": {
            "items": {
                "type": "array",
                "items": {"type": "string"},
                "description": "A list of shopping items to add"
            }
        },
        "required": ["items"]
    }
}

list_shopping_items_declaration = {
    "name": "list_shopping_items",
    "description": "List all current items in the shopping list",
    "parameters": {
        "type": "object",
        "properties": {},
        "required": []
    }
}

# Configure Gemini
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
tools = types.Tool(function_declarations=[
    add_shopping_items_declaration,
    list_shopping_items_declaration
])
config = types.GenerateContentConfig(tools=[tools])

# User input
user_input = (
    "Hey there! I'm planning to bake a chocolate cake later today, "
    "but I realized I'm out of flour and chocolate chips. "
    "Could you please add those items to my shopping list?"
)

# Send the user input to Gemini
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=user_input,
    config=config,
)

print("Model Output Function Call")
print(response.candidates[0].content.parts[0].function_call)
print("\n")

# Execute the function the model chose
tool_call = response.candidates[0].content.parts[0].function_call
if tool_call.name == "add_shopping_items":
    result = add_shopping_items(**tool_call.args)
    print(f"Function execution result: {result}")
elif tool_call.name == "list_shopping_items":
    result = list_shopping_items()
    print(f"Function execution result: {result}")
else:
    print(response.candidates[0].content.parts[0].text)
In this code, we create two functions: add_shopping_items and list_shopping_items. We defined each function and its function declaration, configured Gemini, and created a user input. The model had two functions available, but as you can see, it chose add_shopping_items with args={"items": ["flour", "chocolate chips"]}, which was exactly what we were expecting. Finally, we executed the function based on the model output, and those items were added to the shopping_list.

External data
Sometimes, your model doesn’t have the right information to answer properly or perform a task. Access to external data allows us to provide additional data to the model, beyond its foundational training data, without needing to retrain or fine-tune it on that data.
Example of the data:
- Website content
- Structured Data in formats like PDF, Word Docs, CSV, Spreadsheets, etc.
- Unstructured Data in formats like HTML, PDF, TXT, etc.
One of the most common uses of a data store is the implementation of RAG.
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) means:
- Retrieval -> When the user asks the LLM a question, the RAG system searches an external source to retrieve relevant information for the query.
- Augmented -> The relevant information will be incorporated into the prompt.
- Generation -> The LLM then generates a response based on both the original prompt and the additional context retrieved.
Here, I’ll show you the steps of a standard RAG. We have two pipelines, one for storing and the other for retrieving.

First, we have to load the documents, split them into smaller chunks of text, embed each chunk, and store them in a vector database.
Important:
- Breaking down large documents into smaller chunks is important because it makes a more focused retrieval, and LLMs also have context window limits.
- Embeddings create numerical representations for pieces of text. The embedding vector tries to capture the meaning, so text with similar content will have similar vectors.
The second pipeline retrieves the relevant information based on a user query. First, embed the user query and retrieve the most relevant chunks from the vector store using some similarity calculation, such as basic semantic similarity or maximum marginal relevance (MMR), between the embedded chunks and the embedded user query. Then combine the most relevant chunks and add them to the LLM prompt along with the original instructions, so the model can generate an answer based on this new context and the original query.
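To make the two pipelines concrete, here is a minimal, hedged sketch using the Gemini embedding API and plain cosine similarity. The model name text-embedding-004 and the embed_content call are assumptions based on the google-genai SDK, and the chunk list is a toy stand-in for a real document splitter and vector database:
import numpy as np

# --- Storage pipeline: chunk the documents and embed each chunk ---
chunks = [
    "Forest Breeze is a woodsy candle with a 40 h burn time and costs $18.",
    "Vanilla Glow is a warm vanilla candle with a 35 h burn time and costs $16.",
    "The shop offers free pickup on orders above $30.",
]

def embed(text: str) -> np.ndarray:
    # Assumption: embedding one string returns a single embedding in .embeddings[0]
    result = client.models.embed_content(model="text-embedding-004", contents=text)
    return np.array(result.embeddings[0].values)

vector_store = [(chunk, embed(chunk)) for chunk in chunks]  # toy in-memory "vector database"

# --- Retrieval pipeline: embed the query and rank chunks by cosine similarity ---
def retrieve(query: str, k: int = 2) -> list:
    q = embed(query)
    scored = [
        (float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))), chunk)
        for chunk, v in vector_store
    ]
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

# --- Augmented generation: add the retrieved chunks to the prompt ---
query = "How long does Vanilla Glow burn?"
context = "\n".join(retrieve(query))
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=f"Answer using only this context:\n{context}\n\nQuestion: {query}",
)
print(response.text)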
In summary, you can give your agent more knowledge and the ability to take action with tools.
Enhancing model performance
Now that we have seen each component of an agent, let’s talk about how we could enhance the model’s performance.
There are some strategies for enhancing model performance:
- In-context learning
- Retrieval-based in-context learning
- Fine-tuning-based learning

In-context learning
In-context learning means you “teach” the model how to perform a task by giving examples directly in the prompt, without changing the model’s underlying weights.
This method provides a generalized approach with a prompt, tools, and few-shot examples at inference time, allowing the model to learn “on the fly” how and when to use those tools for a specific task.
There are some types of in-context learning:

We already saw examples of Zero-shot, CoT, and ReAct in the previous sections, so now here is an example of one-shot learning:
user_query = "Carlos to set up the server by Tuesday, Maria will finalize the design specs by Thursday, and let's schedule the demo for the following Monday."

system_prompt = f"""You are a helpful assistant that reads a block of meeting transcript and extracts clear action items.
For each item, list the person responsible, the task, and its due date or timeframe in bullet-point form.
Example 1
Transcript:
'John will draft the budget by Friday. Sarah volunteers to review the marketing deck next week. We need to send invites for the kickoff.'
Actions:
- John: Draft budget (due Friday)
- Sarah: Review marketing deck (next week)
- Team: Send kickoff invites
Now you
Transcript: {user_query}
Actions:
"""

# Send the prompt to Gemini
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=system_prompt,
)
print(response.text)
Here is the output based on the user query and the example:

Retrieval-based in-context learning
Retrieval-based in-context learning means the model retrieves external context (like documents) and adds the relevant retrieved content to the prompt at inference time to enhance its response.
RAG is important because it reduces hallucinations and enables LLMs to answer questions about specific domains or private data (like a company’s internal documents) without needing to be retrained.
If you missed it, go back to the last section, where I explained RAG in detail.
Fine-tuning-based learning
Fine-tuning-based learning means you train the model further on a specific dataset to “internalize” new behaviors or knowledge. The model’s weights are updated to reflect this training. This method helps the model understand when and how to apply certain tools before receiving user queries.
There are some common techniques for fine-tuning. Here are a few examples so you can search to study further.

Analogy to compare the 3 strategies
Imagine you’re training a tour guide to receive a group of people in Iceland.
- In-Context Learning: you give the tour guide a few handwritten notes with some examples like “If someone asks about the Blue Lagoon, say this. If they ask about local food, say that”. The guide doesn’t know the country deeply, but they can follow your examples as long as the tourists stay within those topics.
- Retrieval-Based Learning: you equip the guide with a phone + map + access to Google search. The guide doesn’t need to memorize everything but knows how to look up information instantly when asked.
- Fine-Tuning: you give the guide months of immersive training in the country. The knowledge is already in their head when they start giving tours.

Where does LangChain come in?
LangChain is a framework designed to simplify the development of applications powered by large language models (LLMs).
Within the LangChain ecosystem, we have:
- LangChain: The basic framework for working with LLMs. It allows you to switch between providers or combine components when building applications without altering the underlying code. For example, you could easily swap a Gemini model for a GPT model (see the short sketch after this list). It also makes the code simpler. In the next section, I’ll compare the code we built in the function calling section with how we could do the same thing with LangChain.
- LangGraph: For building, deploying, and managing agent workflows.
- LangSmith: For debugging, testing, and monitoring your LLM applications
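As a quick illustration of that provider flexibility, here is a hedged sketch that swaps the chat model while the surrounding code stays the same (it assumes the langchain-google-genai and langchain-openai packages are installed and the corresponding API keys are configured):
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI

# Same interface, different provider: the rest of the chain or agent code is unchanged
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
# llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # drop-in replacement

print(llm.invoke("Say hi in one short sentence.").content)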
While these abstractions simplify development, it is essential to understand their underlying mechanics by checking the documentation: the convenience these frameworks provide hides implementation details that can impact performance, debugging, and customization options if not properly understood.
Beyond LangChain, you might also consider OpenAI’s Agents SDK or Google’s Agent Development Kit (ADK), which offer different approaches to building agent systems.
Let’s build an agent using LangChain
Here, unlike the code in the “Function Calling” section, we don’t have to manually create function declarations. Using the @tool decorator above our functions, LangChain automatically converts them into structured descriptions that are passed to the model behind the scenes.
ChatPromptTemplate organizes information in your prompt, creating consistency in how information is presented to the model. It combines the system instructions, the user’s query, and the agent’s working memory. This way, the LLM always gets information in a format it can easily work with.
The MessagesPlaceholder component reserves a place in the prompt template, and agent_scratchpad is the agent’s working memory. It contains the history of the agent’s thoughts, tool calls, and the results of those calls. This allows the model to see its previous reasoning steps and tool outputs, enabling it to build on past actions and make informed decisions.
Another key difference is that we don’t have to implement the logic with conditional statements to execute the functions. The create_openai_tools_agent function creates an agent that can reason about which tools to use and when. In addition, the AgentExecutor orchestrates the process, managing the conversation between the user, agent, and tools. The agent determines which tool to use through its reasoning process, and the executor takes care of executing the function and handling the result.
# Imports (requires the langchain, langchain-core, and langchain-google-genai packages)
from typing import List

from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI

# Shopping list
shopping_list = []

# Functions
@tool
def add_shopping_items(items: List[str]):
    """Add multiple items to the shopping list."""
    for item in items:
        shopping_list.append(item)
    return {"status": "ok", "added": items}

@tool
def list_shopping_items():
    """Return all items currently in the shopping list."""
    return {"shopping_list": shopping_list}

# Configuration
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0
)
tools = [add_shopping_items, list_shopping_items]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that helps manage shopping lists. "
               "Use the available tools to add items to the shopping list "
               "or list the current items when requested by the user."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

# Create the agent
agent = create_openai_tools_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# User input
user_input = (
    "Hey there! I'm planning to bake a chocolate cake later today, "
    "but I realized I'm out of flour and chocolate chips. "
    "Could you please add those items to my shopping list?"
)

# Send the user input to the agent
response = agent_executor.invoke({"input": user_input})
When we use verbose=True, we can see the agent’s reasoning and actions while the code is being executed.

And the final result:

When should you build an agent?
Remember that we discussed agents’ definitions in the first section and saw that they operate autonomously to perform tasks. It’s cool to create agents, especially with all the hype around them. However, building an agent is not always the most efficient solution; a deterministic solution may suffice.
A deterministic solution means that the system follows clear, predefined rules without interpretation. This approach is better when the task is well-defined, stable, and benefits from clarity. It is also easier to test and debug, and it is good when you need to know exactly what happens for a given input, with no “black box”. Anthropic’s guide shows many different LLM workflows where LLMs and tools are orchestrated through predefined code paths.
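To contrast with an agent, here is a minimal sketch of a workflow in Anthropic’s sense: two LLM calls chained through a fixed code path, where the code, not the model, decides what happens next. It reuses the google-genai client from the earlier examples, and the prompts are only illustrative:
# Step 1: a fixed first call -- extract the items mentioned by the user
items = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="List only the product names mentioned here, comma-separated: "
             "'I want one Forest Breeze and two Vanilla Glow candles.'",
).text

# Step 2: a fixed second call that always follows step 1 -- draft a confirmation
confirmation = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=f"Write a one-sentence order confirmation for these items: {items}",
).text

print(confirmation)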
The best-practice guides for building agents from OpenAI and Anthropic both recommend first finding the simplest solution possible and only increasing complexity if needed.
When you are evaluating if you should build an agent, consider the following:
- Complex decisions: when dealing with processes that require nuanced judgment, handling exceptions, or making decisions that depend heavily on context — such as determining whether a customer is eligible for a refund.
- Difficult-to-maintain rules: if you have workflows built on complicated sets of rules that are constantly changing and are difficult to update or maintain without the risk of making mistakes.
- Dependence on unstructured data: if you have tasks that require understanding written or spoken language, getting insights from documents (PDFs, emails, images, audio, HTML pages…), or chatting with users naturally.
Conclusion
We saw that agents are systems designed to accomplish tasks on a human’s behalf independently. These agents are composed of a model, instructions, and tools to access external data and take actions. There are several ways to enhance the model: improving the prompt with examples, using RAG to give more context, or fine-tuning it. When building an agent or LLM workflow, LangChain can help simplify the code, but you should understand what the abstractions are doing. Always keep in mind that simplicity is the best way to build agentic systems, and only follow a more complex approach if needed.
Next Steps
If you are new to this content, I recommend that you digest all of this first, read it a few times, and also read the full articles I recommended so you have a solid foundation. Then, try to start building something, like a simple application, to start practicing and creating the bridge between this theoretical content and the practice. Beginning to build is the best way to learn these concepts.
As I told you before, I have a simple step-by-step guide for creating a chat in Streamlit and deploying it. There is also a video on YouTube explaining this guide in Portuguese. It is a good starting point if you haven’t done anything before.
I hope you enjoyed this tutorial.
You can find all the code for this project on my GitHub or Google Colab.
Resources
Building effective agents – Anthropic
Agents – Google
A practical guide to building agents – OpenAI
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models – Google Research
ReAct: Synergizing Reasoning and Acting in Language Models – Google Research
Small Language Models: A Guide With Examples – DataCamp