At every company I have worked for, there it was: the resilient MS Excel.
Excel was first released in 1985 and has remained strong ever since. It has survived the rise of relational databases, the evolution of countless programming languages, the Internet with its endless supply of online applications, and now it is surviving the era of AI as well.
Phew!
Do you have any doubts about how resilient Excel is? I don’t.
I think the reason is how quickly you can open and manipulate a document. Think about this situation: we’re at work, in a meeting, and suddenly the leadership shares a CSV file and asks for a quick calculation or a few calculated numbers. Now, the options are:
1. Open an IDE (or a notebook) and start coding like crazy to generate a simple matplotlib graphic.
2. Open Power BI, import the data, and start creating a report with dynamic graphics.
3. Open the CSV in Excel, write a couple of formulas, and create a graphic.
I can’t speak for you, but many times I go for option 3. Especially because Excel files are compatible with everything, easily shareable, and beginner-friendly.
I am saying all of this as an introduction to make my point: I don’t think Excel files are going away anytime soon, even with the fast development of AI. Many will love that; many will hate it.
So my move here was to leverage AI to make Excel files better documented. One of the main complaints data teams have about Excel is the lack of best practices and reproducibility: columns can carry any name and any data type, with zero documentation.
So, I have created an AI Agent that reads the Excel file and creates this small documentation. Here is how it works:
- The Excel file is converted to CSV and fed into the Large Language Model (LLM).
- The AI Agent generates the data dictionary with column information (variable name, data type, description).
- The data dictionary gets added as comments to the Excel file’s header.
- The output file is saved with the comments.
Ok. Hands-on now. Let’s get that done in this tutorial.
Code
We will begin by setting up a virtual environment. Create a venv with the tool of your choice, such as Poetry, Python venv, Anaconda, or UV. I really like UV, as it is the fastest and simplest, in my opinion. If you have UV installed [5], open a terminal and create your venv.
uv init data-docs
cd data-docs
uv venv
uv add streamlit openpyxl pandas agno mcp google-genai
Now, let us import the necessary modules. This project was created with Python 3.12.1, but I believe Python 3.9 or higher should do the trick. We will use:
- Agno: for AI Agent management
- OpenPyxl: for manipulating Excel files
- Streamlit: for the front-end interface
- Pandas, OS, JSON, Dedent, and Google Genai as support modules
# Imports
import os
import json
import streamlit as st
from textwrap import dedent
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from openpyxl import load_workbook
from openpyxl.comments import Comment
import pandas as pd
Great. The next step is creating the functions we’ll need to handle the Excel files and to create the AI Agent.
Notice that all the functions have detailed docstrings. This is intentional because LLMs use docstrings to know what a given function does and decide whether to use it or not as a tool.
So, if you’re using Python functions as tools for an AI agent, make sure to write detailed docstrings. Nowadays, with free copilots such as Windsurf [6], it is even easier to create them.
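To see why this matters, here is a small sketch of what an agent framework typically extracts from a function before presenting it to the LLM as a tool: the docstring and the signature. This is a generic illustration using the standard library, not Agno-specific internals.

```python
import inspect

def convert_to_csv(file_path: str):
    """
    Use this tool to convert the excel file to CSV.
    * file_path: Path to the Excel file to be converted
    """
    ...

# Agent frameworks generally read the docstring and signature to build
# the tool description the model sees when deciding whether to call it.
doc = inspect.getdoc(convert_to_csv)
sig = inspect.signature(convert_to_csv)
print(doc.splitlines()[0])   # first line acts as the tool summary
print(list(sig.parameters))  # parameter names exposed to the model
```

If the docstring is empty or vague, the model has nothing to reason about, and the tool may never be called.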
Converting the file to CSV
This function will:
- Take the Excel file and read only the first 10 rows. This is enough to send to the LLM, and it also prevents us from sending too many input tokens and making the agent too expensive.
- Save the file as CSV to use as input for the AI Agent. The CSV format is easier for the model to ingest, as it is just text separated by commas, and we know LLMs shine when working with text.
Here is the function.
def convert_to_csv(file_path: str):
    """
    Use this tool to convert the excel file to CSV.
    * file_path: Path to the Excel file to be converted
    """
    # Load the file, keeping only the first 10 rows
    df = pd.read_excel(file_path).head(10)
    # Convert to CSV
    st.write("Converting to CSV... :leftwards_arrow_with_hook:")
    return df.to_csv('temp.csv', index=False)
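The trimming step is worth a quick sanity check on its own. Here is the same `head(10)` logic in isolation, using a hypothetical DataFrame in place of a real spreadsheet (no Streamlit or Excel involved, so the sketch stays self-contained):

```python
import pandas as pd

# Hypothetical sample data standing in for a real spreadsheet
df = pd.DataFrame({"order_id": range(1, 101), "amount": [1.5] * 100})

# Same trimming idea as convert_to_csv: only the first 10 rows go to
# the LLM, which keeps the input token count (and cost) small.
preview = df.head(10)
preview.to_csv("temp.csv", index=False)

reloaded = pd.read_csv("temp.csv")
print(len(reloaded))           # 10
print(list(reloaded.columns))  # ['order_id', 'amount']
```

Ten rows are enough for the model to infer column names and types; sending the full sheet would only add cost, not accuracy.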
Let’s move on.
Creating the Agent
The next function creates the AI agent. I am using Agno [1], as it is very versatile and easy to use. I also chose the model Gemini 2.0 Flash. During the test phase, this was the best-performing model for generating the data docs. To use it, you will need an API key from Google. Don’t forget to get one here [7].
The function:
- Receives the CSV output from the previous function.
- Passes it through the AI Agent, which generates the data dictionary with column name, description, and data type.
- Notice that the description argument is the prompt for the agent. Make it detailed and precise.
- The data dictionary will be saved as a JSON file using a tool called FileTools that can read and write files.
- I have set up retries=2 so we can work around any error on a first try.
def create_agent(api_key):
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash", api_key=api_key),
        description=dedent("""\
            You are an agent that reads the temp.csv dataset presented to you and,
            based on the name and data type of each column header, determines the following information:
            - The data types of each column
            - The description of each column
            - The first column number is 0
            Using the FileTools provided, create a data dictionary in JSON format that includes the below information:
            <ColNumber>: ColName: <ColName>, DataType: <DataType>, Description: <Description>
            If you are unable to determine the data type or description of a column, return 'N/A' for that column for the missing values.
            \
            """),
        tools=[FileTools(read_files=True, save_files=True)],
        retries=2,
        show_tool_calls=True
    )
    return agent
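Before moving on, it helps to picture the JSON the prompt asks for. The model's actual output can vary, so treat the following as a hand-written example of the expected shape; the next function looks entries up by stringified column index, which is why the keys are "0", "1", and so on:

```python
import json

# Hand-written example of the data dictionary format the prompt requests;
# the real model output may differ slightly in wording.
data_dict = {
    "0": {"ColName": "order_id", "DataType": "int", "Description": "Unique order identifier"},
    "1": {"ColName": "amount", "DataType": "float", "Description": "Order total in USD"},
}

with open("data_dict.json", "w") as f:
    json.dump(data_dict, f)

# Downstream code indexes by the stringified column number
loaded = json.load(open("data_dict.json"))
print(loaded["0"]["ColName"])  # order_id
```

If the model emits a different structure, the lookup by `str(n)` in the next function will fail, which is exactly the kind of error the `retries=2` setting helps absorb.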
Ok. Now we need another function to save the data dictionary to the file.
Adding Data Dictionary to the File’s Header
This is the last function to be created. It will:
- Get the data dictionary JSON from the previous step and the original Excel file.
- Add the data dictionary to the file’s header as comments.
- Save the output file.
- Once the file is saved, display a download button so the user can get the modified file.
def add_comments_to_header(file_path: str, data_dict: str = "data_dict.json"):
    """
    Use this tool to add the data dictionary data_dict.json as comments to the header of an Excel file and save the output file.
    The function takes the Excel file path as argument and adds the data_dict.json as comments to each cell
    in the first row of the Excel file, counting columns from 0, using the following format:
    * Column Number: <column_number>
    * Column Name: <column_name>
    * Data Type: <data_type>
    * Description: <description>

    Parameters
    ----------
    * file_path : str
        The path to the Excel file to be processed
    * data_dict : str
        The path to the JSON data dictionary containing the column number, column name, data type, and description
    """
    # Load the data dictionary
    data_dict = json.load(open(data_dict))
    # Load the workbook
    wb = load_workbook(file_path)
    # Get the active worksheet
    ws = wb.active
    # Iterate over each column in the first row (header)
    for n, col in enumerate(ws.iter_cols(min_row=1, max_row=1)):
        for header_cell in col:
            header_cell.comment = Comment(dedent(f"""\
                ColName: {data_dict[str(n)]['ColName']},
                DataType: {data_dict[str(n)]['DataType']},
                Description: {data_dict[str(n)]['Description']}\
                """), 'AI Agent')
    # Save the workbook
    st.write("Saving File... :floppy_disk:")
    wb.save('output.xlsx')
    # Create a download button
    with open('output.xlsx', 'rb') as f:
        st.download_button(
            label="Download output.xlsx",
            data=f,
            file_name='output.xlsx',
            mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        )
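If you want to confirm the comments actually landed in the header, openpyxl can read them back. Here is a minimal round-trip sketch (the file name and comment text are illustrative, not the agent's real output):

```python
from openpyxl import Workbook, load_workbook
from openpyxl.comments import Comment

# Build a tiny workbook with one commented header cell, mimicking
# what add_comments_to_header produces.
wb = Workbook()
ws = wb.active
ws.append(["order_id", "amount"])
ws["A1"].comment = Comment(
    "ColName: order_id,\nDataType: int,\nDescription: Unique id", "AI Agent"
)
wb.save("output.xlsx")

# Reload and inspect the comment on the first header cell
check = load_workbook("output.xlsx")
cmt = check.active["A1"].comment
print(cmt.author)             # AI Agent
print("ColName" in cmt.text)  # True
```

Hovering over the header cell in Excel shows the same text, so this is an easy way to verify the output file without opening a spreadsheet application.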
Ok. The next step is to glue all of this together on a Streamlit front-end script.
Streamlit Front-End
In this step, I could have created a different file for the front-end and imported the functions in there. But I decided to use the same file, so let’s start with the famous:
if __name__ == "__main__":
First, a couple of lines to configure the page and the messages displayed in the web application. The content will be centered on the page, and there is some information about how the app works.
# Config page Streamlit
st.set_page_config(layout="centered",
page_title="Data Docs",
page_icon=":paperclip:",
initial_sidebar_state="expanded")
# Title
st.title("Data Docs :paperclip:")
st.subheader("Generate a data dictionary for your Excel file.")
st.caption("1. Enter your Gemini API key and the path of the Excel file on the sidebar.")
st.caption("2. Run the agent.")
st.caption("3. The agent will generate a data dictionary and add it as comments to the header of the Excel file.")
st.caption("ColName: <ColName> | DataType: <DataType> | Description: <Description>")
st.divider()
Next, we’ll set up the sidebar, where the user can input their Google API key and select a .xlsx file to be modified.
There is a button to run the application, another to reset the app state, and a progress bar. Nothing too fancy.
with st.sidebar:
# Enter your API key
st.caption("Enter your API key and the path of the Excel file.")
api_key = st.text_input("API key: ", placeholder="Google Gemini API key", type="password")
# Upload file
input_file = st.file_uploader("File upload",
type='xlsx')
# Run the agent
agent_run = st.button("Run")
# progress bar
progress_bar = st.empty()
progress_bar.progress(0, text="Initializing...")
st.divider()
# Reset session state
if st.button("Reset Session"):
st.session_state.clear()
st.rerun()
Once the run button is clicked, it triggers the rest of the code to run the Agent. Here is the sequence of steps performed:
- The first function is called to transform the file to CSV.
- The progress is registered on the progress bar.
- The Agent is created.
- The progress bar is updated.
- A prompt is fed into the agent to read the temp.csv file, create the data dictionary, and save the output to data_dict.json.
- The data dictionary is printed on the screen, so the user can see what was generated while it’s being saved to the Excel file.
- The Excel file is modified and saved.
# Create the agent
if agent_run:
# Convert Excel file to CSV
convert_to_csv(input_file)
# Register progress
progress_bar.progress(15, text="Processing CSV...")
# Create the agent
agent = create_agent(api_key)
# Start the script
st.write("Running Agent... :runner:")
# Register progress
progress_bar.progress(50, text="AI Agent is running...")
# Run the agent
agent.print_response(dedent(f"""\
1. Use FileTools to read the temp.csv as input to create the data dictionary for the columns in the dataset.
2. Using the FileTools tool, save the data dictionary to a file named 'data_dict.json'.
\
"""),
markdown=True)
# Print the data dictionary
st.write("Generating Data Dictionary... :page_facing_up:")
with open('data_dict.json', 'r') as f:
data_dict = json.load(f)
st.json(data_dict, expanded=False)
# Add comments to header
add_comments_to_header(input_file, 'data_dict.json')
# Remove temporary files
st.write("Removing temporary files... :wastebasket:")
os.remove('temp.csv')
os.remove('data_dict.json')
# If file exists, show success message
if os.path.exists('output.xlsx'):
st.success("Done! :white_check_mark:")
os.remove('output.xlsx')
# Progress bar end
progress_bar.progress(100, text="Done!")
That’s it. Here is a demonstration of the agent in action.

Beautiful result!
Try It
You can try the deployed app here: https://excel-datadocs.streamlit.app/
Before You Go
In my humble opinion, Excel files are not going away anytime soon. Love them or hate them, we’ll have to live with them for a while.
Excel files are versatile and easy to handle and share, so they remain very useful for routine ad hoc tasks at work.
However, now we can leverage AI to help us handle those files and make them better. Artificial intelligence is touching so many parts of our lives, and our routines and tools at work are just one more of them.
Let’s take advantage of AI and work smarter every day!
If you liked this content, you can find more of my work on my website and GitHub, shared below.
GitHub Repository
Here is the GitHub Repository for this project.
https://github.com/gurezende/Data-Dictionary-GenAI
Find Me
You can find more about my work on my website.
https://gustavorsantos.me
References
[1. Agno Docs] https://docs.agno.com/introduction/agents
[2. Openpyxl Docs] https://openpyxl.readthedocs.io/en/stable/index.html
[3. Streamlit Docs] https://docs.streamlit.io/
[4. Data-Docs Web App] https://excel-datadocs.streamlit.app/
[5. Installing UV] https://docs.astral.sh/uv/getting-started/installation/
[6. Windsurf Coding Copilot] https://windsurf.com/vscode_tutorial
[7. Google Gemini API Key] https://ai.google.dev/gemini-api/docs/api-key