At every company I have worked for, there it was: the resilient MS Excel.
Excel was first released in 1985 and has remained strong ever since. It has survived the rise of relational databases, the evolution of countless programming languages, the Internet with its endless supply of online applications, and now it is surviving the era of AI as well.
Phew!
Do you have any doubts about how resilient Excel is? I don’t.
I think the reason is how quickly you can open and manipulate a document. Think about this situation: we’re at work, in a meeting, and suddenly the leadership shares a CSV file and asks for a quick calculation or a few calculated numbers. Now, the options are:
1. Open an IDE (or a notebook) and start coding like crazy to generate a simple matplotlib graphic.
2. Open Power BI, import the data, and start creating a report with dynamic graphics.
3. Open the CSV in Excel, write a couple of formulas, and create a graphic.
I can’t speak for you, but many times I go for option 3. Especially because Excel files are compatible with everything, easily shareable, and beginner-friendly.
I am saying all of this as an introduction to make my point: I don’t think Excel files are going away anytime soon, even with the fast development of AI. Many will love that; many will hate it.
So my move here was to leverage AI to make Excel files better documented. One of the main complaints data teams have about Excel is the lack of best practices and reproducibility: columns can carry any name and any data type, with zero documentation.
So, I have created an AI Agent that reads the Excel file and creates this small documentation. Here is how it works:
- The Excel file is converted to CSV and fed into the Large Language Model (LLM).
- The AI Agent generates the data dictionary with column information (variable name, data type, description).
- The data dictionary gets added as comments to the Excel file’s header.
- The output file is saved with the comments.
Ok. Hands-on now. Let’s get that done in this tutorial.
Code
We will begin by setting up a virtual environment. Create a venv with the tool of your choice, such as Poetry, Python venv, Anaconda, or UV. I really like UV, as it is the fastest and simplest, in my opinion. If you have UV installed [5], open a terminal and create your venv.
uv init data-docs
cd data-docs
uv venv
uv add streamlit openpyxl pandas agno mcp google-genai
Now, let us import the necessary modules. This project was created with Python 3.12.1, but I believe Python 3.9 or higher should do the trick. We will use:
- Agno: for AI Agent management
- OpenPyxl: for manipulating Excel files
- Streamlit: for the front-end interface
- Pandas, OS, JSON, Dedent, and Google Genai as support modules
# Imports
import os
import json
import streamlit as st
from textwrap import dedent
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from openpyxl import load_workbook
from openpyxl.comments import Comment
import pandas as pd
Great. The next step is creating the functions we’ll need to handle the Excel files and to create the AI Agent.
Notice that all the functions have detailed docstrings. This is intentional because LLMs use docstrings to know what a given function does and decide whether to use it or not as a tool.
So, if you’re using Python functions as tools for an AI agent, make sure to write detailed docstrings. Nowadays, with free copilots such as Windsurf [6], it is even easier to create them.
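To see why this matters, here is a small sketch of what an agent framework typically extracts from a function before presenting it to the LLM as a tool: the docstring and the signature. This is a generic illustration using the standard library, not Agno-specific internals.

```python
import inspect

def convert_to_csv(file_path: str):
    """
    Use this tool to convert the excel file to CSV.
    * file_path: Path to the Excel file to be converted
    """
    ...

# Agent frameworks generally read the docstring and signature to build
# the tool description the model sees when deciding whether to call it.
doc = inspect.getdoc(convert_to_csv)
sig = inspect.signature(convert_to_csv)
print(doc.splitlines()[0])   # first line acts as the tool summary
print(list(sig.parameters))  # parameter names exposed to the model
```

If the docstring is empty or vague, the model has nothing to reason about, and the tool may never be called.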
Converting the file to CSV
This function will:
- Take the Excel file and read only the first 10 rows. This is enough to send to the LLM, and it also prevents us from sending too many input tokens and making the agent too expensive.
- Save the file as CSV to use as input for the AI Agent. The CSV format is easier for the model to ingest, as it is just text separated by commas, and we know LLMs shine when working with text.
Here is the function.
def convert_to_csv(file_path: str):
    """
    Use this tool to convert the excel file to CSV.
    * file_path: Path to the Excel file to be converted
    """
    # Load the file, keeping only the first 10 rows
    df = pd.read_excel(file_path).head(10)
    # Convert to CSV
    st.write("Converting to CSV... :leftwards_arrow_with_hook:")
    return df.to_csv('temp.csv', index=False)
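The trimming step is worth a quick sanity check on its own. Here is the same `head(10)` logic in isolation, using a hypothetical DataFrame in place of a real spreadsheet (no Streamlit or Excel involved, so the sketch stays self-contained):

```python
import pandas as pd

# Hypothetical sample data standing in for a real spreadsheet
df = pd.DataFrame({"order_id": range(1, 101), "amount": [1.5] * 100})

# Same trimming idea as convert_to_csv: only the first 10 rows go to
# the LLM, which keeps the input token count (and cost) small.
preview = df.head(10)
preview.to_csv("temp.csv", index=False)

reloaded = pd.read_csv("temp.csv")
print(len(reloaded))           # 10
print(list(reloaded.columns))  # ['order_id', 'amount']
```

Ten rows are enough for the model to infer column names and types; sending the full sheet would only add cost, not accuracy.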
Let’s move on.
Creating the Agent
The next function creates the AI agent. I am using Agno [1], as it is very versatile and easy to use. I also chose the model Gemini 2.0 Flash. During the test phase, this was the best-performing model for generating the data docs. To use it, you will need an API key from Google. Don’t forget to get one here [7].
The function:
- Receives the CSV output from the previous function.
- Passes it through the AI Agent, which generates the data dictionary with column name, description, and data type.
- Notice that the description argument is the prompt for the agent. Make it detailed and precise.
- The data dictionary will be saved as a JSON file using a tool called FileTools that can read and write files.
- I have set up retries=2 so we can work around any error on a first try.
def create_agent(api_key):
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash", api_key=api_key),
        description=dedent("""\
            You are an agent that reads the temp.csv dataset presented to you and,
            based on the name and data type of each column header, determines the following information:
            - The data types of each column
            - The description of each column
            - The first column number is 0
            Using the FileTools provided, create a data dictionary in JSON format that includes the below information:
            <ColNumber>: ColName: <ColName>, DataType: <DataType>, Description: <Description>
            If you are unable to determine the data type or description of a column, return 'N/A' for that column for the missing values.
            \
            """),
        tools=[FileTools(read_files=True, save_files=True)],
        retries=2,
        show_tool_calls=True
    )
    return agent
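Before moving on, it helps to picture the JSON the prompt asks for. The model's actual output can vary, so treat the following as a hand-written example of the expected shape; the next function looks entries up by stringified column index, which is why the keys are "0", "1", and so on:

```python
import json

# Hand-written example of the data dictionary format the prompt requests;
# the real model output may differ slightly in wording.
data_dict = {
    "0": {"ColName": "order_id", "DataType": "int", "Description": "Unique order identifier"},
    "1": {"ColName": "amount", "DataType": "float", "Description": "Order total in USD"},
}

with open("data_dict.json", "w") as f:
    json.dump(data_dict, f)

# Downstream code indexes by the stringified column number
loaded = json.load(open("data_dict.json"))
print(loaded["0"]["ColName"])  # order_id
```

If the model emits a different structure, the lookup by `str(n)` in the next function will fail, which is exactly the kind of error the `retries=2` setting helps absorb.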
Ok. Now we need another function to save the data dictionary to the file.
Adding Data Dictionary to the File’s Header
This is the last function to be created. It will:
- Get the data dictionary JSON from the previous step and the original Excel file.
- Add the data dictionary to the file’s header as comments.
- Save the output file.
- Once the file is saved, display a download button so the user can get the modified file.
def add_comments_to_header(file_path: str, data_dict: str = "data_dict.json"):
    """
    Use this tool to add the data dictionary data_dict.json as comments to the header of an Excel file and save the output file.
    The function takes the Excel file path as argument and adds the data_dict.json as comments to each cell
    in the first row of the Excel file, counting columns from 0, using the following format:
    * Column Number: <column_number>
    * Column Name: <column_name>
    * Data Type: <data_type>
    * Description: <description>

    Parameters
    ----------
    * file_path : str
        The path to the Excel file to be processed
    * data_dict : str
        The path to the JSON data dictionary containing the column number, column name, data type, and description
    """
    # Load the data dictionary
    data_dict = json.load(open(data_dict))
    # Load the workbook
    wb = load_workbook(file_path)
    # Get the active worksheet
    ws = wb.active
    # Iterate over each column in the first row (header)
    for n, col in enumerate(ws.iter_cols(min_row=1, max_row=1)):
        for header_cell in col:
            header_cell.comment = Comment(dedent(f"""\
                ColName: {data_dict[str(n)]['ColName']},
                DataType: {data_dict[str(n)]['DataType']},
                Description: {data_dict[str(n)]['Description']}\
                """), 'AI Agent')
    # Save the workbook
    st.write("Saving File... :floppy_disk:")
    wb.save('output.xlsx')
    # Create a download button
    with open('output.xlsx', 'rb') as f:
        st.download_button(
            label="Download output.xlsx",
            data=f,
            file_name='output.xlsx',
            mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        )
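If you want to confirm the comments actually landed in the header, openpyxl can read them back. Here is a minimal round-trip sketch (the file name and comment text are illustrative, not the agent's real output):

```python
from openpyxl import Workbook, load_workbook
from openpyxl.comments import Comment

# Build a tiny workbook with one commented header cell, mimicking
# what add_comments_to_header produces.
wb = Workbook()
ws = wb.active
ws.append(["order_id", "amount"])
ws["A1"].comment = Comment(
    "ColName: order_id,\nDataType: int,\nDescription: Unique id", "AI Agent"
)
wb.save("output.xlsx")

# Reload and inspect the comment on the first header cell
check = load_workbook("output.xlsx")
cmt = check.active["A1"].comment
print(cmt.author)             # AI Agent
print("ColName" in cmt.text)  # True
```

Hovering over the header cell in Excel shows the same text, so this is an easy way to verify the output file without opening a spreadsheet application.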
Ok. The next step is to glue all of this together on a Streamlit front-end script.
Streamlit Front-End
In this step, I could have created a different file for the front-end and imported the functions in there. But I decided to use the same file, so let’s start with the famous:
if __name__ == "__main__":
First, a couple of lines to configure the page and the messages displayed in the web application. The content will be centered on the page, and there is some information about how the app works.
# Config page Streamlit
st.set_page_config(layout="centered",
page_title="Data Docs",
page_icon=":paperclip:",
initial_sidebar_state="expanded")
# Title
st.title("Data Docs :paperclip:")
st.subheader("Generate a data dictionary for your Excel file.")
st.caption("1. Enter your Gemini API key and the path of the Excel file on the sidebar.")
st.caption("2. Run the agent.")
st.caption("3. The agent will generate a data dictionary and add it as comments to the header of the Excel file.")
st.caption("ColName: <ColName> | DataType: <DataType> | Description: <Description>")
st.divider()
Next, we’ll set up the sidebar, where the user can input their Google API key and select a .xlsx file to be modified.
There is a button to run the application, another to reset the app state, and a progress bar. Nothing too fancy.
with st.sidebar:
# Enter your API key
st.caption("Enter your API key and the path of the Excel file.")
api_key = st.text_input("API key: ", placeholder="Google Gemini API key", type="password")
# Upload file
input_file = st.file_uploader("File upload",
type='xlsx')
# Run the agent
agent_run = st.button("Run")
# progress bar
progress_bar = st.empty()
progress_bar.progress(0, text="Initializing...")
st.divider()
# Reset session state
if st.button("Reset Session"):
st.session_state.clear()
st.rerun()
Once the run button is clicked, it triggers the rest of the code to run the Agent. Here is the sequence of steps performed:
- The first function is called to transform the file to CSV.
- The progress is registered on the progress bar.
- The Agent is created.
- The progress bar is updated.
- A prompt is fed into the agent to read the temp.csv file, create the data dictionary, and save the output to data_dict.json.
- The data dictionary is printed on the screen, so the user can see what was generated while it’s being saved to the Excel file.
- The Excel file is modified and saved.
# Create the agent
if agent_run:
# Convert Excel file to CSV
convert_to_csv(input_file)
# Register progress
progress_bar.progress(15, text="Processing CSV...")
# Create the agent
agent = create_agent(api_key)
# Start the script
st.write("Running Agent... :runner:")
# Register progress
progress_bar.progress(50, text="AI Agent is running...")
# Run the agent
agent.print_response(dedent(f"""\
1. Use FileTools to read the temp.csv as input to create the data dictionary for the columns in the dataset.
2. Using the FileTools tool, save the data dictionary to a file named 'data_dict.json'.
\
"""),
markdown=True)
# Print the data dictionary
st.write("Generating Data Dictionary... :page_facing_up:")
with open('data_dict.json', 'r') as f:
data_dict = json.load(f)
st.json(data_dict, expanded=False)
# Add comments to header
add_comments_to_header(input_file, 'data_dict.json')
# Remove temporary files
st.write("Removing temporary files... :wastebasket:")
os.remove('temp.csv')
os.remove('data_dict.json')
# If file exists, show success message
if os.path.exists('output.xlsx'):
st.success("Done! :white_check_mark:")
os.remove('output.xlsx')
# Progress bar end
progress_bar.progress(100, text="Done!")
That’s it. Here is a demonstration of the agent in action.

Beautiful result!
Try It
You can try the deployed app here: https://excel-datadocs.streamlit.app/
Before You Go
In my humble opinion, Excel files are not going away anytime soon. Love them or hate them, we’ll have to live with them for a while.
Excel files are versatile and easy to handle and share, so they remain very useful for routine ad hoc tasks at work.
However, now we can leverage AI to help us handle those files and make them better. Artificial intelligence is touching so many parts of our lives, and our routines and tools at work are just one more of them.
Let’s take advantage of AI and work smarter every day!
If you liked this content, you can find more of my work on my website and GitHub, shared below.
GitHub Repository
Here is the GitHub Repository for this project.
https://github.com/gurezende/Data-Dictionary-GenAI
Find Me
You can find more about my work on my website.
https://gustavorsantos.me
References
[1. Agno Docs] https://docs.agno.com/introduction/agents
[2. Openpyxl Docs] https://openpyxl.readthedocs.io/en/stable/index.html
[3. Streamlit Docs] https://docs.streamlit.io/
[4. Data-Docs Web App] https://excel-datadocs.streamlit.app/
[5. Installing UV] https://docs.astral.sh/uv/getting-started/installation/
[6. Windsurf Coding Copilot] https://windsurf.com/vscode_tutorial
[7. Google Gemini API Key] https://ai.google.dev/gemini-api/docs/api-key