Data analysis is a crucial step in making data-driven decisions. However, manually analyzing data and deriving insights can be time-consuming. Fortunately, autonomous AI agents can perform such tasks efficiently. This article walks through the process of building a data analyst agent and a machine learning agent to analyze and predict survival rates using the Titanic dataset from Kaggle. We will leverage smolagents, transformers, seaborn, and sklearn to build an automated data analysis pipeline.
Before setting up the agent, install the necessary Python libraries by running:
!pip install seaborn scikit-learn smolagents transformers -q -U
These packages help with data visualization, AI model inference, and automated code execution.
A CodeAgent can execute Python code, making it ideal for data analysis. We use the meta-llama/Llama-3.1-70B-Instruct model from Hugging Face to power our agent. First, import the required modules and log in to Hugging Face:
from smolagents import HfApiModel, CodeAgent
from huggingface_hub import login
import os

login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))
model = HfApiModel("meta-llama/Llama-3.1-70B-Instruct")
Now, we instantiate the CodeAgent with authorized libraries for data science tasks:
agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_steps=10,  # caps the agent's reasoning/execution steps
)
This setup allows the agent to execute data analysis operations autonomously.
Before running the agent, create a folder to save visualizations:
import os
os.makedirs("./figures", exist_ok=True)
Next, define notes about dataset variables to help guide the agent:
additional_notes = """
### Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1. If estimated, it is in the form xx.5
sibsp: Number of siblings/spouses aboard
parch: Number of parents/children aboard
"""