Large language models (LLMs) have changed the way we work. By generating the text a task calls for, they can shorten the time that task takes.
In data science projects, LLMs can help in ways many people have not considered. This article shows how to integrate LLMs into a data science project. The steps are not strictly linear, but each one supports the project in a different way.
Curious? Let's get into it.
Data Exploration
Data exploration is something data scientists do on every project, and it is often one of the most tedious and repetitive parts of the job.
This is a natural place to bring in an LLM: we let the model assist during the exploration phase.
There are many ways to do this. One is to ask a tool such as ChatGPT or Gemini directly and copy the code it returns into your own environment.
Here we take a simpler approach and use the Pandasai library, which lets us explore data with an LLM without much setup. Start by installing the library.
pip install pandasai
Next, we will set up the LLM we want to use. Many options exist, but this tutorial will only use the OpenAI LLM. We will also use the Titanic example dataset from Kaggle.
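from pandasai import SmartDataframe
from pandasai.llm import OpenAI

# Wrap the OpenAI model so Pandasai can call it
llm = OpenAI(api_token="YOUR-API-KEY")
# Load the Titanic CSV and attach the LLM through the config dictionary
sdf = SmartDataframe("titanic.csv", config={"llm": llm})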
Once the dataset is loaded into the SmartDataframe object, we can use Pandasai to query it through the LLM.
First, we can ask what the data is about with the following code.
sdf.chat("Can you explain to me what is the dataset about?")
Output>>
The dataset contains information about Titanic passengers, including their survival status, class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, fare paid, cabin number, and embarkation point.
We can also specify the kind of exploration we want. For example, I want the percentage of missing data.
sdf.chat("What's the missing data percentage from the data?")
Output>>
Age 20.574163
Fare 0.239234
Cabin 78.229665
dtype: float64
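Because the LLM writes and runs code on your behalf, it can be worth cross-checking a result like this with plain pandas. The following is a minimal sketch on a tiny made-up frame (the values are illustrative, not the Titanic data), and `missing_percentage` is a hypothetical helper name, not part of Pandasai.

```python
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values, for columns that have any."""
    pct = df.isna().mean() * 100
    return pct[pct > 0]


# Tiny illustrative frame; the values are made up for this sketch.
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 35.0, None],
        "Fare": [7.25, 71.28, None, 8.05],
        "Sex": ["male", "female", "female", "male"],
    }
)

result = missing_percentage(toy)
print(result)
```

On the real data, the same helper could be called on `pd.read_csv("titanic.csv")` and compared with the numbers the LLM reports.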
It's also possible to generate a chart by asking Pandasai for one.
sdf.chat("Plot a chart of the fare by survived")
Try it out yourself: adjust the prompts as needed, and Pandasai will pass them to the LLM and return the result.
Feature Engineering
LLMs can also be used to discuss and generate new features. For example, with the same Pandasai setup, we can ask the model to propose new features based on our dataset.
sdf.chat("can you think about new features coming from the dataset?")
A few new features are generated according to the dataset. The output is shown in the image below.
If you need more domain-specific feature engineering, you can ask the LLM to suggest what the features should look like, or even what additional data to collect.
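As an illustration of the kind of feature such a conversation might surface, the sketch below derives a family-size column from the sibling/spouse and parent/child counts described earlier. The column names `SibSp` and `Parch` follow the Kaggle Titanic file and are an assumption here, as are the new `FamilySize` and `IsAlone` names. The rows are made up.

```python
import pandas as pd

# Made-up rows; SibSp and Parch are assumed to match the Kaggle column names.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts the passenger plus relatives aboard.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
# Flag passengers travelling without any relatives.
toy["IsAlone"] = toy["FamilySize"] == 1

print(toy)
```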
LLMs can also generate vector embeddings from your dataset, especially from text data. Because an embedding is numerical, it can be passed on to any downstream task you have.
For example, we can generate embeddings with OpenAI using the following code.
from openai import OpenAI
import pandas as pd
import numpy as np

client = OpenAI(api_key="YOUR-API-KEY")

# Small sample of review texts to embed
data = {
    "review": [
        "The product is excellent and works as expected.",
        "Terrible experience, the item broke after one use.",
        "Average quality, not worth the price.",
        "Great customer service and fast delivery.",
        "Poor build quality, but it does the job."
    ]
}
df = pd.DataFrame(data)

def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before sending the text to the API
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

# One API call per row; each result is a list of floats
df["embeddings"] = df["review"].apply(lambda x: get_embedding(x, model="text-embedding-3-small"))
Output>>
[-0.01510944 -0.00573813 -0.07566253 ... 0.01669856 0.01696768 0.00258872]
The code above produces a vector embedding for each review, which you can use for further processing.
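One common form of further processing is comparing embeddings by cosine similarity, for example to find reviews with similar meaning. The sketch below uses only the standard library and three-dimensional made-up vectors as stand-ins for real embeddings, so it runs without an API key. The `cosine_similarity` helper and the variable names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up stand-ins for embeddings; real ones have many more dimensions.
positive_1 = [0.9, 0.1, 0.0]
positive_2 = [0.8, 0.2, 0.1]
negative_1 = [0.0, 0.2, 0.9]

sim_same = cosine_similarity(positive_1, positive_2)
sim_diff = cosine_similarity(positive_1, negative_1)

# Vectors pointing in similar directions score closer to 1.
print(sim_same > sim_diff)
```

With real data, the inputs would be two entries from `df["embeddings"]`.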
Model Building
LLMs can also help your data science project by acting as the classifier themselves. For example, we can use Scikit-LLM, a Python package that brings LLMs to text analysis tasks, to classify text data.
First, we will install the library with the following code.
pip install scikit-llm
Then, we can use the library for text prediction, such as sentiment analysis, with the following code.
from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

SKLLMConfig.set_openai_key("YOUR-API-KEY")

# Labels: positive, neutral, negative
X, y = get_classification_dataset()

# Scikit-LLM 1.x takes the model name through `model` (older releases used `openai_model`)
clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")
# Zero-shot: fit() only registers the candidate labels, no training happens
clf.fit(X, y)
labels = clf.predict(X)
Output>>
array(['positive', 'positive', 'positive', 'positive', 'positive',
       'positive', 'positive', 'positive', 'positive', 'positive',
       'negative', 'negative', 'negative', 'negative', 'negative',
       'negative', 'negative', 'negative', 'negative', 'negative',
       'negative', 'neutral', 'neutral', 'neutral', 'neutral', 'negative',
       'negative', 'negative', 'neutral', 'neutral'], dtype='<U8')
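Since the dataset helper also returns the true labels in `y`, the predictions can be scored against them. The sketch below is a hand-rolled accuracy check on made-up label lists, not the actual Scikit-LLM output. The `accuracy` helper is hypothetical, and the case-insensitive comparison is an assumption in case the label casing differs between the two lists.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label, ignoring case."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(t.lower() == p.lower() for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels for illustration only.
y_true = ["positive", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "neutral", "negative"]

score = accuracy(y_true, y_pred)
print(score)
```

In the tutorial's setting this would be called as `accuracy(y, labels)`. scikit-learn's `accuracy_score` gives the same measure when the label casing already matches.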
An LLM can serve as a text classifier without any additional model training. To improve the results, you can also extend the approach with few-shot examples.
Another way an LLM can support model building and training is by generating synthetic data. An LLM can produce a dataset that resembles the actual dataset without being an exact copy. Synthetic data lets us introduce more variation, which can help the machine learning model generalize.
Here is example code for generating a synthetic dataset with an LLM.
from openai import OpenAI
import pandas as pd

client = OpenAI(api_key="YOUR-API-KEY")

# Seed rows that the LLM will imitate
data = {
    "job_title": [
        "Software Engineer",
        "Data Scientist",
        "Marketing Specialist",
        "HR Manager",
        "Financial Analyst"
    ],
    "department": [
        "Engineering",
        "Data Analytics",
        "Marketing",
        "Human Resources",
        "Finance"
    ],
    "salary": [
        "$120,000",
        "$110,000",
        "$70,000",
        "$85,000",
        "$95,000"
    ]
}
df = pd.DataFrame(data)

def generate_synthetic_data(example_row, instruction="Generate a similar row of employee data:"):
    """
    Generates synthetic data using an LLM based on an example row.
    """
    prompt = (
        f"{instruction}\n"
        "Example row:\n"
        f"Job Title: {example_row['job_title']}\n"
        f"Department: {example_row['department']}\n"
        f"Salary: {example_row['salary']}\n"
        "Synthetic row:"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content.strip()

# One API call per seed row
synthetic_data = df.apply(lambda row: generate_synthetic_data(row), axis=1)

# Assumes each response is three "Key: Value" lines: job title, department, salary
synthetic_rows = [entry.split("\n") for entry in synthetic_data]
synthetic_df = pd.DataFrame({
    "job_title": [row[0].split(":")[1].strip() for row in synthetic_rows],
    "department": [row[1].split(":")[1].strip() for row in synthetic_rows],
    "salary": [row[2].split(":")[1].strip() for row in synthetic_rows]
})
synthetic_df
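Splitting each response on newlines and colons works only when the model returns exactly three well-formed lines in the expected order. LLM output can vary, so a more forgiving parser may be useful. The sketch below is one possible approach, not part of the original workflow. `parse_row`, `EXPECTED_FIELDS`, and the sample response are hypothetical.

```python
EXPECTED_FIELDS = ("job_title", "department", "salary")


def parse_row(response_text):
    """Parse 'Key: Value' lines into a dict, keeping only expected fields."""
    row = {}
    for line in response_text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and free-form commentary
        key, value = line.split(":", 1)  # split once so values may contain colons
        key = key.strip().lower().replace(" ", "_")
        if key in EXPECTED_FIELDS:
            row[key] = value.strip()
    return row


# Made-up response with a chatty first line and a blank line in the middle.
sample = (
    "Here is a similar row:\n"
    "Job Title: Product Manager\n"
    "\n"
    "Department: Product\n"
    "Salary: $105,000"
)

parsed = parse_row(sample)
print(parsed)
```

Rows that come back with missing fields can then be detected by comparing `parsed.keys()` with `EXPECTED_FIELDS` before building the DataFrame.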
Even a simple approach like this can improve your model. Try synthetic data generation with your own prompt to see if it helps your work.
Conclusion
LLMs have changed how we work, and for the better. Supporting a data science project is one of the many things these models can do. In this article, we explored how to incorporate LLMs into your project, including:
- Data Exploration
- Feature Engineering
- Model Building
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.