Pandas is one of the essential libraries in data science projects. It simplifies tasks such as data loading, cleaning, grouping, and analyzing, allowing you to perform fast, flexible analyses on data. Here’s a guide to exploring the 10 most commonly used and fundamental features of the Pandas library in data science!
We start with the primary data structures of Pandas: DataFrame and Series. A DataFrame represents a table of rows and columns, while a Series can be thought of as a single column in that table.
import pandas as pd
data = {
'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', 'Zeynep'],
'city': ['Istanbul', 'Ankara', 'Istanbul', 'Izmir', 'Ankara'],
'age': [25, 30, 22, 40, 30],
'spending': [200, 300, 250, 400, 320],
'cash_in_hand': [100, 150, 120, 180, 200]
}
df = pd.DataFrame(data)
print(df)
Output:
name city age spending cash_in_hand
0 Ali Istanbul 25 200 100
1 Ayşe Ankara 30 300 150
2 Fatma Istanbul 22 250 120
3 Ahmet Izmir 40 400 180
4 Zeynep Ankara 30 320 200
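To see the Series side of this pair, here is a quick sketch: selecting a single column from the DataFrame above returns a Series.
ages = df['age']   # selecting one column yields a Series
print(type(ages))  # <class 'pandas.core.series.Series'>
print(ages)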
We can quickly load data files in various formats with Pandas. .read_csv() and .read_excel() are especially practical for working with common formats like .csv. With the head() function, you can view the first 5 rows, or specify the number of rows you want to see inside head().
Note: To write data, you can use .to_csv() and .to_excel().
df = pd.read_csv('path_to_csv') # Used for reading CSV files.
print(df.head())
Output (first 5 rows):
name city age spending cash_in_hand
0 Ali Istanbul 25 200 100
1 Ayşe Ankara 30 300 150
2 Fatma Istanbul 22 250 120
3 Ahmet Izmir 40 400 180
4 Zeynep Ankara 30 320 200
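As a brief sketch of the writing side (the file names here are placeholders of my own):
# Write the DataFrame back to disk; index=False leaves out the row index column
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)  # .to_excel() needs an engine such as openpyxl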
It’s important to get a quick overview when working with a new dataset. You can obtain general information about the data using functions like info() and describe().
df.info()  # info() prints its report directly; print(df.info()) would also show a trailing None
print(df.describe())
Output (info() function):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5 non-null object
1 city 5 non-null object
2 age 5 non-null int64
3 spending 5 non-null int64
4 cash_in_hand 5 non-null int64
dtypes: int64(3), object(2)
memory usage: 328.0+ bytes
Output (describe() function):
age spending cash_in_hand
count 5.000000 5.000000 5.000000
mean 29.400000 294.000000 150.000000
std 6.841053 75.365775 41.231056
min 22.000000 200.000000 100.000000
25% 25.000000 250.000000 120.000000
50% 30.000000 300.000000 150.000000
75% 30.000000 320.000000 180.000000
max 40.000000 400.000000 200.000000
Filtering data based on specific conditions and selecting specific columns helps focus on certain aspects of the data during analysis.
age_filter = df[df['age'] > 25]
print(age_filter)
Output:
name city age spending cash_in_hand
1 Ayşe Ankara 30 300 150
3 Ahmet Izmir 40 400 180
4 Zeynep Ankara 30 320 200
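Column selection and combined conditions work the same way; here is a short sketch on the same DataFrame:
# Select specific columns
subset = df[['name', 'spending']]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
ankara_over_25 = df[(df['age'] > 25) & (df['city'] == 'Ankara')]
print(ankara_over_25)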
Cleaning the data before starting analysis is essential. You can remove missing or incorrect data and transform it into appropriate formats.
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', None],
'age': [25, 30, None, 40, 30],
'spending': [200, None, 250, 400, 320]
})
# Displaying data
print("Original DataFrame:")
print(df)
# Dropping rows with missing data
df_clean = df.dropna()
# Result
print("\nAfter Dropping Missing Values:")
print(df_clean)
Original DataFrame:
name age spending
0 Ali 25.0 200.0
1 Ayşe 30.0 NaN
2 Fatma NaN 250.0
3 Ahmet 40.0 400.0
4 None 30.0 320.0
After Dropping Missing Values:
name age spending
0 Ali 25.0 200.0
3 Ahmet 40.0 400.0
Note: axis=0 (default) removes rows with missing data; axis=1 removes columns with missing data.
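Dropping is not the only option: if you would rather keep the rows, fillna() replaces missing values instead of removing them. A minimal sketch on the same DataFrame (the fill values are my own choice):
# Fill numeric gaps with the column mean and missing names with a placeholder
df_filled = df.fillna({
    'name': 'Unknown',
    'age': df['age'].mean(),
    'spending': df['spending'].mean()
})
print(df_filled)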
5.1: Dropping Missing Values in Specific Columns
If you want to drop rows with missing values only in specific columns, you can use the subset parameter. In the example below, rows with missing values in the age and spending columns are removed, while the row with a missing name is retained.
df_clean = df.dropna(subset=['age', 'spending'])
print("\nMissing Values Dropped from Age and Spending Columns:")
print(df_clean)
Output:
name age spending
0 Ali 25.0 200.0
3 Ahmet 40.0 400.0
4 None 30.0 320.0
5.2: Conditional Deletion with More Detail
thresh=n: Each row must have at least n valid (non-NaN) values to remain. If a row has fewer than n valid values, it is deleted. For example:
- thresh=2: A row will be deleted if it does not contain at least 2 valid values.
- thresh=3: A row will be deleted if it does not contain at least 3 valid values.
The how parameter defines the approach to handling missing values and the conditions under which rows will be deleted:
- how='any': Deletes any row containing at least one NaN value.
- how='all': Deletes rows where all values are NaN.
These parameters help control how you manage missing data during the cleaning process.
Example:
data = {
'Name': ['Ali', 'Ahmet', 'Ayşe', 'Zeynep', 'Can'],
'Age': [20, 22, None, 24, None],
'Grade': [85, None, 88, None, 92],
'City': [None, 'Ankara', 'Izmir', None, 'Istanbul']
}
df = pd.DataFrame(data)

# thresh=3: a row must have at least 3 valid values to be kept
df_clean_thresh = df.dropna(thresh=3)
print(df_clean_thresh)
Output:
Name Age Grade City
0 Ali 20.0 85.0 None
1 Ahmet 22.0 NaN Ankara
2 Ayşe NaN 88.0 Izmir
4 Can NaN 92.0 Istanbul
In this case, we set thresh=3, meaning each row must contain at least 3 valid values. Row 3 (Zeynep), which has only 2 valid values, was removed.
# how='any': Deletes rows containing any NaN values
df_clean_any = df.dropna(how='any')
print(df_clean_any)
Output: All rows were deleted since each contains at least one NaN value.
# how='all': Deletes rows where all values are NaN
df_clean_all = df.dropna(how='all')
print(df_clean_all)
Output:
Name Age Grade City
0 Ali 20.0 85.0 None
1 Ahmet 22.0 NaN Ankara
2 Ayşe NaN 88.0 Izmir
3 Zeynep 24.0 NaN None
4 Can NaN 92.0 Istanbul
Since each row has at least one valid (non-NaN) value, none were deleted.
Grouping data by specific categories and performing aggregation operations on each group is a commonly used method. The groupby() function is one of the most powerful and useful functions in the pandas library. This function allows you to group your data based on a certain criterion. For example, you can group data by age or by gender. After performing a grouping operation, you can make various calculations on these groups (e.g., average, sum, count, etc.).
To explain with a basic example: let’s assume that in a class, we have students’ ages and grades, and we want to group these students by age and calculate the average grade for each age group. This is where groupby() comes into play. However, if we perform the grouping as shown below, we will encounter an error:
data = {
'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', 'Zeynep'],
'city': ['Istanbul', 'Ankara', 'Istanbul', 'Izmir', 'Ankara'],
'age': [25, 30, 22, 40, 30],
'spending': [200, 300, 250, 400, 320],
'cash_in_hand': [100, 150, 120, 180, 200]
}
df = pd.DataFrame(data)

grouped_df = df.groupby('age').mean()
print(grouped_df)
The reason for this error is that when we try to apply the mean() function to the groupby() result, there are some non-numeric (string, object-type) columns, such as ‘name’ and ‘city’, that cannot be averaged. The mean() function can only be applied to numeric data. A few solutions exist for this, such as calculating the average for a specific column:
grouped_df = df.groupby('age')['spending'].mean()
Alternatively, we can select only the numeric columns either manually or automatically:
# Selecting numeric columns manually
numeric_df = df[['age', 'spending', 'cash_in_hand']]

# Automatically selecting numeric columns
numeric_df = df.select_dtypes(include=['number'])
# Getting averages by group
grouped_df = numeric_df.groupby(df['age']).mean()
Note: select_dtypes(include=['number']) selects all numeric columns in the DataFrame. The ‘number’ type includes all numeric data types (e.g., int64, float64).
Output:
age spending cash_in_hand
age
22 22.0 250.0 120.0
25 25.0 200.0 100.0
30 30.0 310.0 175.0
40 40.0 400.0 180.0
Note: Don’t limit yourself to mean() when using groupby(). Other aggregations can be applied as well, such as sum (sum()), count (count()), maximum (max()), etc.
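For instance, a quick sketch of a few of these aggregations on the same DataFrame:
# Total spending per age group
print(df.groupby('age')['spending'].sum())

# Number of people per city
print(df.groupby('city')['name'].count())

# Highest cash_in_hand per city
print(df.groupby('city')['cash_in_hand'].max())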
If you want to perform multiple operations at once, you need to use the agg() function. agg() is short for “aggregate” (which means “to collect or combine”), and I believe it gets this name because it allows you to apply multiple functions together. In pandas, aggregate functions are typically summary operations performed on a group. These functions allow you to perform various calculations on grouped data. The agg() function enables you to apply multiple aggregate functions at once.
# Grouping by age and calculating both the mean and the sum of the numeric columns
grouped_df = numeric_df.groupby('age').agg(['mean', 'sum'])
print(grouped_df)
Output:
spending cash_in_hand
mean sum mean sum
age
22 250.0 250 120.0 120
25 200.0 200 100.0 100
30 310.0 620 175.0 350
40 400.0 400 180.0 180
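If you need different functions for different columns, agg() also accepts a dictionary mapping column names to functions; a small sketch with the columns above:
# Mean spending but total cash_in_hand per age group
grouped_df = df.groupby('age').agg({'spending': 'mean', 'cash_in_hand': 'sum'})
print(grouped_df)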
Grouping by multiple variables provides deeper insights. For example, you could calculate the average spending in different age groups for each city. However, if you try to do this as shown below, you’ll encounter an error:
grouped_df = df.groupby(['city', 'age']).mean()
print(grouped_df)
The reason for this is the same as we mentioned earlier: the ‘city’ column is not a numeric value. So, does this mean we can’t group by categorical values? Of course not! The solution is as follows:
grouped_df = df.groupby(['city', 'age']).mean(numeric_only=True)
print(grouped_df)
The numeric_only=True parameter allows you to calculate the mean only for numeric columns.
Output:
spending cash_in_hand
city age
Ankara 30 310.0 175.0
Istanbul 22 250.0 120.0
25 200.0 100.0
Izmir 40 400.0 180.0
Grouping data into specific ranges is particularly useful for numerical data. You can define intervals using pd.cut() and perform grouping accordingly. For example, to group ages into intervals of 5 years:
intervals = list(range(20, 55, 5))
df['age_range'] = pd.cut(df['age'], bins=intervals)
# Or let pandas split the observed age range into 6 equal-width bins
df['age_range'] = pd.cut(df['age'], bins=6)
Then, you can group by both numerical and categorical columns and calculate averages:
# Select only numeric and categorical columns
numeric_df = df.select_dtypes(include=['number', 'category'])

# Group by age ranges and calculate the average
interval_group = numeric_df.groupby('age_range').mean()
print(interval_group)
Note: The bins parameter is used to split the numerical data into specific ranges. The pd.cut() function uses these ranges to convert continuous data into categorical data. The output below comes from the bins=6 version, which divides the observed age range into six equal-width intervals.
Output:
age spending cash_in_hand
age_range
(21.982, 25.0] 23.5 225.0 110.0
(25.0, 28.0] NaN NaN NaN
(28.0, 31.0] 30.0 310.0 175.0
(31.0, 34.0] NaN NaN NaN
(34.0, 37.0] NaN NaN NaN
(37.0, 40.0] 40.0 400.0 180.0
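If interval notation such as (25.0, 28.0] is hard to read, pd.cut() also accepts a labels parameter; a small sketch (the bin edges and label names here are my own choice):
# Two custom bins with readable labels
df['age_range'] = pd.cut(df['age'], bins=[20, 30, 40], labels=['20-30', '31-40'])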
You can perform richer analyses by merging different datasets. The merge() function allows you to combine data based on common columns. This function works similarly to the SQL JOIN operation and merges datasets based on a common key or column. In other words, it combines two DataFrames by matching data in one or more columns. By default, it works with how='inner'.
import pandas as pd

# Left DataFrame
df1 = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Ali', 'Ayşe', 'Mehmet']
})
# Right DataFrame
df2 = pd.DataFrame({
'id': [2, 3, 4],
'city': ['Istanbul', 'Ankara', 'Bursa']
})
# Inner join, based on the common 'id' column
result = pd.merge(df1, df2, how='inner', on='id')
print(result)
Output:
id name city
0 2 Ayşe Istanbul
1 3 Mehmet Ankara
A Closer Look
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))
Parameters:
- left: The first DataFrame.
- right: The second DataFrame.
- how: Specifies the type of merge. The default is 'inner'. Different types of merges can be performed.
- on: The name of the common column. This is the column that exists in both the left and right DataFrame.
- left_on: The column name to merge on from the left DataFrame.
- right_on: The column name to merge on from the right DataFrame.
- left_index: If set to True, the left DataFrame’s index will be used for the merge.
- right_index: If set to True, the right DataFrame’s index will be used for the merge.
- suffixes: Suffixes to be added to overlapping column names. By default, ('_x', '_y') is used.
The how Parameter
The how parameter specifies the type of merge to be performed. This is similar to SQL JOIN types. The possible values for how are listed below; a short sketch follows the list.
- ‘inner’ (default): Only the common rows from both DataFrames are kept. This means that rows with matching key values from both DataFrames are merged. It corresponds to SQL’s INNER JOIN.
- ‘outer’: All rows from both DataFrames are included, with missing values filled as NaN. It corresponds to SQL’s FULL OUTER JOIN.
- ‘left’: All rows from the left DataFrame are kept, and only matching rows from the right DataFrame are included. For rows that don’t match in the right DataFrame, NaN values are used. This corresponds to SQL’s LEFT JOIN.
- ‘right’: All rows from the right DataFrame are kept, and only matching rows from the left DataFrame are included. For rows that don’t match in the left DataFrame, NaN values are used. This corresponds to SQL’s RIGHT JOIN.
The on, left_on, and right_on Parameters
These parameters define which columns are used to perform the merge (see the sketch after this list):
- on: If the columns to be merged on have the same name in both DataFrames, you can use the on parameter to specify the common column.
- left_on and right_on: If the columns to be merged on have different names in each DataFrame, you can use left_on and right_on to specify the column names in the left and right DataFrame respectively.
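As a minimal sketch, assuming the key is called customer_id in one DataFrame and id in the other (both frames here are illustrative):
left = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ali', 'Ayşe']})
right = pd.DataFrame({'id': [1, 2], 'city': ['Istanbul', 'Ankara']})

# Match customer_id on the left to id on the right
result = pd.merge(left, right, left_on='customer_id', right_on='id')
print(result)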
The left_index and right_index Parameters
If the merge operation is based on the index rather than columns, you can set left_index=True and/or right_index=True, as in the sketch below.
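A brief sketch of an index-based merge (the DataFrames are illustrative):
df_a = pd.DataFrame({'name': ['Ali', 'Ayşe']}, index=[1, 2])
df_b = pd.DataFrame({'city': ['Istanbul', 'Ankara']}, index=[2, 3])

# Only index 2 appears in both, so the default how='inner' keeps one row
result = pd.merge(df_a, df_b, left_index=True, right_index=True)
print(result)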
10. Time Series Analysis
Pandas makes it easy to analyze time series data. You can convert date information using pd.to_datetime() and use it as an index for time-based analysis.
import pandas as pd

# Let's create a DataFrame with some example data
data = {
'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-02-01', '2024-05-01'],
'value': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])
# Set the 'date' column as the index
df.set_index('date', inplace=True)
# Select a specific date range in 2024 (label-based slicing via .loc)
specific_date_range = df.loc['2024-01-01':'2024-02-29']
# Print the results
print(specific_date_range)
Note: Remember that the dates must be valid, so you can’t search for non-existent dates like February 30th or the 13th month!
Output:
value
date
2024-01-01 100
2024-01-02 200
2024-01-03 300
2024-02-01 400
With time series analysis, you can easily explore trends and seasonal changes.
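As a pointer for such explorations, a datetime index also enables resample(); a minimal sketch on the DataFrame above, computing monthly averages:
# 'MS' groups by month (month start); months without data show NaN
monthly_avg = df.resample('MS').mean()
print(monthly_avg)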