Pandas is one of the essential libraries in data science projects. It simplifies tasks such as data loading, cleaning, grouping, and analyzing, allowing you to perform fast, flexible analyses on data. Here’s a guide to exploring the 10 most commonly used and fundamental features of the Pandas library in data science!
We start with the primary data structures of Pandas: DataFrame and Series. A DataFrame represents a table of rows and columns, while a Series can be thought of as a single column in that table.
import pandas as pd
data = {
'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', 'Zeynep'],
'city': ['Istanbul', 'Ankara', 'Istanbul', 'Izmir', 'Ankara'],
'age': [25, 30, 22, 40, 30],
'spending': [200, 300, 250, 400, 320],
'cash_in_hand': [100, 150, 120, 180, 200]
}
df = pd.DataFrame(data)
print(df)
Output:
name city age spending cash_in_hand
0 Ali Istanbul 25 200 100
1 Ayşe Ankara 30 300 150
2 Fatma Istanbul 22 250 120
3 Ahmet Izmir 40 400 180
4 Zeynep Ankara 30 320 200
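To see the Series side of this pair, here is a quick sketch: selecting a single column from the DataFrame above returns a Series.
ages = df['age']   # selecting one column yields a Series
print(type(ages))  # <class 'pandas.core.series.Series'>
print(ages)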
We can quickly load data files in various formats with Pandas. .read_csv() and .read_excel() are especially practical for working with common formats like .csv. With the head() function, you can view the first 5 rows, or specify the number of rows you want to see inside head().
Note: To write data, you can use .to_csv() and .to_excel().
df = pd.read_csv('path_to_csv') # Used for reading CSV files.
print(df.head())
Output (first 5 rows):
name city age spending cash_in_hand
0 Ali Istanbul 25 200 100
1 Ayşe Ankara 30 300 150
2 Fatma Istanbul 22 250 120
3 Ahmet Izmir 40 400 180
4 Zeynep Ankara 30 320 200
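As a brief sketch of the writing side (the file names here are placeholders of my own):
# Write the DataFrame back to disk; index=False leaves out the row index column
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)  # .to_excel() needs an engine such as openpyxl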
It’s important to get a quick overview when working with a new dataset. You can obtain general information about the data using functions like info() and describe().
df.info()  # info() prints its report directly; print(df.info()) would also show a trailing None
print(df.describe())
Output (info() function):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5 non-null object
1 city 5 non-null object
2 age 5 non-null int64
3 spending 5 non-null int64
4 cash_in_hand 5 non-null int64
dtypes: int64(3), object(2)
memory usage: 328.0+ bytes
Output (describe() function):
age spending cash_in_hand
count 5.000000 5.000000 5.000000
mean 29.400000 294.000000 150.000000
std 6.841053 75.365775 41.231056
min 22.000000 200.000000 100.000000
25% 25.000000 250.000000 120.000000
50% 30.000000 300.000000 150.000000
75% 30.000000 320.000000 180.000000
max 40.000000 400.000000 200.000000
Filtering data based on specific conditions and selecting specific columns helps focus on certain aspects of the data during analysis.
age_filter = df[df['age'] > 25]
print(age_filter)
Output:
name city age spending cash_in_hand
1 Ayşe Ankara 30 300 150
3 Ahmet Izmir 40 400 180
4 Zeynep Ankara 30 320 200
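Column selection and combined conditions work the same way; here is a short sketch on the same DataFrame:
# Select specific columns
subset = df[['name', 'spending']]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
ankara_over_25 = df[(df['age'] > 25) & (df['city'] == 'Ankara')]
print(ankara_over_25)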
Cleaning the data before starting analysis is essential. You can remove missing or incorrect data and transform it into appropriate formats.
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', None],
'age': [25, 30, None, 40, 30],
'spending': [200, None, 250, 400, 320]
})
# Displaying data
print("Original DataFrame:")
print(df)
# Dropping rows with missing data
df_clean = df.dropna()
# Result
print("\nAfter Dropping Missing Values:")
print(df_clean)
Original DataFrame:
name age spending
0 Ali 25.0 200.0
1 Ayşe 30.0 NaN
2 Fatma NaN 250.0
3 Ahmet 40.0 400.0
4 None 30.0 320.0
After Dropping Missing Values:
name age spending
0 Ali 25.0 200.0
3 Ahmet 40.0 400.0
Note: axis=0 (default) removes rows with missing data; axis=1 removes columns with missing data.
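Dropping is not the only option: if you would rather keep the rows, fillna() replaces missing values instead of removing them. A minimal sketch on the same DataFrame (the fill values are my own choice):
# Fill numeric gaps with the column mean and missing names with a placeholder
df_filled = df.fillna({
    'name': 'Unknown',
    'age': df['age'].mean(),
    'spending': df['spending'].mean()
})
print(df_filled)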
5.1: Dropping Missing Values in Specific Columns
If you want to drop rows with missing values only in specific columns, you can use the subset parameter. In the example below, rows with missing values in the age and spending columns are removed, while the row with a missing name is retained.
df_clean = df.dropna(subset=['age', 'spending'])
print("\nMissing Values Dropped from Age and Spending Columns:")
print(df_clean)
Output:
name age spending
0 Ali 25.0 200.0
3 Ahmet 40.0 400.0
4 None 30.0 320.0
5.2: Conditional Deletion with More Detail
thresh=n: Each row must have at least n valid (non-NaN) values to remain. If a row has fewer than n valid values, it is deleted. For example:
- thresh=2: A row will be deleted if it does not contain at least 2 valid values.
- thresh=3: A row will be deleted if it does not contain at least 3 valid values.
The how parameter defines the approach to handling missing values and the conditions under which rows will be deleted:
- how='any': Deletes any row containing at least one NaN value.
- how='all': Deletes rows where all values are NaN.
These parameters help control how you manage missing data during the cleaning process.
Example:
data = {
'Name': ['Ali', 'Ahmet', 'Ayşe', 'Zeynep', 'Can'],
'Age': [20, 22, None, 24, None],
'Grade': [85, None, 88, None, 92],
'City': [None, 'Ankara', 'Izmir', None, 'Istanbul']
}
df = pd.DataFrame(data)

# thresh=3: a row must have at least 3 valid values to be kept
df_clean_thresh = df.dropna(thresh=3)
print(df_clean_thresh)
Output:
Name Age Grade City
0 Ali 20.0 85.0 None
1 Ahmet 22.0 NaN Ankara
2 Ayşe NaN 88.0 Izmir
4 Can NaN 92.0 Istanbul
In this case, we set thresh=3, meaning each row must contain at least 3 valid values. Row 3 (Zeynep), which has only 2 valid values, was removed.
# how='any': Deletes rows containing any NaN values
df_clean_any = df.dropna(how='any')
print(df_clean_any)
Output: All rows were deleted since each contains at least one NaN value.
# how='all': Deletes rows where all values are NaN
df_clean_all = df.dropna(how='all')
print(df_clean_all)
Output:
Name Age Grade City
0 Ali 20.0 85.0 None
1 Ahmet 22.0 NaN Ankara
2 Ayşe NaN 88.0 Izmir
3 Zeynep 24.0 NaN None
4 Can NaN 92.0 Istanbul
Since each row has at least one valid (non-NaN) value, none were deleted.
Grouping data by specific categories and performing aggregation operations on each group is a commonly used method. The groupby() function is one of the most powerful and useful functions in the pandas library. This function allows you to group your data based on a certain criterion. For example, you can group data by age or by gender. After performing a grouping operation, you can make various calculations on these groups (e.g., average, sum, count, etc.).
To explain with a basic example: let’s assume that in a class, we have students’ ages and grades, and we want to group these students by age and calculate the average grade for each age group. This is where groupby() comes into play. However, if we perform the grouping as shown below, we will encounter an error:
data = {
'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', 'Zeynep'],
'city': ['Istanbul', 'Ankara', 'Istanbul', 'Izmir', 'Ankara'],
'age': [25, 30, 22, 40, 30],
'spending': [200, 300, 250, 400, 320],
'cash_in_hand': [100, 150, 120, 180, 200]
}
df = pd.DataFrame(data)

grouped_df = df.groupby('age').mean()
print(grouped_df)
The reason for this error is that when we try to apply the mean() function to the groupby() result, there are some non-numeric (string, object-type) columns, such as ‘name’ and ‘city’, that cannot be averaged. The mean() function can only be applied to numeric data. A few solutions exist for this, such as calculating the average for a specific column:
grouped_df = df.groupby('age')['spending'].mean()
Alternatively, we can select only the numeric columns either manually or automatically:
# Selecting numeric columns manually
numeric_df = df[['age', 'spending', 'cash_in_hand']]

# Automatically selecting numeric columns
numeric_df = df.select_dtypes(include=['number'])
# Getting averages by group
grouped_df = numeric_df.groupby(df['age']).mean()
Note: select_dtypes(include=['number']) selects all numeric columns in the DataFrame. The ‘number’ type includes all numeric data types (e.g., int64, float64).
Output:
age spending cash_in_hand
age
22 22.0 250.0 120.0
25 25.0 200.0 100.0
30 30.0 310.0 175.0
40 40.0 400.0 180.0
Note: Don’t limit yourself to mean() when using groupby(). Other aggregations can be applied as well, such as sum (sum()), count (count()), maximum (max()), etc.
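For instance, a quick sketch of a few of these aggregations on the same DataFrame:
# Total spending per age group
print(df.groupby('age')['spending'].sum())

# Number of people per city
print(df.groupby('city')['name'].count())

# Highest cash_in_hand per city
print(df.groupby('city')['cash_in_hand'].max())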
If you want to perform multiple operations at once, you need to use the agg() function. agg() is short for “aggregate” (which means “to collect or combine”), and I believe it gets this name because it allows you to apply multiple functions together. In pandas, aggregate functions are typically summary operations performed on a group. These functions allow you to perform various calculations on grouped data. The agg() function enables you to apply multiple aggregate functions at once.
# Grouping by age and calculating both the mean and the sum of the numeric columns
grouped_df = numeric_df.groupby('age').agg(['mean', 'sum'])
print(grouped_df)
Output:
spending cash_in_hand
mean sum mean sum
age
22 250.0 250 120.0 120
25 200.0 200 100.0 100
30 310.0 620 175.0 350
40 400.0 400 180.0 180
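If you need different functions for different columns, agg() also accepts a dictionary mapping column names to functions; a small sketch with the columns above:
# Mean spending but total cash_in_hand per age group
grouped_df = df.groupby('age').agg({'spending': 'mean', 'cash_in_hand': 'sum'})
print(grouped_df)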
Grouping by multiple variables provides deeper insights. For example, you could calculate the average spending in different age groups for each city. However, if you try to do this as shown below, you’ll encounter an error:
grouped_df = df.groupby(['city', 'age']).mean()
print(grouped_df)
The reason for this is the same as we mentioned earlier: the ‘city’ column is not a numeric value. So, does this mean we can’t group by categorical values? Of course not! The solution is as follows:
grouped_df = df.groupby(['city', 'age']).mean(numeric_only=True)
print(grouped_df)
The numeric_only=True parameter allows you to calculate the mean only for numeric columns.
Output:
spending cash_in_hand
city age
Ankara 30 310.0 175.0
Istanbul 22 250.0 120.0
25 200.0 100.0
Izmir 40 400.0 180.0
Grouping data into specific ranges is particularly useful for numerical data. You can define intervals using pd.cut() and perform grouping accordingly. For example, to group ages into intervals of 5 years:
intervals = list(range(20, 55, 5))
df['age_range'] = pd.cut(df['age'], bins=intervals)
# Or let pandas split the observed age range into 6 equal-width bins
df['age_range'] = pd.cut(df['age'], bins=6)
Then, you can group by both numerical and categorical columns and calculate averages:
# Select only numeric and categorical columns
numeric_df = df.select_dtypes(include=['number', 'category'])

# Group by age ranges and calculate the average
interval_group = numeric_df.groupby('age_range').mean()
print(interval_group)
Note: The bins parameter is used to split the numerical data into specific ranges. The pd.cut() function uses these ranges to convert continuous data into categorical data. The output below comes from the bins=6 version, which divides the observed age range into six equal-width intervals.
Output:
age spending cash_in_hand
age_range
(21.982, 25.0] 23.5 225.0 110.0
(25.0, 28.0] NaN NaN NaN
(28.0, 31.0] 30.0 310.0 175.0
(31.0, 34.0] NaN NaN NaN
(34.0, 37.0] NaN NaN NaN
(37.0, 40.0] 40.0 400.0 180.0
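If interval notation such as (25.0, 28.0] is hard to read, pd.cut() also accepts a labels parameter; a small sketch (the bin edges and label names here are my own choice):
# Two custom bins with readable labels
df['age_range'] = pd.cut(df['age'], bins=[20, 30, 40], labels=['20-30', '31-40'])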
You can perform richer analyses by merging different datasets. The merge() function allows you to combine data based on common columns. This function works similarly to the SQL JOIN operation and merges datasets based on a common key or column. In other words, it combines two DataFrames by matching data in one or more columns. By default, it works with how='inner'.
import pandas as pd

# Left DataFrame
df1 = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Ali', 'Ayşe', 'Mehmet']
})
# Right DataFrame
df2 = pd.DataFrame({
'id': [2, 3, 4],
'city': ['Istanbul', 'Ankara', 'Bursa']
})
# Inner join, based on the common 'id' column
result = pd.merge(df1, df2, how='inner', on='id')
print(result)
Output:
id name city
0 2 Ayşe Istanbul
1 3 Mehmet Ankara
A Closer Look
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))
Parameters:
- left: The first DataFrame.
- right: The second DataFrame.
- how: Specifies the type of merge. The default is 'inner'. Different types of merges can be performed.
- on: The name of the common column. This is the column that exists in both the left and right DataFrame.
- left_on: The column name to merge on from the left DataFrame.
- right_on: The column name to merge on from the right DataFrame.
- left_index: If set to True, the left DataFrame’s index will be used for the merge.
- right_index: If set to True, the right DataFrame’s index will be used for the merge.
- suffixes: Suffixes to be added to overlapping column names. By default, ('_x', '_y') is used.
The how Parameter
The how parameter specifies the type of merge to be performed. This is similar to SQL JOIN types. The possible values for how are listed below; a short sketch follows the list.
- ‘inner’ (default): Only the common rows from both DataFrames are kept. This means that rows with matching key values from both DataFrames are merged. It corresponds to SQL’s INNER JOIN.
- ‘outer’: All rows from both DataFrames are included, with missing values filled as NaN. It corresponds to SQL’s FULL OUTER JOIN.
- ‘left’: All rows from the left DataFrame are kept, and only matching rows from the right DataFrame are included. For rows that don’t match in the right DataFrame, NaN values are used. This corresponds to SQL’s LEFT JOIN.
- ‘right’: All rows from the right DataFrame are kept, and only matching rows from the left DataFrame are included. For rows that don’t match in the left DataFrame, NaN values are used. This corresponds to SQL’s RIGHT JOIN.
The on, left_on, and right_on Parameters
These parameters define which columns are used to perform the merge (see the sketch after this list):
- on: If the columns to be merged on have the same name in both DataFrames, you can use the on parameter to specify the common column.
- left_on and right_on: If the columns to be merged on have different names in each DataFrame, you can use left_on and right_on to specify the column names in the left and right DataFrame respectively.
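As a minimal sketch, assuming the key is called customer_id in one DataFrame and id in the other (both frames here are illustrative):
left = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ali', 'Ayşe']})
right = pd.DataFrame({'id': [1, 2], 'city': ['Istanbul', 'Ankara']})

# Match customer_id on the left to id on the right
result = pd.merge(left, right, left_on='customer_id', right_on='id')
print(result)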
The left_index and right_index Parameters
If the merge operation is based on the index rather than columns, you can set left_index=True and/or right_index=True, as in the sketch below.
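A brief sketch of an index-based merge (the DataFrames are illustrative):
df_a = pd.DataFrame({'name': ['Ali', 'Ayşe']}, index=[1, 2])
df_b = pd.DataFrame({'city': ['Istanbul', 'Ankara']}, index=[2, 3])

# Only index 2 appears in both, so the default how='inner' keeps one row
result = pd.merge(df_a, df_b, left_index=True, right_index=True)
print(result)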
10. Time Series Analysis
Pandas makes it easy to analyze time series data. You can convert date information using pd.to_datetime() and use it as an index for time-based analysis.
import pandas as pd

# Let's create a DataFrame with some example data
data = {
'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-02-01', '2024-05-01'],
'value': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])
# Set the 'date' column as the index
df.set_index('date', inplace=True)
# Select a specific date range in 2024 (label-based slicing via .loc)
specific_date_range = df.loc['2024-01-01':'2024-02-29']
# Print the results
print(specific_date_range)
Note: Remember that the dates must be valid, so you can’t search for non-existent dates like February 30th or the 13th month!
Output:
value
date
2024-01-01 100
2024-01-02 200
2024-01-03 300
2024-02-01 400
With time series analysis, you can easily explore trends and seasonal changes.
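As a pointer for such explorations, a datetime index also enables resample(); a minimal sketch on the DataFrame above, computing monthly averages:
# 'MS' groups by month (month start); months without data show NaN
monthly_avg = df.resample('MS').mean()
print(monthly_avg)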