Pandas Library: 10 Essential Features for Starting with Data Science

by NasuhcaN · November 2024


Pandas is one of the essential libraries in data science projects. It simplifies tasks such as data loading, cleaning, grouping, and analyzing, allowing you to perform fast, flexible analyses on data. Here’s a guide to exploring the 10 most commonly used and fundamental features of the Pandas library in data science!

We start with the primary data structures of Pandas: DataFrame and Series. A DataFrame represents a table of rows and columns, while a Series can be thought of as a single column in that table.

import pandas as pd 
data = {
    'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', 'Zeynep'],
    'city': ['Istanbul', 'Ankara', 'Istanbul', 'Izmir', 'Ankara'],
    'age': [25, 30, 22, 40, 30],
    'spending': [200, 300, 250, 400, 320],
    'cash_in_hand': [100, 150, 120, 180, 200]
}
df = pd.DataFrame(data)
print(df)

Output:

     name      city  age  spending  cash_in_hand
0     Ali  Istanbul   25       200           100
1    Ayşe    Ankara   30       300           150
2   Fatma  Istanbul   22       250           120
3   Ahmet     Izmir   40       400           180
4  Zeynep    Ankara   30       320           200
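
Selecting a single column from a DataFrame gives you a Series. A minimal sketch, using the df created above:

# A single column of the DataFrame is a Series
ages = df['age']
print(type(ages))  # <class 'pandas.core.series.Series'>
print(ages)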

We can quickly load data files in various formats with Pandas. .read_csv() and .read_excel() are especially practical for common formats like .csv and .xlsx. The head() function shows the first 5 rows by default; pass a number to head() to view a different number of rows.

Note: To write data, you can use .to_csv() and .to_excel().

df = pd.read_csv('path_to_csv') # Used for reading CSV files.
print(df.head())

Output (first 5 rows):

     name      city  age  spending  cash_in_hand
0     Ali  Istanbul   25       200           100
1    Ayşe    Ankara   30       300           150
2   Fatma  Istanbul   22       250           120
3   Ahmet     Izmir   40       400           180
4  Zeynep    Ankara   30       320           200
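
And as the note above mentions, writing data back to disk works the same way. A small sketch with placeholder file names:

df.to_csv('output.csv', index=False)     # index=False skips writing the row index
df.to_excel('output.xlsx', index=False)  # writing .xlsx typically requires the openpyxl package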

It’s important to get a quick overview when working with a new dataset. You can obtain general information about the data using functions like info() and describe().

print(df.info())
print(df.describe())

Output (info() function):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   name          5 non-null      object
 1   city          5 non-null      object
 2   age           5 non-null      int64
 3   spending      5 non-null      int64
 4   cash_in_hand  5 non-null      int64
dtypes: int64(3), object(2)
memory usage: 328.0+ bytes

Output (describe() function):

             age    spending  cash_in_hand
count   5.000000    5.000000      5.000000
mean   29.400000  294.000000    150.000000
std     6.841053   75.365775     41.231056
min    22.000000  200.000000    100.000000
25%    25.000000  250.000000    120.000000
50%    30.000000  300.000000    150.000000
75%    30.000000  320.000000    180.000000
max    40.000000  400.000000    200.000000
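
By default, describe() only summarizes the numeric columns. If you also want a summary of text columns such as name and city, you can pass the include parameter; a small optional sketch:

print(df.describe(include='object'))  # summary of object (string) columns only
print(df.describe(include='all'))     # summary of every column at once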

Filtering data based on specific conditions and selecting specific columns helps focus on certain aspects of the data during analysis.

age_filter = df[df['age'] > 25]
print(age_filter)

Output:

     name    city  age  spending  cash_in_hand
1    Ayşe  Ankara   30       300           150
3   Ahmet   Izmir   40       400           180
4  Zeynep  Ankara   30       320           200
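
Selecting columns and combining conditions works in the same spirit. A short sketch (the chosen columns and conditions are just examples):

# Selecting specific columns
names_and_cities = df[['name', 'city']]

# Combining conditions: & for AND, | for OR, with each condition in parentheses
filtered = df[(df['age'] > 25) & (df['city'] == 'Ankara')]
print(filtered)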

Cleaning the data before starting analysis is essential. You can remove missing or incorrect data and transform it into appropriate formats.

import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
    'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', None],
    'age': [25, 30, None, 40, 30],
    'spending': [200, None, 250, 400, 320]
})
# Displaying data
print("Original DataFrame:")
print(df)
# Dropping rows with missing data
df_clean = df.dropna()
# Result
print("\nAfter Dropping Missing Values:")
print(df_clean)

Original DataFrame:

    name   age  spending
0    Ali  25.0     200.0
1   Ayşe  30.0       NaN
2  Fatma   NaN     250.0
3  Ahmet  40.0     400.0
4   None  30.0     320.0

After Dropping Missing Values:

    name   age  spending
0    Ali  25.0     200.0
3  Ahmet  40.0     400.0

Note: axis=0 (default) removes rows with missing data; axis=1 removes columns with missing data.
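
For instance, calling dropna() with axis=1 removes columns instead of rows. In the small DataFrame above every column has one missing entry, so all three data columns would be dropped; a minimal sketch:

# Drop every column that contains at least one missing value
df_clean_cols = df.dropna(axis=1)
print(df_clean_cols)  # here only the (empty) row index remains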

5.1: Dropping Missing Values in Specific Columns

If you want to drop rows with missing values only in specific columns, you can use the subset parameter. In the example below, rows with missing values in the age or spending columns are removed, while the row that is only missing a name is kept.

df_clean = df.dropna(subset=['age', 'spending'])
print("\nMissing Values Dropped from Age and Spending Columns:")
print(df_clean)

Output:

    name   age  spending
0    Ali  25.0     200.0
3  Ahmet  40.0     400.0
4   None  30.0     320.0

5.2: Conditional Deletion with More Detail

thresh=n: Each row must have at least n valid (non-NaN) values to remain. If a row has fewer than n valid values, it is deleted.

  • thresh=2: A row will be deleted if it does not contain at least 2 valid values.
  • thresh=3: A row will be deleted if it does not contain at least 3 valid values.

The how parameter defines the approach to handling missing values and the conditions under which rows will be deleted.

  • how='any': Deletes any row containing at least one NaN value.
  • how='all': Deletes rows where all values are NaN.

These parameters help control how you manage missing data during the cleaning process.

Example:

data = {
    'Name': ['Ali', 'Ahmet', 'Ayşe', 'Zeynep', 'Can'],
    'Age': [20, 22, None, 24, None],
    'Grade': [85, None, 88, None, 92],
    'City': [None, 'Ankara', 'Izmir', None, 'Istanbul']
}
df = pd.DataFrame(data)

# thresh=3: Row must have at least 3 valid values
df_clean_thresh = df.dropna(thresh=3)
print(df_clean_thresh)

Output:

    Name   Age  Grade      City
0    Ali  20.0   85.0      None
1  Ahmet  22.0    NaN    Ankara
2   Ayşe   NaN   88.0     Izmir
4    Can   NaN   92.0  Istanbul

In this case, we set thresh=3, meaning each row must contain at least 3 valid values. Row 3 (Zeynep), which has only 2 valid values, was removed.

# how='any': Deletes rows containing any NaN values
df_clean_any = df.dropna(how='any')
print(df_clean_any)

Output: an empty DataFrame, since every row contains at least one NaN value.

# how='all': Deletes rows where all values are NaN
df_clean_all = df.dropna(how='all')
print(df_clean_all)

Output:

     Name   Age  Grade      City
0     Ali  20.0   85.0      None
1   Ahmet  22.0    NaN    Ankara
2    Ayşe   NaN   88.0     Izmir
3  Zeynep  24.0    NaN      None
4     Can   NaN   92.0  Istanbul

Since each row has at least one valid (non-NaN) value, none were deleted.

Grouping data by specific categories and performing aggregation operations on each group is a commonly used method. The groupby() function is one of the most powerful and useful functions in the pandas library. This function allows you to group your data based on a certain criterion. For example, you can group data by age or by gender. After performing a grouping operation, you can make various calculations on these groups (e.g., average, sum, count, etc.).

To explain with a basic example:

Let’s assume that in a class, we have students’ ages and grades. We want to group these students by age and calculate the average grade for each age group. This is where groupby() comes into play. However, if we perform the grouping as shown below, we will encounter an error:

data = {
    'name': ['Ali', 'Ayşe', 'Fatma', 'Ahmet', 'Zeynep'],
    'city': ['Istanbul', 'Ankara', 'Istanbul', 'Izmir', 'Ankara'],
    'age': [25, 30, 22, 40, 30],
    'spending': [200, 300, 250, 400, 320],
    'cash_in_hand': [100, 150, 120, 180, 200]
}
df = pd.DataFrame(data)

grouped_df = df.groupby('age').mean()
print(grouped_df)

The reason for this error is that when we try to apply the mean() function to the groupby() result, there are some non-numeric (string, object-type) columns, such as ‘name’ and ‘city’, that cannot be averaged. The mean() function can only be applied to numeric data. A few solutions exist for this, such as calculating the average for a specific column:

grouped_df = df.groupby('age')['spending'].mean()

Alternatively, we can select only the numeric columns either manually or automatically:

# Selecting numeric columns
numeric_df = df[['age', 'spending', 'cash_in_hand']]

# Automatically selecting numeric columns
numeric_df = df.select_dtypes(include=['number'])

# Getting averages by group
grouped_df = numeric_df.groupby(df['age']).mean()

Note: select_dtypes(include=['number']): This function selects all numeric columns in the DataFrame. The ‘number’ type includes all numeric data types (e.g., int64, float64).

Output:

      age  spending  cash_in_hand
age
22   22.0     250.0         120.0
25   25.0     200.0         100.0
30   30.0     310.0         175.0
40   40.0     400.0         180.0

Note: Don’t limit yourself to just using the groupby() function. Other operations can be performed as well, such as sum (sum()), count (count()), maximum (max()), etc.
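
A few of these alternatives on the same data might look like this (a short sketch, not an exhaustive list):

print(df.groupby('age')['spending'].sum())       # total spending per age group
print(df.groupby('city')['name'].count())        # number of people per city
print(df.groupby('city')['cash_in_hand'].max())  # highest cash_in_hand per city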

If you want to perform multiple operations at once, you need to use the agg() function. agg() is short for “aggregate” (which means “to collect or combine”), and I believe it gets this name because it allows you to apply multiple functions together. In pandas, aggregate functions are typically summary operations performed on a group. These functions allow you to perform various calculations on grouped data. The agg() function enables you to apply multiple aggregate functions at once.

# Grouping by age and calculating both the mean and the sum of each numeric column
grouped_df = numeric_df.groupby('age').agg(['mean', 'sum'])

Output:

    spending       cash_in_hand
        mean  sum          mean  sum
age
22     250.0  250         120.0  120
25     200.0  200         100.0  100
30     310.0  620         175.0  350
40     400.0  400         180.0  180
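
agg() also accepts a dictionary when you want a different function for each column. A small sketch of that variant:

# Different aggregation per column
grouped_df = df.groupby('age').agg({'spending': 'mean', 'cash_in_hand': 'sum'})
print(grouped_df)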

Grouping by multiple variables provides deeper insights. For example, you could calculate the average spending in different age groups for each city. However, if you try to do this as shown below, you’ll encounter an error:

grouped_df = df.groupby(['city', 'age']).mean()
print(grouped_df)

The reason for this is the same as we mentioned earlier: the ‘city’ column is not a numeric value. So, does this mean we can’t group by categorical values? Of course not! The solution is as follows:

grouped_df = df.groupby(['city', 'age']).mean(numeric_only=True)
print(grouped_df)

The numeric_only=True parameter allows you to calculate the mean only for numeric columns.

Output:

              spending  cash_in_hand
city     age
Ankara   30      310.0         175.0
Istanbul 22      250.0         120.0
         25      200.0         100.0
Izmir    40      400.0         180.0

Grouping data into specific ranges is particularly useful for numerical data. You can define intervals using pd.cut() and perform grouping accordingly. For example, to group ages into intervals of 5 years:

intervals = list(range(20, 55, 5))
df['age_range'] = pd.cut(df['age'], bins=intervals)
# Or
df['age_range'] = pd.cut(df['age'], bins=6)

Then, you can group by both numerical and categorical columns and calculate averages:

# Select only numeric and categorical columns
numeric_df = df.select_dtypes(include=['number', 'category'])

# Group by age ranges and calculate the average
interval_group = numeric_df.groupby('age_range').mean()
print(interval_group)

Note: The bins parameter is used to split the numerical data into specific ranges. The pd.cut() function uses these ranges to convert continuous data into categorical data.

Output (from the bins=6 version, which is the last assignment above):

                 age  spending  cash_in_hand
age_range
(21.982, 25.0]  23.5     225.0         110.0
(25.0, 28.0]     NaN       NaN           NaN
(28.0, 31.0]    30.0     310.0         175.0
(31.0, 34.0]     NaN       NaN           NaN
(34.0, 37.0]     NaN       NaN           NaN
(37.0, 40.0]    40.0     400.0         180.0

You can perform richer analyses by merging different datasets. The merge() function allows you to combine data based on common columns. This function works similarly to the SQL JOIN operation and merges datasets based on a common key or column. In other words, it combines two DataFrames by matching data in one or more columns. By default, it works with how='inner'.

import pandas as pd

# Left DataFrame
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Ali', 'Ayşe', 'Mehmet']
})
# Right DataFrame
df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'city': ['Istanbul', 'Ankara', 'Bursa']
})
# Inner join, based on the common 'id' column
result = pd.merge(df1, df2, how='inner', on='id')
print(result)

Output:

   id    name      city
0   2    Ayşe  Istanbul
1   3  Mehmet    Ankara

A Closer Look

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))

Parameters:

  • left: The first DataFrame.
  • right: The second DataFrame.
  • how: Specifies the type of merge. The default is 'inner'. Different types of merges can be performed.
  • on: The name of the common column. This is the column that exists in both the left and right DataFrame.
  • left_on: The column name to merge from the left DataFrame.
  • right_on: The column name to merge from the right DataFrame.
  • left_index: If set to True, the left DataFrame’s index will be used for the merge.
  • right_index: If set to True, the right DataFrame’s index will be used for the merge.
  • suffixes: Suffixes to be added to overlapping column names. By default, ('_x', '_y') is used.

how Parameter

The how parameter specifies the type of merge to be performed. This is similar to SQL JOIN types. The possible values for how are listed below, and a short example follows the list:

  • ‘inner’ (default): Only the common rows from both DataFrames are kept. This means that rows with matching key values from both DataFrames are merged. It corresponds to SQL’s INNER JOIN.
  • ‘outer’: All rows from both DataFrames are included, with missing values filled as NaN. It corresponds to SQL’s FULL OUTER JOIN.
  • ‘left’: All rows from the left DataFrame are kept, and only matching rows from the right DataFrame are included. For rows that don’t match in the right DataFrame, NaN values are used. This corresponds to SQL’s LEFT JOIN.
  • ‘right’: All rows from the right DataFrame are kept, and only matching rows from the left DataFrame are included. For rows that don’t match in the left DataFrame, NaN values are used. This corresponds to SQL’s RIGHT JOIN.
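
Reusing df1 and df2 from the example above, the other join types can be tried like this (a short sketch):

left_join = pd.merge(df1, df2, how='left', on='id')    # keeps all rows of df1
outer_join = pd.merge(df1, df2, how='outer', on='id')  # keeps all rows of both
print(left_join)
print(outer_join)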

on, left_on, and right_on Parameters

These parameters define which columns are used to perform the merge:

  • on: If the columns to be merged are the same in both DataFrames, you can use the on parameter to specify the common column.
  • left_on and right_on: If the columns to be merged have different names in each DataFrame, you can use left_on and right_on to specify the column names in the left and right DataFrame respectively.

left_index and right_index Parameters

If the merge operation is based on the index rather than columns, you can set left_index=True and/or right_index=True.
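
A small sketch of these parameters; the DataFrames and the column names customer_id and client_id below are made up purely for illustration:

# Key columns with different names on each side (hypothetical example data)
customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ali', 'Ayşe']})
orders = pd.DataFrame({'client_id': [1, 1, 2], 'amount': [50, 70, 30]})
result = pd.merge(customers, orders, left_on='customer_id', right_on='client_id')

# Merging on the index instead of a column
result_idx = pd.merge(customers.set_index('customer_id'),
                      orders.set_index('client_id'),
                      left_index=True, right_index=True)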

10. Time Series Analysis

Pandas makes it easy to analyze time series data. You can convert date information using pd.to_datetime() and use it as an index for time-based analysis.

import pandas as pd

# Let's create a DataFrame with some example data
data = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-02-01', '2024-05-01'],
    'value': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])
# Set the 'date' column as the index
df.set_index('date', inplace=True)
# Select a specific date range in 2024
specific_date_range = df['2024-01-01':'2024-02-29']
# Print the results
print(specific_date_range)

Note: Remember that the dates must be valid, so you can’t search for non-existent dates like February 30th or the 13th month!

Output:

            value
date
2024-01-01    100
2024-01-02    200
2024-01-03    300
2024-02-01    400

With time series analysis, you can easily explore trends and seasonal changes.
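
For example, resampling the date-indexed data to a monthly frequency is a common first step when looking for such trends. A minimal sketch using the df above:

monthly = df.resample('MS').sum()  # 'MS' = month-start frequency; requires a DatetimeIndex
print(monthly)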
