More Robust Multivariate EDA with Statistical Testing


Regarding the goal of multivariate EDA on this dataset, we naturally want to know which factors influence car fuel efficiency. To that end, we will answer the following questions:

  1. What numerical features influence mpg performance?
  2. Do mpg profiles vary depending on origin?
  3. Do different origins result in different distributions of efficient vs. inefficient cars?
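
Throughout this section, df refers to the auto-mpg dataframe prepared earlier in the article. If you want to follow along from scratch, the sketch below rebuilds a comparable setup; it assumes seaborn's built-in mpg dataset (a copy of the same data) and uses a hypothetical median-based efficiency flag as a stand-in for the one defined earlier in the post.

import seaborn as sns

# Assumed setup: seaborn ships the classic auto-mpg data as its "mpg" dataset,
# with the same columns used below (mpg, cylinders, ..., model_year, origin)
df = sns.load_dataset("mpg").dropna().reset_index(drop=True)

# Stand-in for the article's "efficiency" flag (defined earlier in the post):
# here we label a car efficient when its mpg exceeds the overall median
df["efficiency"] = (df["mpg"] > df["mpg"].median()).map({True: "yes", False: "no"})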

Numeric-to-Numeric Relationship

For the first case of multivariate EDA, let's discuss identifying the relationship between two numerical variables. Here, it is well known that we can use a scatter plot to visually inspect any relationship that exists between the variables.

As previously stated, not all observed patterns are guaranteed to be meaningful. In the numeric-to-numeric case, we can supplement the scatter plot with a Pearson correlation test. First, we calculate the Pearson correlation coefficient for the plotted variables. Second, we determine whether the obtained coefficient is significant by computing its p-value.

The latter step is an important sanity check of whether a given correlation coefficient is large enough to be considered meaningful (i.e., whether there is a linear relationship between the plotted variables). This is especially true in the small-data regime. For example, if we only have 10 data points, the correlation coefficient must be at least 0.64 to be considered significant (ref)!
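
As a quick sanity check of that threshold, we can compute it ourselves. Under the null hypothesis of zero correlation, t = r * sqrt(n-2) / sqrt(1-r^2) follows a t-distribution with n-2 degrees of freedom, and inverting this formula gives the smallest |r| that reaches significance. A minimal sketch (critical_r is a helper name introduced here for illustration):

import numpy as np
from scipy.stats import t

# Smallest |r| that is significant at the two-sided alpha level for n points,
# derived from the t-test for a Pearson correlation coefficient
def critical_r(n, alpha=0.05):
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(n - 2 + t_crit**2)

print(critical_r(10))  # ~0.632, in line with the ~0.64 figure quoted above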

In Python, we can use the pearsonr function from the scipy library to perform this correlation test.

In the following code, we draw a scatter plot of each numerical feature against the mpg column. In each title, we print the correlation coefficient, followed by a double asterisk if the coefficient is significant (p-value < 0.05).

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# prepare variables to inspect
numeric_features = ['cylinders', 'displacement', 'horsepower',
                    'weight', 'acceleration', 'model_year']
target = 'mpg'

# Create a figure and axes
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))

# Loop through the numerical columns and draw each scatter plot
for i, col in enumerate(numeric_features):
    # Calculate the Pearson correlation coefficient and its p-value
    corr_coeff, p_val = pearsonr(df[col], df[target])

    # Scatter plot using seaborn
    sns.scatterplot(data=df, x=col, y=target, ax=axes[i//3, i%3])

    # Set title with the Pearson correlation coefficient;
    # print ** after the correlation if the coefficient is significant
    axes[i//3, i%3].set_title(f'{col} vs {target} (Corr: {corr_coeff:.2f} {"**" if p_val < 0.05 else ""})')

plt.tight_layout()
plt.show()

Numerical features vs mpg (Image by Author)

Observe that all plot titles contain a double asterisk, indicating that the correlations are significant. Thus, we can conclude the following:

  1. Cylinders, displacement, horsepower, and weight have a strong negative correlation with mpg. This means that for each of these variables, a higher value corresponds to lower fuel efficiency.
  2. Acceleration and model year have a medium positive correlation with mpg. This means that longer acceleration times (slower cars) and more recently produced cars are associated with higher fuel efficiency.

Numeric-to-Categoric Relationship

Next, we'll investigate whether mpg profiles differ depending on origin. Note that origin is a categorical variable, so we're now considering the numeric-to-categorical case.

A KDE (kernel density estimation) plot, which can be thought of as a smoothed version of a histogram, can be used to visualize the mpg distribution with a breakdown for each origin value.

In terms of statistical testing, we can use one-way ANOVA. The hypothesis we want to test is whether mean mpg differs significantly between cars of different origins.

In Python, we can use the f_oneway function from the scipy library to perform one-way ANOVA.

In the following code, we create a KDE plot of mpg with breakdowns for different origin values. Next, we run one-way ANOVA and display the p-value in the title.

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

# Create a KDE plot with hue
sns.set(style="whitegrid")
ax = sns.kdeplot(data=df, x="mpg", hue="origin", fill=True)

# Calculate one-way ANOVA p-value
p_value = f_oneway(*[df[df['origin'] == cat]['mpg'] for cat in df['origin'].unique()])[1]

# Set title with one-way ANOVA p-value
ax.set_title(f'KDE Plot mpg by origin (One-way ANOVA p-value: {p_value:.4f})')

plt.show()

KDE plot of MPG by origin (Image by Author)

The p-value in the plot above is less than 0.05, indicating significance. At a high level, we can interpret the plot as follows: in general, cars made in the USA are less fuel efficient than cars made elsewhere, since the peak of the USA mpg distribution lies to the left of the other origins' peaks.
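
Note that one-way ANOVA only tells us that at least one origin's mean mpg differs from the others; it does not tell us which pairs differ. If you need that level of detail, a post-hoc pairwise test such as Tukey's HSD is a natural follow-up. A minimal sketch using scipy.stats.tukey_hsd (available in recent scipy versions):

from scipy.stats import tukey_hsd

# Split mpg values by origin -- the same groups fed to f_oneway above
groups = {cat: df.loc[df['origin'] == cat, 'mpg'] for cat in df['origin'].unique()}

# Pairwise mean comparisons with family-wise error control
res = tukey_hsd(*groups.values())
print(list(groups.keys()))  # group order used in the result table
print(res)                  # pairwise differences with p-values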

Categoric-to-Categoric Relationship

Finally, we will evaluate the scenario in which we have two categorical variables. Considering our dataset, we’ll see if different origins produce different car efficiency profiles.

In this case, a count plot with breakdown is the appropriate bivariate visualization. We’ll show the frequency of cars for each origin, broken down by efficiency flag (yes/no).

In terms of the statistical testing method to use, the chi-square test is the way to go. Using this test, we want to validate whether different car origins have different distributions of efficient vs. inefficient cars.

In Python, we can use the chisquare function from the scipy library. However, unlike the previous cases, we must first prepare the data. Specifically, we need to calculate the "expected frequency" of each origin-efficiency combination, i.e., the frequency we would expect if origin and efficiency were independent (row total times column total, divided by the overall total).

For readers who want a more in-depth explanation of the expected frequency concept and the overall mechanics of the chi-square test, I recommend reading my blog post on the subject, which is attached below.

The code to perform this data preparation is given below.

# create frequency table of each origin-efficiency pair
chi_df = (
    df[['origin','efficiency']]
    .value_counts()
    .reset_index()
    .sort_values(['origin','efficiency'], ignore_index=True)
)

# calculate expected frequency for each pair
n = chi_df['count'].sum()

exp = []
for i in range(len(chi_df)):
    # totals for this row's origin value and efficiency value
    sum_row = chi_df.loc[chi_df['origin']==chi_df['origin'][i],'count'].sum()
    sum_col = chi_df.loc[chi_df['efficiency']==chi_df['efficiency'][i],'count'].sum()
    e = sum_row * sum_col / n  # expected frequency under independence
    exp.append(e)

chi_df['exp'] = exp
chi_df

chi_df result (Image by Author)

Finally, we can run the code below to draw the count plot of car origins broken down by efficiency flag. Furthermore, we use chi_df to perform the chi-square test and obtain the p-value.

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare

# Create a count plot with hue
sns.set(style="whitegrid")
ax = sns.countplot(data=df, x="origin", hue="efficiency")

# Calculate chi-square p-value
# ddof=3 adjusts the degrees of freedom from k-1 = 5 down to
# (rows-1)*(cols-1) = 2, as appropriate for this 3x2 contingency table
p_value = chisquare(chi_df['count'], chi_df['exp'], ddof=3)[1]

# Set title with chi-square p-value
ax.set_title(f'Count Plot efficiency vs origin (chi2 p-value: {p_value:.4f})')

plt.show()

Count plot efficiency vs origin (Image by Author)

The plot indicates that there are differences in the distribution of efficient cars across origins (p-value < 0.05). We can see that American cars are mostly inefficient, whereas Japanese and European cars follow the opposite pattern.
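
As a side note, if you prefer to skip the manual expected-frequency preparation, scipy also provides chi2_contingency, which takes a contingency table, derives the expected frequencies itself, and uses the appropriate (rows-1)*(cols-1) degrees of freedom. A minimal equivalent sketch:

import pandas as pd
from scipy.stats import chi2_contingency

# Build the origin x efficiency contingency table directly
contingency = pd.crosstab(df['origin'], df['efficiency'])

# chi2_contingency computes expected frequencies and degrees of freedom for us
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f'chi2 = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}')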

In this blog post, we learned how to pair bivariate visualizations with appropriate statistical testing methods. This makes our multivariate EDA more robust by filtering out noise-induced relationships that might otherwise look meaningful based solely on visual inspection of bivariate plots.

I hope this article will help you during your next EDA exercise! All in all, thanks for reading, and let’s connect with me on LinkedIn! 👋
