The Chi-squared test for independence is a statistical procedure for assessing whether two categorical variables are associated or independent. In real estate, a property's visual appeal often affects its valuation, but how often do you associate a house's external allure with functional features like a garage? Using the Ames housing dataset, this post tests whether there is a statistically significant association between the external quality of a house and the presence of a garage.
Let’s get started.
Overview
This post is divided into four parts; they are:
- Understanding the Chi-Squared Test
- How the Chi-Squared Test Works
- Unraveling the Association Between External Quality and Garage Presence
- Important Caveats
Understanding the Chi-Squared Test
The Chi-squared ($\chi^2$) test is useful because of its ability to test for associations between categorical variables. It’s particularly valuable when working with nominal or ordinal data, where the variables are divided into categories or groups. The primary purpose of the Chi-squared test is to determine whether there is a statistically significant association between two categorical variables. In other words, it helps to answer questions such as:
- Are two categorical variables independent of each other?
- If the variables are independent, changes in one variable are not related to changes in the other. There is no association between them.
- Is there a significant association between the two categorical variables?
- If the variables are associated, changes in one variable are related to changes in the other. The Chi-squared test helps to quantify whether this association is statistically significant.
In your study, you focus on the external quality of a house (categorized as “Great” or “Average”) and its relation to the presence or absence of a garage. For the results of the Chi-squared test to be valid, the following conditions must be satisfied:
- Independence: The observations must be independent, meaning the occurrence of one outcome shouldn't affect another. The dataset satisfies this because each entry represents a distinct house.
- Sample Size: The dataset should not only be randomly sampled but also large enough to be representative. The data, sourced from Ames, Iowa, meets this criterion, with 2,579 houses in this analysis.
- Expected Frequency: Every cell in the contingency table should have an expected frequency of at least 5. This is vital for the test's reliability, as the Chi-squared test relies on a large-sample approximation. You will check this condition below by computing and printing the expected frequencies.
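As a minimal sketch of the third condition, the expected frequency of each cell under independence is the row total times the column total divided by the grand total. The snippet uses the observed counts that this post prints in its output further below; the variable names are illustrative and not part of the original code.

```python
# Observed counts as reported in this post's output
# (rows: Average, Great; columns: No Garage, With Garage).
observed = [[121, 1544],
            [8, 906]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under independence: expected = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for label, row in zip(["Average", "Great"], expected):
    print(label, [round(value, 1) for value in row])

# The condition requires every expected frequency to be at least 5.
min_expected = min(min(row) for row in expected)
print("Smallest expected frequency:", round(min_expected, 1))
print("Condition met:", min_expected >= 5)
```

The rounded values should agree with the expected frequencies that SciPy reports in the analysis.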
How the Chi-Squared Test Works
The Chi-squared test compares the observed frequencies in a contingency table with the frequencies that would be expected if the two variables were independent. The contingency table is a cross-tabulation of the two categorical variables, showing how many observations fall into each combination of categories.
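To make the idea of a contingency table concrete, here is a small sketch that counts category pairs by hand. The six houses are hypothetical and are not taken from the Ames dataset; they only illustrate the counting that `pd.crosstab` performs in the analysis later in this post.

```python
from collections import Counter

# Hypothetical (quality, garage) pairs for six houses; not Ames data.
houses = [
    ("Great", "With Garage"),
    ("Great", "With Garage"),
    ("Average", "With Garage"),
    ("Average", "No Garage"),
    ("Average", "With Garage"),
    ("Great", "No Garage"),
]

# Count how many houses fall into each combination of categories.
counts = Counter(houses)
rows = ["Average", "Great"]
cols = ["No Garage", "With Garage"]
table = [[counts[(r, c)] for c in cols] for r in rows]

# Print the cross-tabulation with one row per quality group.
print(f"{'':10}" + "".join(f"{c:>14}" for c in cols))
for r, row in zip(rows, table):
    print(f"{r:10}" + "".join(f"{n:>14}" for n in row))
```

Each cell holds the number of houses with that particular pair of categories, which is all the Chi-squared test needs as input.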
- Null Hypothesis ($H_0$): The two variables are independent, i.e., the proportion of houses with and without a garage is the same in every quality group, so the observed frequencies should be close to the expected ones.
- Alternative Hypothesis ($H_1$): The two variables are associated, i.e., the proportion of houses with and without a garage differs depending on the value of the other variable (the quality of the house).
The test statistic in the Chi-squared test is calculated by comparing the observed and expected frequencies in each cell of the contingency table. The larger the difference between observed and expected frequencies, the larger the Chi-squared statistic becomes. The test then produces a p-value: the probability of seeing an association at least as strong as the one observed if the variables were in fact independent. If the p-value is below a chosen significance level $\alpha$ (commonly 0.05), the null hypothesis of independence is rejected, suggesting a significant association.
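In symbols, with $O_{ij}$ the observed and $E_{ij}$ the expected frequency of cell $(i, j)$, the Pearson statistic is $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$. The sketch below computes it by hand for a hypothetical 2x2 table; the function name `pearson_chi2` and the counts are illustrative assumptions. The p-value shortcut via `math.erfc` is valid only for 1 degree of freedom, which is the case for a 2x2 table.

```python
import math

def pearson_chi2(observed):
    """Return (statistic, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected counts under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of (O - E)^2 / E over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected

# Hypothetical 2x2 table, used only for illustration.
observed = [[30, 10],
            [20, 40]]

chi2, expected = pearson_chi2(observed)

# With 1 degree of freedom, the upper tail of the chi-squared
# distribution at x equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

alpha = 0.05
print(f"Chi-squared statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4e}")
print("Reject independence:", p_value < alpha)
```

For tables with more than one degree of freedom, a library routine such as `scipy.stats.chi2_contingency` is the more practical way to obtain the p-value.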
Unraveling the Association Between External Quality and Garage Presence
Using the Ames housing dataset, you set out to determine whether there’s an association between a house’s external quality and the presence or absence of a garage. Let’s delve into the specifics of our analysis, supported by the corresponding Python code.
```python
# Importing the essential libraries
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Extracting the relevant columns
exterqual_garagefinish_data = Ames[['ExterQual', 'GarageFinish']].copy()

# Filling missing values in the 'GarageFinish' column with 'No Garage'
# (assigning the result back avoids a chained in-place fillna,
# which newer pandas versions warn about)
exterqual_garagefinish_data['GarageFinish'] \
    = exterqual_garagefinish_data['GarageFinish'].fillna('No Garage')

# Grouping 'GarageFinish' into 'With Garage' and 'No Garage'
exterqual_garagefinish_data['Garage Group'] \
    = exterqual_garagefinish_data['GarageFinish'] \
    .apply(lambda x: 'With Garage' if x != 'No Garage' else 'No Garage')

# Grouping 'ExterQual' into 'Great' and 'Average'
exterqual_garagefinish_data['Quality Group'] \
    = exterqual_garagefinish_data['ExterQual'] \
    .apply(lambda x: 'Great' if x in ['Ex', 'Gd'] else 'Average')

# Constructing the simplified contingency table
simplified_contingency_table \
    = pd.crosstab(exterqual_garagefinish_data['Quality Group'],
                  exterqual_garagefinish_data['Garage Group'])

# Printing the Observed Frequencies
print("Observed Frequencies:")
observed_df = pd.DataFrame(simplified_contingency_table,
                           index=["Average", "Great"],
                           columns=["No Garage", "With Garage"])
print(observed_df)
print()

# Performing the Chi-squared test
chi2_stat, p_value, _, expected_freq = chi2_contingency(simplified_contingency_table)

# Printing the Expected Frequencies
print("Expected Frequencies:")
print(pd.DataFrame(expected_freq,
                   index=["Average", "Great"],
                   columns=["No Garage", "With Garage"]).round(1))
print()

# Printing the results of the test
print(f"Chi-squared Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4e}")
```
The output should be:
```
Observed Frequencies:
         No Garage  With Garage
Average        121         1544
Great            8          906

Expected Frequencies:
         No Garage  With Garage
Average       83.3       1581.7
Great         45.7        868.3

Chi-squared Statistic: 49.4012
p-value: 2.0862e-12
```
The code above performs three steps:
Data Loading & Preparation:
- You began by loading the dataset and extracting the pertinent columns: `ExterQual` (Exterior Quality) and `GarageFinish` (Garage Finish).
- Recognizing the missing values in `GarageFinish`, you sensibly imputed them with the label `"No Garage"`, indicating houses devoid of garages.
Data Grouping for Simplification:
- You further categorized the `GarageFinish` data into two groups: “With Garage” (for houses with any kind of garage) and “No Garage”.
- Similarly, you grouped the `ExterQual` data into “Great” (houses with excellent or good exterior quality) and “Average” (houses with average or fair exterior quality).
Chi-squared Test:
- With the data prepared, you constructed a contingency table of the observed frequencies for the newly formed categories. This is the first table printed in the output; the second table holds the expected frequencies returned by the test.
- You then performed a Chi-squared test on this contingency table using SciPy. The printed p-value (about $2.1 \times 10^{-12}$) is far below $\alpha = 0.05$, so you reject the null hypothesis of independence: there is a statistically significant association between a house’s external quality and the presence of a garage in this dataset.
- The expected frequencies also confirm the third condition of the Chi-squared test: every cell has an expected frequency of at least 5 (the smallest is 45.7).
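One detail worth knowing when checking these numbers by hand: for a 2x2 table (1 degree of freedom), `chi2_contingency` applies Yates' continuity correction by default (`correction=True`). The following sketch, written in plain Python with illustrative variable names, recomputes the statistic from the observed counts printed above. With the correction it should land at about 49.40, in line with the reported value; without it, the plain Pearson statistic comes out slightly larger.

```python
import math

# Observed counts from the output above
# (rows: Average, Great; columns: No Garage, With Garage).
observed = [[121, 1544],
            [8, 906]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

cells = [(o, e)
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]

# Plain Pearson statistic.
chi2_plain = sum((o - e) ** 2 / e for o, e in cells)

# Yates' correction: move each observed count 0.5 towards its expected
# value before squaring (every |O - E| here is far larger than 0.5).
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

# Upper tail of the chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2_yates / 2))

print(f"Pearson statistic (no correction): {chi2_plain:.4f}")
print(f"Statistic with Yates' correction:  {chi2_yates:.4f}")
print(f"p-value (corrected):               {p_value:.4e}")
```

Either way the conclusion is the same here, because both statistics are far beyond the critical value for $\alpha = 0.05$.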
Through this analysis, you not only refined and simplified the data to make it more interpretable but also provided statistical evidence of an association between two categorical variables of interest.
Important Caveats
The Chi-squared test, despite its utility, has its limitations:
- No Causation: While the test can determine association, it doesn’t infer causation. So, even though there’s a significant link between a house’s external quality and its garage presence, you can’t conclude that one causes the other.
- Directionality: The test indicates an association but doesn’t specify its direction. However, our data suggests that houses labeled as “Great” in terms of external quality are more likely to have garages than those labeled as “Average”.
- Magnitude: The test doesn’t provide insights into the relationship’s strength. Other metrics, like Cramér’s V, would be more informative in this regard.
- External Validity: Our conclusions are specific to the Ames dataset. Caution is advised when generalizing these findings to other regions.
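The direction and magnitude caveats can be addressed with two quick calculations on the observed counts from this post. The sketch below compares the share of houses with a garage in each quality group, then computes Cramér's V as $\sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$. Using the uncorrected Pearson statistic for Cramér's V is an assumption made here, following the usual convention; the variable names are illustrative.

```python
import math

# Observed counts from this post's output.
observed = {
    "Average": {"No Garage": 121, "With Garage": 1544},
    "Great": {"No Garage": 8, "With Garage": 906},
}

# Direction: share of houses with a garage in each quality group.
shares = {}
for group, counts in observed.items():
    shares[group] = counts["With Garage"] / sum(counts.values())
    print(f"{group}: {shares[group]:.1%} with garage")

# Magnitude: Cramér's V from the uncorrected Pearson statistic.
table = [list(counts.values()) for counts in observed.values()]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
cramers_v = math.sqrt(chi2 / (n * (min(len(row_totals), len(col_totals)) - 1)))
print(f"Cramér's V: {cramers_v:.3f}")
```

Cramér's V ranges from 0 (no association) to 1 (perfect association). The value here is roughly 0.14, a reminder that a very small p-value does not by itself imply a strong association.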
Summary
In this post, you delved into the Chi-squared test and its application on the Ames housing dataset. You discovered a significant association between a house’s external quality and the presence of a garage.
Specifically, you learned:
- The fundamentals and practicality of the Chi-squared test.
- The Chi-squared test revealed a significant association between a house’s external quality and the presence of a garage in the Ames dataset. Houses with a “Great” external quality rating showed a higher likelihood of having a garage when compared to those with an “Average” rating, a trend that was statistically significant.
- The vital caveats and limitations of the Chi-squared test.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.