Introduction
In the world of data science, there’s a challenging reality: the pristine, textbook examples we often learn from rarely mirror the complexity of real-world data. While academic exercises present neatly organized datasets, actual projects involve confronting outliers, skewed distributions, and unpredictable variances. This mismatch requires data scientists to be adaptable and robust in their approach.
A previous discussion introduced an exploratory data analysis pipeline using Pingouin, a library adept at identifying when data fails to adhere to standard assumptions like homoscedasticity and normality. But when traditional tests falter, dismissing the data isn’t the answer; the solution lies in adopting robust statistical methods.
This article delves into using robust statistics with Pingouin, offering techniques to derive reliable insights from imperfect data. Through three scenarios, we’ll demonstrate how to navigate the challenges of real-world datasets using Python’s Pingouin library.
Initial Setup
To get started, ensure you have Pingouin and Pandas installed. We’ll work with the wine quality dataset, which exemplifies the kind of messy data often encountered in practice.
!pip install pingouin pandas
import pandas as pd
import pingouin as pg
# Load the wine quality dataset
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)
# Preview the data
df.head()
Having previously explored Pingouin, you’ll recall that this dataset is far from perfect, challenging several typical assumptions. We’ll now embark on three “adventures,” each addressing a specific data challenge with a robust solution.
Adventure 1: When the Normality Test Fails
Imagine testing the normality of alcohol content in white and red wine samples. Despite our hopes, neither distribution proves normal, indicated by low p-values. Non-normality often signals potential outliers or skewness, making standard t-tests risky.
In such cases, the Mann-Whitney U test is a robust alternative. This test compares data ranks, mitigating outlier effects. Here’s how to implement it:
# Separate the groups
red_wine = df[df['type'] == 'red']['alcohol']
white_wine = df[df['type'] == 'white']['alcohol']
# Conduct the Mann-Whitney U test
mwu_results = pg.mwu(x=red_wine, y=white_wine)
print(mwu_results)
The output:
         U-val alternative     p-val       RBC      CLES
MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903
The p-value suggests no significant difference in alcohol content between wine types, a conclusion robust against outliers and skewness.
Adventure 2: When the Paired t-Test Fails
Consider comparing two measurements taken from the same subject, such as a patient’s sugar levels before and after a treatment. If the differences in these paired measures aren’t normally distributed, a standard paired t-test is unreliable.
The Wilcoxon signed-rank test is the ideal alternative: it ranks the absolute differences between paired measures. To illustrate, we compare two measurements taken on the same wines, fixed acidity and volatile acidity. Here’s how it’s done in Pingouin:
# Run the Wilcoxon signed-rank test
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)
Output:
          W-val alternative  p-val  RBC  CLES
Wilcoxon    0.0   two-sided    0.0  1.0   1.0
The zero W statistic and near-zero p-value indicate a significant difference between the two measures: fixed acidity exceeds volatile acidity in essentially every sample, as the rank-biserial correlation of 1.0 confirms.
Adventure 3: When ANOVA Fails
Finally, let’s assess whether residual sugar levels differ across wine quality scores, which range from 3 to 9. If the variance in sugar levels is inconsistent across categories, traditional ANOVA may mislead due to its equal variance assumption.
Welch’s ANOVA relaxes the equal-variance assumption by weighting each group by the inverse of its variance and adjusting the degrees of freedom accordingly, ensuring fair comparisons. Here’s how to apply it:
# Perform Welch's ANOVA
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)
Output:
    Source  ddof1      ddof2          F         p-unc       np2
0  quality      6  54.507934  10.918282  5.937951e-08  0.008353
Welch’s ANOVA indicates a significant difference in sugar levels across quality scores, despite the unequal variances. However, sugar is just one factor among many influencing wine quality, as reflected by the small partial eta-squared (np2) value.
Conclusion
Through these scenarios, we see that adept data scientists thrive not on perfect data, but on their ability to adapt when data defies expectations. Pingouin offers robust tests to bypass flawed assumptions and glean valid insights with minimal effort.
Ivan Palomares Carrascosa is a distinguished leader in AI, machine learning, and deep learning, sharing his expertise to guide others in real-world AI applications.