Introduction
In the world of data science, there’s a challenging reality: the pristine, textbook examples we often learn from rarely mirror the complexity of real-world data. While academic exercises present neatly organized datasets, actual projects involve confronting outliers, skewed distributions, and unpredictable variances. This mismatch requires data scientists to be adaptable and robust in their approach.
A previous discussion introduced an exploratory data analysis pipeline using Pingouin, a library adept at identifying when data fails to adhere to standard assumptions like homoscedasticity and normality. But when traditional tests falter, dismissing the data isn’t the answer; the solution lies in adopting robust statistical methods.
This article delves into using robust statistics with Pingouin, offering techniques to derive reliable insights from imperfect data. Through three scenarios, we’ll demonstrate how to navigate the challenges of real-world datasets using Python’s Pingouin library.
Initial Setup
To get started, ensure you have Pingouin and Pandas installed. We’ll work with the wine quality dataset, which exemplifies the kind of messy data often encountered in practice.
!pip install pingouin pandas
import pandas as pd
import pingouin as pg
# Load the wine quality dataset
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)
# Preview the data
df.head()
Having previously explored Pingouin, you’ll recall that this dataset is far from perfect, challenging several typical assumptions. We’ll now embark on three “adventures,” each addressing a specific data challenge with a robust solution.
Adventure 1: When the Normality Test Fails
Imagine testing the normality of alcohol content in white and red wine samples. Despite our hopes, neither distribution proves normal, indicated by low p-values. Non-normality often signals potential outliers or skewness, making standard t-tests risky.
In such cases, the Mann-Whitney U test is a robust alternative. This test compares data ranks, mitigating outlier effects. Here’s how to implement it:
# Separate the groups
red_wine = df[df['type'] == 'red']['alcohol']
white_wine = df[df['type'] == 'white']['alcohol']
# Conduct the Mann-Whitney U test
mwu_results = pg.mwu(x=red_wine, y=white_wine)
print(mwu_results)
The output:
         U-val alternative     p-val       RBC      CLES
MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903
The p-value suggests no significant difference in alcohol content between wine types, a conclusion robust against outliers and skewness.
Adventure 2: When the Paired t-Test Fails
Consider comparing two measurements taken from the same subject, such as a patient’s sugar levels before and after a treatment. If the differences in these paired measures aren’t normally distributed, a standard paired t-test is unreliable.
The Wilcoxon signed-rank test is the ideal alternative: it ranks the absolute differences between paired measures. To illustrate, we compare two measurements taken on the same wines, fixed acidity and volatile acidity. Here’s how it’s done in Pingouin:
# Run the Wilcoxon signed-rank test
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)
Output:
          W-val alternative  p-val  RBC  CLES
Wilcoxon    0.0   two-sided    0.0  1.0   1.0
The zero W statistic and near-zero p-value indicate a significant difference between the two measures: fixed acidity exceeds volatile acidity in essentially every sample, as the rank-biserial correlation of 1.0 confirms.
Adventure 3: When ANOVA Fails
Finally, let’s assess whether residual sugar levels differ across wine quality scores, which range from 3 to 9. If the variance in sugar levels is inconsistent across categories, traditional ANOVA may mislead due to its equal variance assumption.
Welch’s ANOVA relaxes the equal-variance assumption by weighting each group by the inverse of its variance and adjusting the degrees of freedom accordingly, ensuring fair comparisons. Here’s how to apply it:
# Perform Welch's ANOVA
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)
Output:
    Source  ddof1      ddof2          F         p-unc       np2
0  quality      6  54.507934  10.918282  5.937951e-08  0.008353
Welch’s ANOVA indicates a significant difference in sugar levels across quality scores, despite the unequal variances. However, sugar is just one factor among many influencing wine quality, as reflected by the small partial eta-squared (np2) value.
Conclusion
Through these scenarios, we see that adept data scientists thrive not on perfect data, but on their ability to adapt when data defies expectations. Pingouin offers robust tests to bypass flawed assumptions and glean valid insights with minimal effort.
Ivan Palomares Carrascosa is a distinguished leader in AI, machine learning, and deep learning, sharing his expertise to guide others in real-world AI applications.