
Building modern EDA pipelines with Pingouin

Introduction

Anyone who has spent significant time in data science inevitably comes to realize an essential truth: the quality of input data profoundly influences the outcomes of machine learning models. This principle, often summarized as garbage in, garbage out, underscores the importance of robust data preprocessing and validation.

For instance, using highly collinear data in linear regression or running ANOVA tests on data with heteroskedastic variances can lead to suboptimal models. Exploratory Data Analysis (EDA) offers valuable insights through visualizations like scatterplots and histograms, but these tools alone may not suffice for rigorously validating data against the mathematical assumptions required by downstream analyses or models. Enter Pingouin, a library that bridges the gap between two well-established data science libraries, SciPy and pandas, and a powerful ally in building robust, automated EDA pipelines. This article explores how to build a comprehensive pipeline for rigorous statistical EDA, validating several critical data properties.

Initial Setup

To begin, let’s ensure that Pingouin is installed in our Python environment, along with pandas, if they aren’t already:

!pip install pingouin pandas

Once installed, we can import these key libraries and load our data. For this demonstration, we’ll use a dataset containing wine properties and quality scores.

import pandas as pd
import pingouin as pg

# Load the wine quality dataset from a public repository
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)

# Display the first rows to get a feel for the data
df.head()

Checking for Univariate Normality

The first step in our exploratory analysis involves checking for univariate normality. Many traditional machine learning algorithms and statistical tests, such as ANOVA and t-tests, assume that continuous variables follow a normal (Gaussian) distribution. Using Pingouin’s pg.normality() function, we can run a Shapiro-Wilk test on each selected column of the dataframe:

# Selecting a subset of continuous features for normality check
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'alcohol']

# Running the normality test
normality_results = pg.normality(df[features])
print(normality_results)

Output:

                         W          pval  normal
fixed acidity     0.879789  2.437973e-57   False
volatile acidity  0.875867  6.255995e-58   False
citric acid       0.964977  5.262332e-37   False
pH                0.991448  2.204049e-19   False
alcohol           0.953532  2.918847e-41   False

None of the selected features is normally distributed. This is not necessarily a problem with the data; it simply reflects its inherent characteristics. Subsequent preprocessing steps might apply transformations such as log or Box-Cox to bring the distributions closer to normal, making them suitable for models that assume normality.
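
As a quick illustration (not part of the original pipeline), here is a minimal sketch of such a transformation using SciPy’s boxcox followed by a re-test; the small shift applied to each column is an assumption on our part, since Box-Cox requires strictly positive values:

from scipy import stats

transformed = df[features].copy()
for col in features:
    # Box-Cox requires strictly positive values, so shift each column
    shifted = transformed[col] - transformed[col].min() + 1e-6
    transformed[col], _ = stats.boxcox(shifted)

# Re-run the normality test on the transformed features
print(pg.normality(transformed))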

Checking Multivariate Normality

Beyond univariate normality, we should also assess multivariate normality—an important consideration for techniques such as multivariate ANOVA (MANOVA). Here’s how we can perform this check:

# Henze-Zirkler multivariate normality test
multivariate_normality_results = pg.multivariate_normality(df[features])
print(multivariate_normality_results)

Output:

HzResults(hz=np.float64(23.72107048442373), pval=np.float64(0.0), normal=False)

Multivariate normality doesn’t hold either, suggesting that distribution-free, tree-based models such as gradient boosting and random forests may be more robust choices than models whose assumptions involve normality, such as linear regression or linear discriminant analysis.
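
To make that recommendation concrete, here is a minimal sketch, assuming scikit-learn is available, of cross-validating a random forest on the quality score; the hyperparameters are purely illustrative:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Tree-based models make no distributional assumptions about the inputs
rf = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(rf, df[features], df['quality'], cv=5, scoring='r2')
print(f"Mean CV R^2: {scores.mean():.3f}")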

Checking Homoscedasticity

The next step involves testing for homoscedasticity: the equality of variances across groups (or, in a regression setting, across the residuals). Many downstream analyses, such as ANOVA, assume this property. We can test it using Pingouin’s implementation of Levene’s test:

# Levene's test for equal variances between groups
# 'dv' is the target dependent variable, 'group' is the categorical variable
homoscedasticity_results = pg.homoscedasticity(data=df, dv='alcohol', group='quality')
print(homoscedasticity_results)

Result:

                W          pval  equal_var
levene  66.338684  2.317649e-80      False

The result indicates heteroscedasticity: the variance of alcohol differs across quality groups, something to account for in subsequent analyses. One common remedy is to use heteroscedasticity-robust standard errors when fitting regression models, as sketched below.
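
As one possible illustration, assuming statsmodels is installed, this sketch fits an OLS model with heteroscedasticity-robust (HC3) standard errors; the choice of predictors here is arbitrary and for demonstration only:

import statsmodels.api as sm

# Fit OLS with HC3 robust standard errors to account for heteroscedasticity
X = sm.add_constant(df[['fixed acidity', 'volatile acidity', 'pH']])
model = sm.OLS(df['alcohol'], X).fit(cov_type='HC3')
print(model.summary())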

Checking Sphericity

Sphericity is another statistical property to examine: whether the variances of the differences between all pairwise combinations of conditions are equal. Mauchly’s test is most commonly required before repeated-measures ANOVA; a strong rejection here also signals correlated variables, which is worth knowing before applying dimensionality reduction techniques such as PCA:

# Mauchly sphericity test
sphericity_results = pg.sphericity(df[features])
print(sphericity_results)

Result:

SpherResults(spher=False, W=np.float64(0.004437706589942777), chi2=np.float64(35184.26583883276), dof=9, pval=np.float64(0.0))

The test firmly rejects sphericity, indicating substantial correlation between the variables. That redundancy is exactly what PCA exploits, so dimensionality reduction could be beneficial for this dataset.
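
As a brief follow-up sketch, assuming scikit-learn is available, we can standardize the features and inspect how much variance each principal component explains:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features so PCA is not dominated by scale differences
scaled = StandardScaler().fit_transform(df[features])
pca = PCA().fit(scaled)
print(pca.explained_variance_ratio_.round(3))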

Checking Multicollinearity

Finally, we examine multicollinearity: whether predictors are highly correlated with one another. Strong multicollinearity is problematic for interpretable models such as linear regression:

# Correlation matrix with significance markers (Pingouin registers
# the .rcorr() method on pandas DataFrames)
correlation_matrix = df[features].rcorr(method='pearson')
print(correlation_matrix)

Output matrix:

                  fixed acidity  volatile acidity  citric acid     pH  alcohol
fixed acidity                 -               ***          ***    ***      ***
volatile acidity          0.219                 -          ***    ***       **
citric acid               0.324            -0.378            -    ***
pH                       -0.253             0.261        -0.33      -      ***
alcohol                  -0.095            -0.038        -0.01  0.121        -

Pingouin’s correlation matrix shows the pairwise correlation coefficients in the lower triangle and marks their statistical significance with asterisks in the upper triangle. None of the correlations is excessively large, indicating that these features carry largely non-overlapping information for further analyses.
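
As a complementary check that goes beyond Pingouin’s API (an addition of ours, using statsmodels), variance inflation factors quantify how strongly each predictor is explained by the others; values above roughly 5-10 are commonly flagged as problematic:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for each feature (index 0 is the added constant)
X = sm.add_constant(df[features])
for i, col in enumerate(features, start=1):
    print(f"{col}: {variance_inflation_factor(X.values, i):.2f}")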

Conclusion

Through a series of practical examples, this article demonstrated how to leverage Pingouin, an open-source Python library, to build robust, modern EDA pipelines. By applying rigorous statistical tests up front, these pipelines support better decisions about data preprocessing and about which downstream analyses and machine learning models are appropriate.

Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He trains and guides others in leveraging AI in the real world.


