
Building modern EDA pipelines with Pingouin

Introduction

Anyone who has spent significant time in data science inevitably comes to realize an essential truth: the quality of input data profoundly influences the outcomes of machine learning models. This principle, often summarized as garbage in, garbage out, underscores the importance of robust data preprocessing and validation.

For instance, using highly collinear data in linear regression or running ANOVA tests on data with heteroskedastic variances can lead to suboptimal models. Exploratory Data Analysis (EDA) offers valuable insights through visualizations like scatterplots and histograms, but these tools alone may not suffice for rigorously validating data against the mathematical assumptions required by downstream analyses or models. Enter Pingouin, a library that bridges the gap between two well-established data science libraries, SciPy and pandas, and a powerful ally in building robust, automated EDA pipelines. This article explores how to build a comprehensive pipeline for rigorous statistical EDA, validating several critical data properties.

Initial Setup

To begin, let’s ensure that Pingouin is installed in our Python environment, along with pandas, if they aren’t already:

!pip install pingouin pandas

Once installed, we can import these key libraries and load our data. For this demonstration, we’ll use a dataset containing wine properties and quality scores.

import pandas as pd
import pingouin as pg

# Load the wine quality dataset from a public repository
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)

# Display the first rows to get a feel for the data
df.head()

Checking for Univariate Normality

The first step in our exploratory analysis involves checking for univariate normality. Many traditional machine learning algorithms and statistical tests, such as ANOVA and t-tests, assume that continuous variables follow a normal (Gaussian) distribution. Using Pingouin’s pg.normality() function, we can run a Shapiro-Wilk test on each selected column of the dataframe:

# Selecting a subset of continuous features for normality check
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'alcohol']

# Running the normality test
normality_results = pg.normality(df[features])
print(normality_results)

Output:

                         W          pval  normal
fixed acidity     0.879789  2.437973e-57   False
volatile acidity  0.875867  6.255995e-58   False
citric acid       0.964977  5.262332e-37   False
pH                0.991448  2.204049e-19   False
alcohol           0.953532  2.918847e-41   False

None of the selected features is normally distributed. This is not necessarily a problem with the data; it simply reflects its inherent characteristics. Subsequent preprocessing steps might apply transformations such as log or Box-Cox to bring the distributions closer to normal, making them suitable for models that assume normality.
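
As a quick illustration (not part of the original pipeline), here is a minimal sketch of such a transformation using SciPy’s boxcox followed by a re-test; the small shift applied to each column is an assumption on our part, since Box-Cox requires strictly positive values:

from scipy import stats

transformed = df[features].copy()
for col in features:
    # Box-Cox requires strictly positive values, so shift each column
    shifted = transformed[col] - transformed[col].min() + 1e-6
    transformed[col], _ = stats.boxcox(shifted)

# Re-run the normality test on the transformed features
print(pg.normality(transformed))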

Checking Multivariate Normality

Beyond univariate normality, we should also assess multivariate normality—an important consideration for techniques such as multivariate ANOVA (MANOVA). Here’s how we can perform this check:

# Henze-Zirkler multivariate normality test
multivariate_normality_results = pg.multivariate_normality(df[features])
print(multivariate_normality_results)

Output:

HzResults(hz=np.float64(23.72107048442373), pval=np.float64(0.0), normal=False)

Multivariate normality doesn’t hold either, suggesting that distribution-free, tree-based models such as gradient boosting and random forests may be more robust choices than models whose assumptions involve normality, such as linear regression or linear discriminant analysis.
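
To make that recommendation concrete, here is a minimal sketch, assuming scikit-learn is available, of cross-validating a random forest on the quality score; the hyperparameters are purely illustrative:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Tree-based models make no distributional assumptions about the inputs
rf = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(rf, df[features], df['quality'], cv=5, scoring='r2')
print(f"Mean CV R^2: {scores.mean():.3f}")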

Checking Homoscedasticity

The next step involves testing for homoscedasticity: the equality of variances across groups (or, in a regression setting, across the residuals). Many downstream analyses, such as ANOVA, assume this property. We can test it using Pingouin’s implementation of Levene’s test:

# Levene's test for equal variances between groups
# 'dv' is the target dependent variable, 'group' is the categorical variable
homoscedasticity_results = pg.homoscedasticity(data=df, dv='alcohol', group='quality')
print(homoscedasticity_results)

Result:

                W          pval  equal_var
levene  66.338684  2.317649e-80      False

The result indicates heteroscedasticity: the variance of alcohol differs across quality groups, something to account for in subsequent analyses. One common remedy is to use heteroscedasticity-robust standard errors when fitting regression models, as sketched below.
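
As one possible illustration, assuming statsmodels is installed, this sketch fits an OLS model with heteroscedasticity-robust (HC3) standard errors; the choice of predictors here is arbitrary and for demonstration only:

import statsmodels.api as sm

# Fit OLS with HC3 robust standard errors to account for heteroscedasticity
X = sm.add_constant(df[['fixed acidity', 'volatile acidity', 'pH']])
model = sm.OLS(df['alcohol'], X).fit(cov_type='HC3')
print(model.summary())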

Checking Sphericity

Sphericity is another statistical property to examine: whether the variances of the differences between all pairwise combinations of conditions are equal. Mauchly’s test is most commonly required before repeated-measures ANOVA; a strong rejection here also signals correlated variables, which is worth knowing before applying dimensionality reduction techniques such as PCA:

# Mauchly sphericity test
sphericity_results = pg.sphericity(df[features])
print(sphericity_results)

Result:

SpherResults(spher=False, W=np.float64(0.004437706589942777), chi2=np.float64(35184.26583883276), dof=9, pval=np.float64(0.0))

The test firmly rejects sphericity, indicating substantial correlation between the variables. That redundancy is exactly what PCA exploits, so dimensionality reduction could be beneficial for this dataset.
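
As a brief follow-up sketch, assuming scikit-learn is available, we can standardize the features and inspect how much variance each principal component explains:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features so PCA is not dominated by scale differences
scaled = StandardScaler().fit_transform(df[features])
pca = PCA().fit(scaled)
print(pca.explained_variance_ratio_.round(3))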

Checking Multicollinearity

Finally, we examine multicollinearity: whether predictors are highly correlated with one another. Strong multicollinearity is problematic for interpretable models such as linear regression:

# Correlation matrix with significance markers (Pingouin registers
# the .rcorr() method on pandas DataFrames)
correlation_matrix = df[features].rcorr(method='pearson')
print(correlation_matrix)

Output matrix:

                  fixed acidity  volatile acidity  citric acid     pH  alcohol
fixed acidity                 -               ***          ***    ***      ***
volatile acidity          0.219                 -          ***    ***       **
citric acid               0.324            -0.378            -    ***
pH                       -0.253             0.261        -0.33      -      ***
alcohol                  -0.095            -0.038        -0.01  0.121        -

Pingouin’s correlation matrix shows the pairwise correlation coefficients in the lower triangle and marks their statistical significance with asterisks in the upper triangle. None of the correlations is excessively large, indicating that these features carry largely non-overlapping information for further analyses.
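
As a complementary check that goes beyond Pingouin’s API (an addition of ours, using statsmodels), variance inflation factors quantify how strongly each predictor is explained by the others; values above roughly 5-10 are commonly flagged as problematic:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for each feature (index 0 is the added constant)
X = sm.add_constant(df[features])
for i, col in enumerate(features, start=1):
    print(f"{col}: {variance_inflation_factor(X.values, i):.2f}")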

Conclusion

Through a series of practical examples, this article demonstrated how to leverage Pingouin, an open-source Python library, to build robust, modern EDA pipelines. By applying rigorous statistical tests up front, these pipelines support better decisions about data preprocessing and about which downstream analyses and machine learning models are appropriate.

Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He trains and guides others in leveraging AI in the real world.


