Anonymizing Production Data for Data Science with Mimesis

Introduction

Production data plays a critical role in developing data-driven products and services, yet it often comes with strict confidentiality and compliance requirements. In real-world data science projects, anonymizing this data protects sensitive information while still allowing useful analysis. This article introduces Mimesis, an open-source Python library for generating realistic fake data, and shows how to use it to replace sensitive fields in production data. The step-by-step guide below can be reproduced in your own integrated development environment (IDE) or notebook.

Step-by-Step Procedure

First, install Mimesis in your Python environment with pip install mimesis. In Google Colab or similar notebook environments, prefix the command with an exclamation mark (!pip install mimesis).

In this guide, we'll simulate a simple subscription system for a software product, generating a toy dataset with customer data. This dataset includes sensitive fields such as names, emails, and phone numbers:

import pandas as pd

# Toy "production" dataset: names, emails, and phone numbers are the sensitive fields
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io', 'cbrown@domain.org', 'diana@amazon.com'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}

df = pd.DataFrame(production_data)
print("--- Original sensitive data ---")
print(df.head())

Although subscription tiers may not be sensitive, personal identifiers such as names, emails, and phone numbers are. Mimesis generates fake data through providers, classes that each cover one category of data. The Person provider generates fake personal data, and it accepts a locale and a seed so that the generated values are reproducible:

from mimesis import Person
from mimesis.locales import Locale

# English-locale provider; the fixed seed makes the generated values repeatable
person = Person(locale=Locale.EN, seed=42)

Next, we'll anonymize the personally identifiable information (PII) by replacing the sensitive columns with data generated by Mimesis. For each column, we generate one fake value per row using the Mimesis method suited to that kind of data:

# 1. Replace real names with realistic fake names
df['real_name'] = [person.full_name() for _ in range(len(df))]

# 2. Replace real emails with fake emails
df['email'] = [person.email() for _ in range(len(df))]

# 3. Replace real phone numbers with fake phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]

# 4. Rename the column to indicate that it no longer holds the real name
df.rename(columns={'real_name': 'anon_name'}, inplace=True)

Mimesis provides dedicated functions for generating full names, emails, and phone numbers. The column name is also updated to reflect that the data has been anonymized. We can now verify the anonymized data by examining the modified DataFrame:

print("\n--- Anonymized data for data science analytics ---")
print(df.head())

The output will show anonymized data, replacing sensitive fields with convincing synthetic data while maintaining the structure and critical information for analysis, such as subscription tiers:

--- Anonymized data for data science analytics ---
   user_id         anon_name                    email            phone  subscription_tier
0      101    Anthony Reilly    archived1911@duck.com     +13312271333            Premium
1      102           Kai Day    suspect2087@yahoo.com  +1-205-759-3586              Basic
2      103  Cleveland Osborn     urgent1912@yahoo.com     +13691067988              Basic
3      104       Zack Holder  johnson1881@example.com  +1-574-481-3676         Enterprise

Using Mimesis, we’ve successfully anonymized several sensitive data fields, demonstrating a straightforward method to protect privacy in data science projects. As Mimesis is open-source, it offers this functionality freely.
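Beyond inspecting the printed output, the result can also be checked programmatically. The sketch below assumes a copy of the original data was kept for comparison; find_leaked_values is a hypothetical helper written for this article, not part of Mimesis or pandas. It reports any original value that still appears in the corresponding anonymized column.

```python
import pandas as pd

def find_leaked_values(original, anonymized, columns):
    """Return original values still present after anonymization.

    `columns` maps each original column name to its anonymized counterpart.
    """
    leaks = {}
    for orig_col, anon_col in columns.items():
        remaining = set(original[orig_col]) & set(anonymized[anon_col])
        if remaining:
            leaks[orig_col] = sorted(remaining)
    return leaks

original = pd.DataFrame({
    'real_name': ['Alice Smith', 'Bob Jones'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io'],
})
anonymized = pd.DataFrame({
    # The second name is deliberately left unchanged to trigger the check.
    'anon_name': ['Anthony Reilly', 'Bob Jones'],
    'email': ['archived1911@duck.com', 'suspect2087@yahoo.com'],
})

leaks = find_leaked_values(
    original, anonymized, {'real_name': 'anon_name', 'email': 'email'}
)
print(leaks)  # {'real_name': ['Bob Jones']}
```

An empty dictionary means no original value survived in the checked columns. It does not cover columns that were never listed in the mapping.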

Best Practices and Observations

Consider whether to replace columns in the existing DataFrame or store new information separately, depending on the need to preserve original data.

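Mimesis generates values of the same type as the fields they replace (strings for names, emails, and phone numbers), so the anonymized columns can usually flow through existing pipelines without schema changes.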

Using seeds enhances reproducibility, allowing for consistent data generation across different runs.

Conclusion

This article has showcased how Mimesis can transform sensitive production datasets into anonymized versions suitable for safe analysis. By leveraging this powerful Python library, data scientists can protect private information without sacrificing analytical capabilities.

Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He is dedicated to educating and guiding others in the practical applications of AI.
