Go from nothing to a complete dataframe with Python

Photo by Joshua Sortino on Unsplash.

After submitting a recent article to Towards Data Science’s editorial team, I received a message back with a simple inquiry: are the datasets licensed for commercial use? It was a great question. The datasets in my draft came from Seaborn, a popular Python library that ships with 17 sample datasets [1]. The datasets certainly seemed open source and, sure enough, many had easily discoverable licenses authorizing commercial use. Unfortunately for me, I had picked one of the few datasets for which I couldn’t find a license. Instead of switching to a different Seaborn dataset, I decided to make my own Synthetic Data.

What is Synthetic Data?

IBM’s Kim Martineau defines Synthetic Data as “information that’s been generated on a computer to augment or replace real data to improve AI models, protect sensitive data, and mitigate bias” [2].

Synthetic Data may look like information from a real-world event, but it’s not. Because it is generated rather than collected, it avoids licensing issues, hides proprietary data, and protects personal information.

Synthetic Data differs from anonymized or masked data, which takes real data from actual events and alters certain fields to make the data non-attributional. If you’re looking to anonymize names in existing data, you can read a how-to on name anonymization here.
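To make the distinction concrete, here is a minimal sketch contrasting the two approaches. The "real" records, the column names, and the placeholder labels are all invented for illustration; they are assumptions, not data from any actual source.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" records (invented for this illustration)
real = pd.DataFrame({
    "name": ["Ana Diaz", "Ben Cho", "Cara Lee"],
    "score": [88, 72, 95],
})

# Masking: the real scores survive; only the identifying field is altered
masked = real.assign(name=[f"Student {i + 1}" for i in range(len(real))])

# Synthetic: every value is generated from scratch; no real record is involved
rng = np.random.default_rng(seed=0)
synthetic = pd.DataFrame({
    "name": [f"Student {i + 1}" for i in range(3)],
    "score": rng.integers(60, 101, size=3),  # random integers from 60 to 100
})
```

Note that the masked frame still carries the original scores, so it inherits whatever licensing or privacy constraints apply to the real data. The synthetic frame does not.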

Synthetic Data does not need to be perfect. In my previous article’s use case, I was writing a guide on how to use the pandas groupby() method. All I needed was a dataset with numeric data, categorical data, and a domain the reader would readily understand (in this case, student test scores and grades) to help me deliver the message. Based on the work for that article, below I’ll provide a guide on building a Synthetic Dataset of your own.
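As a rough preview of what such a dataset might look like, the sketch below builds a small table with one numeric column, two categorical columns, and a groupby summary. The column names, the score distribution, and the grade cutoffs are assumptions chosen for illustration, not necessarily the ones used in the full walkthrough.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the data is reproducible
n_students = 100

# Numeric column: scores drawn from a normal distribution, kept within 0-100
scores = rng.normal(loc=75, scale=10, size=n_students).clip(0, 100).round(1)

df = pd.DataFrame({
    "student_id": np.arange(1, n_students + 1),
    "class_section": rng.choice(["A", "B", "C"], size=n_students),  # categorical
    "test_score": scores,
})

# Categorical column derived from the numeric one (assumed cutoffs)
df["grade"] = pd.cut(
    df["test_score"],
    bins=[-np.inf, 60, 70, 80, 90, np.inf],
    labels=["F", "D", "C", "B", "A"],
    right=False,  # each bin includes its lower edge, e.g. 90.0 earns an A
)

# The kind of groupby call such a dataset is meant to demonstrate
summary = df.groupby("class_section")["test_score"].agg(["count", "mean"])
print(summary)
```

Nothing here depends on real student records, which is the point: the data only needs to be plausible enough to carry the example.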

Code:

The Jupyter notebook with the full Python code used in this walkthrough is available at the linked GitHub page. Download or clone the repository to follow along!

The code requires the following libraries:

# Data Handling
import pandas as pd
import numpy as np

# Data visualization
import plotly.express as px

# Anonymizer:
from faker import Faker