Get used to large-scale data processing with PySpark

PySpark is the Python API for Spark, an analytics engine for large-scale data processing. Spark has become the predominant tool in the data science ecosystem, especially when we deal with large datasets that are difficult to handle with tools like Pandas and SQL.

In this article, we’ll learn PySpark from a different perspective than most other tutorials. Instead of going over frequently used PySpark functions and explaining how to use them, we’ll solve some challenging data cleaning and processing tasks. Learning this way teaches us not only the PySpark functions but also when to use them.

Before we start with the examples, let me tell you how to get the dataset they use. It’s a sample dataset I prepared with mock data. You can download it from my datasets repository, where it’s called “sample_sales_pyspark.csv”.

Let’s start with creating a DataFrame from this dataset.

from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F

spark = SparkSession.builder.getOrCreate()

data = spark.read.csv("sample_sales_pyspark.csv", header=True)
data.show(5)
# output
| B1| 89912|2021-05-01| 14| 17654| 1261|
| B1| 89912|2021-05-02| 19| 24282| 1278|
| B1| 89912|2021-05-03| 15| 19305| 1287|
| B1| 89912|2021-05-04| 21| 28287| 1347|
| B1| 89912|2021-05-05| 4| 5404| 1351|

PySpark lets us use SQL code through its pyspark.sql module. SQL is practical and intuitive for some data preprocessing tasks, such as changing column names and data types.

The selectExpr function takes SQL expressions as strings, which makes these operations very simple, especially if you have some experience with SQL.