Introduction
The statements we write inside methods like When(), filter(), select() etc are called expressions.
In pyspark, there are 2 types of expressions:
- SQL expression
- Dataframe API expression
I'm using a simple example - from the dataframe df, get the rows where amount is greater than or equal to 200.
SQL EXPRESSIONS (STRING EXPRESSIONS)
SQL expressions are simple SQL compliant strings. These are the easiest to use, and very widely understood.
Example -
filtered_df = df.where("amount>=200")
SQL expressions are not case sensitive. This makes them very forgiving to use.
filtered_df = df.where("AmOuNt>=100 AnD AmOuNt < 400")
Most functions accept SQL expressions. If you're using a function that doesn't accept SQL expressions, and only accepts column expressions - use the expr("<sql expression string>") method to convert the sql expression into column expression.
Example using expr():
import pyspark.sql.functions as F
# Convert sql expression into column expression
filtered_df = df.where(F.expr("amount>=200"))
DATAFRAME API EXPRESSIONS
Dataframe API expressions are pythonic expressions. We build these using the functions provided by the pySpark API.
All functions accept Dataframe API expressions.
Example -
filtered_df = df.where(col("amount")>=200)
When using Dataframe API expressions, there are 3 ways to refer to a dataframe's column:
- Using the col()(recommended)
- Directly referring to the dataframe's column variable
- Referring to the column through the dataframe's dictionary
Refer using col() (recommended)
filtered_df = df.where(col("amount")>=100)
The col() method is not case sensitive. So, as long as the spelling is correct, the col() method will resolve the column perfectly.
Examples:
- WORKS - col("amount")
- WORKS - col("Amount")
- WORKS - col("AMOUNT")
- WORKS - col("AmOuNt")
- DOESN'T WORK - col("amounttttt")(spelling is wrong)Refer using dataframe's column variable
filtered_df = df.where(df.amount>=100)
This method is case sensitive. Here, we are directly referring to the variable within the dataframe object. In python, variable names are case sensitive. So we have to use the exact uppercase-lowercase to access the variable.
Examples:
- WORKS - df.amount
- DOESN'T WORK - df.Amount(case is different)
- DOESN'T WORK - df.AMOUNT(case is different)
- DOESN'T WORK - df.AmOuNt(case is different)
- DOESN'T WORK - df.amounttttt(spelling is wrong)
Refer using dataframe's dictionary
filtered_df = df.where(df["amount"]>=100)
This method is not case sensitive. Surprisingly.
As long as the spelling is correct, the col() method will resolve the column perfectly.
Examples:
- WORKS - df["amount"]
- WORKS - df["Amount"]
- WORKS - df["AMOUNT"]
- WORKS - df["AmOuNt"]
- DOESN'T WORK - df["amounttttt"]<- doesn't work because spelling is different
BEST PRACTICE
Using SQL expressions is the easiest way in most cases.
However, some teams exclusively use Dataframe API expressions in their codebases.
In case you're using column expressions, you have to choose a standard way to refer to columns in the dataframe. It is best practice to use col() to refer to a column.
# RECOMMENDED
filtered_df = df.where(col("amount")>=100)
That's it. Enjoy.