tl;dr – Use .groupby and .transform to achieve SQL-like window functionality in Pandas

Slicing and dicing data using Pandas and SQL is a key skillset for any Data Scientist.

Data Scientists will typically work with both SQL and Pandas for transforming data, depending on the specific use case and where the data is stored. For example, SQL is generally used when working directly with data in a database, e.g. creating new tables or views of the data. Pandas is used more often when preprocessing data for machine learning or visualisation workflows.

It is important to be fluent with both SQL and Pandas syntax to be able to seamlessly transform your data for the business question you are trying to answer.

However, there are cases where certain transformations are easier to express in one language than the other.

For me, this seems to always happen when trying to carry out transformations involving Window Functions using Pandas. I don’t know what it is, but I find the SQL syntax for window functions to be much more intuitive.

In this post, I will describe the differences between SQL and Pandas syntax for applying window function calculations to different partitions of your data.

What is a Window Function?

“A window function performs a calculation across a set of table rows that are somehow related to the current row.” – Postgres documentation

Window functions allow users to perform aggregations and calculations against different cross-sections (partitions) of the data. In contrast to standard aggregation functions (e.g. sum, mean, count etc.) which return a single value for each partition defined in the query, window functions return a value for each row in the original table.

When would you use a Window Function?

I find there are two main scenarios in the Data Science workflow that involve window functions:

  • interpolating or imputing missing values based on statistical properties of other values in the same group
  • creating new features for ML models

When you have panel data, window functions are useful for augmenting your dataset with new features that represent aggregate properties of specific partitions within your data.

For example, imagine you are creating a model to predict the price of houses in your neighbourhood given various information about each house such as the house type (e.g. detached, apartment etc.), floor area, number of bedrooms, number of bathrooms etc.

One feature you might want to create for your model is to compare the floor area of the current house to the median area for a house of that specific type. The hypothesis being: if the house has a larger floor area than the median area for a comparable house type, then it is more likely to be priced at the top end of the market.

In order to calculate this feature we would need to create a new column that, for each row, contains the median floor area for the current row’s house type. We could then calculate the difference between the house’s area and its group median to use as a feature in our model.

This new feature would require a window function that calculates and returns, for each row in the dataset, the median floor area of that row’s group.
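As a preview of the Pandas syntax covered later in this post, here is a minimal sketch of that feature calculation, assuming a hypothetical houses dataframe with house_type and floor_area columns:

# hypothetical 'houses' dataframe with 'house_type' and 'floor_area' columns
# median floor area for each house type, broadcast to every row
houses["median_type_area"] = houses.groupby("house_type")["floor_area"].transform("median")

# difference between each house's floor area and its group median
houses["area_vs_median"] = houses["floor_area"] - houses["median_type_area"]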

Window Functions in SQL

The Postgres database documentation has a great tutorial on window functions.

Here is a quick recap.

To form a window function in SQL you need three main parts (plus an optional fourth), as illustrated in the sketch after this list:

  • an aggregation function or calculation to apply to the target column (e.g. SUM(), RANK())
  • the OVER() keyword to initiate the window function
  • the PARTITION BY keyword, which defines which data partition(s) the aggregation function should be applied to
  • (optional) the ORDER BY keyword to define the required sorting within each data partition, for example when the order of the rows affects the value of the calculation
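Putting these parts together, a generic query might look something like this (a sketch with illustrative names, assuming a hypothetical sales table with customer, order_date and amount columns):

-- running total of each customer's orders, ordered by date
SELECT
    customer
    , order_date
    , amount
    , SUM(amount) OVER(
        PARTITION BY customer
        ORDER BY order_date
    ) AS running_total
FROM
    sales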

Window Functions in Pandas

.groupby is the basis of window functions in Pandas

I think my confusion when trying to translate SQL window functions to Pandas stems from the fact that you don’t explicitly use the GROUP BY keyword for window functions in SQL, whereas in Pandas the basis of window functions is the .groupby function.

.groupby in Pandas is analogous to the PARTITION BY keyword in SQL. The groupby clause in Pandas defines which partitions (groups) in the data the aggregation function should be applied to.

.transform allows you to apply complex transformations

In addition to specifying the data partitions using .groupby, we need to define which aggregation or calculation to apply to the data partitions. This can be achieved using the .transform function.

The .transform function takes a function (i.e. the function which calculates the desired quantity) as an argument and returns a series or dataframe with the same length as the original.

The function supplied to .transform can either be a string (for simple aggregation functions like ‘sum’, ‘mean’ or ‘count’) or a callable (e.g. a lambda function) for more complex operations.
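For example, the two forms are equivalent for a simple aggregation (a minimal sketch, assuming a hypothetical dataframe df with group and value columns):

# string form: convenient for simple built-in aggregations
df["group_mean"] = df.groupby("group")["value"].transform("mean")

# callable form: required for more complex custom calculations
df["group_mean"] = df.groupby("group")["value"].transform(lambda x: x.mean())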

You don’t always have to use .transform

If the function you are applying to your data is not an aggregation function, i.e. it naturally returns a value for every row in the dataframe rather than a single value per group, you don’t need to use the .transform function.

For example, if you want to create a new column with the values within each group shifted by one (i.e. in a timeseries, the previous day’s value) you can omit the .transform function and simply use the built-in Pandas function instead – in this case the .shift() function. This is because shifting the data naturally returns a value for each row, rather than a single aggregated value for the group.
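A minimal sketch, again assuming a hypothetical dataframe df with group and value columns:

# .shift returns one value per row, so no .transform wrapper is needed
df["previous_value"] = df.groupby("group")["value"].shift(1)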

Examples

The best way to explain the differences is by demonstrating with some examples.

For this window functions demo we will use a timeseries dataset of stock prices for various companies. The code below shows some simple setup for collecting this data.

💻 There is an accompanying notebook for this blog post available in the Engineering For Data Science GitHub repo if you want to try out the examples yourself.

Setup

Install requirements if necessary

# install required libraries if necessary (from within a notebook)
%pip install matplotlib pandas ffn

Import Libraries

import datetime
import random

import ffn
import matplotlib.pyplot as plt

import pandas as pd

%matplotlib inline

Collect Data

We will use one of my favourite libraries for financial analysis – the ffn library – to collect the data in one line of code.

tickers = [
    "AAPL",  # apple
    "DIS",  # disney
    "NKE",  # nike
    "TSLA",  # tesla
]

# get stock price data
prices = ffn.get(tickers, start="2018-01-01")

# convert data into 'long' table format for purposes of this exercise
prices = prices.melt(ignore_index=False, var_name="ticker", value_name="closing_price")

# reset index to make 'Date' a column
prices = prices.reset_index()

# display 5 example rows in the dataset
prices.sample(5).sort_index()
Example data

Save to SQLite Database

In order to demonstrate and compare the SQL syntax to Pandas we will save this data in a simple in-memory SQLite database. This allows us to use SQL syntax directly to read and transform the data.

import sqlite3

# create connection to in-memory sqlite db
# (the connection is kept open so that later queries can reuse it)
conn = sqlite3.connect(":memory:")

# save prices dataframe to sqlite db
prices.to_sql(name="prices", con=conn, index=False)
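As a quick sanity check, we can read the data back out of the database with pd.read_sql to confirm the table was created (the column alias n_rows is just illustrative):

# confirm the prices table was written to the database
pd.read_sql("SELECT COUNT(*) AS n_rows FROM prices", con=conn)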

Let’s take some examples using the stock prices dataset to compare the differences between window functions in SQL and the equivalent in Pandas.

Example 1: Calculating the maximum stock price for each company in the time period

SQL

We calculate the maximum stock price for each ticker by using the MAX function in conjunction with a window function:

  • the window function is initiated with the OVER keyword
  • we specify that the ticker column should be used to partition the data

This will return a table of the same length as the original, but with a new column called max_price that contains the maximum stock price of the relevant company during the time period the data covers.

ex1_sql_query = """
SELECT
    date(Date) as Date
    , ticker
    , closing_price
    , MAX(closing_price) OVER(PARTITION BY ticker) as max_price
FROM
    prices
"""
# use pandas read_sql to execute the query and return a dataframe
ex1_sql = pd.read_sql(ex1_sql_query, con=conn)
ex1_sql
Exercise 1 SQL

Pandas

We use the groupby function to partition the data by ticker and pass 'max' to the transform function to get a new column which shows the maximum share price for that ticker.

Note that as max is a simple aggregation function, we can pass it as a string to transform instead of writing a custom function (e.g. a lambda).

# copy dataframe to avoid overwriting original (optional)
ex1_pandas = prices.copy()

# add new column
ex1_pandas["max_price"] = ex1_pandas.groupby("ticker")["closing_price"].transform("max")

ex1_pandas
Exercise 1 Pandas
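For contrast, calling the aggregation directly on the groupby (without transform) collapses the result to a single value per ticker rather than a value per row:

# plain aggregation: one value per ticker, not one per row
prices.groupby("ticker")["closing_price"].max()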

Example 2: 28 day closing price moving average for each company

SQL

When calculating a moving average, the order of values within the group is important (they should be in chronological order), therefore we need to order the values within each group by the date.

Additionally, in SQL, to define the rolling window we specify that the average should be calculated using the preceding 27 rows and the current row (28 in total).

ex2_sql_query = """
SELECT
    date(Date) AS Date
    , ticker
    , closing_price
    , AVG(closing_price) OVER(
        PARTITION BY ticker
        ORDER BY date(Date)
        ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
    ) AS ma_28_day
FROM
    prices
"""
ex2_sql = pd.read_sql(ex2_sql_query, con=conn)
ex2_sql
Exercise 2 SQL

Pandas

To achieve the same in Pandas we can create a new column in the dataframe (‘ma_28_day’) using groupby and transform. We utilise Python’s lambda syntax to define which function should be applied to each group. In this case we want to calculate the average (mean) over a 28 row rolling window.

Remember to sort the Pandas dataframe before applying the window function when using calculations which are sensitive to the ordering of rows.

# copy original dataframe (optional)
ex2_pandas = prices.copy()

# add new column
ex2_pandas["ma_28_day"] = (
    ex2_pandas.sort_values("Date")
    .groupby("ticker")["closing_price"]
    .transform(lambda x: x.rolling(28, min_periods=1).mean())
)

ex2_pandas
Exercise 2 Pandas

28 day moving average for Tesla in 2021
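As an aside, the same rolling mean can also be computed without a lambda by chaining .rolling directly onto the groupby. A sketch of this alternative: the result has a MultiIndex (ticker plus the original row index), so we drop the ticker level to align it back with the original dataframe.

# alternative: groupby + rolling without a lambda
ma_28_day_alt = (
    ex2_pandas.sort_values("Date")
    .groupby("ticker")["closing_price"]
    .rolling(28, min_periods=1)
    .mean()
    .droplevel("ticker")  # drop the group level so the index aligns with the original rows
)
ex2_pandas["ma_28_day_alt"] = ma_28_day_alt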

Example 3: Get previous day’s closing share price for each ticker

SQL

ex3_sql_query = """
SELECT
    date(Date) AS Date
    , ticker
    , closing_price
    , LAG(closing_price, 1) OVER(
        PARTITION BY ticker
        ORDER BY date(Date)
    ) AS previous_close
FROM
    prices
"""
ex3_sql = pd.read_sql(ex3_sql_query, con=conn)
ex3_sql
Exercise 3 SQL

Pandas

In this example, we don’t need to use the transform function because shift naturally returns a value for each row in the data, rather than an aggregation.

ex3_pandas = prices.copy()

ex3_pandas["previous_close"] = (
    ex3_pandas.sort_values("Date").groupby("ticker")["closing_price"].shift(1)
)

ex3_pandas
Exercise 3 Pandas

Example 4: Daily Percentage Return

SQL

In order to calculate the daily return we need to calculate the previous close in a subquery (or CTE) and then use that value to calculate the percentage change, i.e. the daily return.

ex4_sql_query = """
SELECT
    Date
    , ticker
    , closing_price
    , closing_price/previous_close - 1 AS daily_return
FROM
    (
    SELECT
         date(Date) AS Date
        , ticker
        , closing_price
        , LAG(closing_price,1) OVER(
            PARTITION BY ticker ORDER BY date(Date)
        ) AS previous_close
    FROM
        prices
)
"""
ex4_sql = pd.read_sql(ex4_sql_query, con=conn)
ex4_sql
Exercise 4 SQL

Pandas

Here we can use the lambda function syntax to apply a more complex calculation to each group.

ex4_pandas = prices.copy()

ex4_pandas["daily_return"] = (
    ex4_pandas.sort_values("Date")
    .groupby("ticker")["closing_price"]
    .transform(lambda x: x / x.shift(1) - 1)
)

ex4_pandas
Exercise 4 Pandas
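As an aside, Pandas also has a built-in pct_change function which computes exactly this quantity. Like shift, it returns a value per row rather than an aggregation, so no transform is needed (a sketch):

# alternative: pct_change computes x / x.shift(1) - 1 directly
ex4_pandas["daily_return_alt"] = (
    ex4_pandas.sort_values("Date")
    .groupby("ticker")["closing_price"]
    .pct_change()
)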

Example 5: Missing Data Interpolation

Pandas

For this last example we will just use Pandas to demonstrate how to use window-type functions to interpolate missing data.

First we randomly remove some data to simulate missing data in a real world dataset.

We can then use interpolation or imputation functions to fill in the gaps within each group with a quantity that makes the most sense for the situation. For our use case we will use the ‘forward fill’ method, which imputes a missing value with the previous closing price.

# copy original dataframe
ex5_pandas = prices.copy()

# remove 30% of data randomly
pct_missing = 0.3
num_missing = int(pct_missing * len(ex5_pandas))
indexes = set(random.sample(range(len(ex5_pandas)), k=num_missing))
mask = [i in indexes for i in range(len(ex5_pandas))]

# mask the dataframe with some random NaNs
ex5_pandas["closing_price"] = ex5_pandas["closing_price"].mask(mask)
# impute missing data partitioned by ticker using a forward fill
ex5_pandas["closing_price_interpolated"] = (
    ex5_pandas.sort_values("Date")
    .groupby("ticker")["closing_price"]
    .transform(lambda x: x.ffill())
)

ex5_pandas
Exercise 5
# verify there is no missing data in the new column
ex5_pandas.isnull().sum()
Date                             0
ticker                           0
closing_price                 1209
closing_price_interpolated       0
dtype: int64
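One caveat worth noting: a forward fill cannot impute a value if the very first row(s) of a group are missing, because there is no earlier value to carry forward. If that happens in your data, one option is to chain a backward fill as well (a sketch):

# chain a backward fill to catch missing values at the start of each group
ex5_pandas["closing_price_interpolated"] = (
    ex5_pandas.sort_values("Date")
    .groupby("ticker")["closing_price"]
    .transform(lambda x: x.ffill().bfill())
)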

Conclusion

As a data scientist, it is important to be flexible enough to carry out data transformations in either SQL or Pandas when completing data analysis.

Sometimes it is easier in one language than another to express the data transformation you want, but it is important to understand how common data transformations, such as window functions, can be achieved in either language.

The transform function in Pandas can be used to achieve SQL-style window function transformations on Pandas dataframes and is a great addition to your toolbox.

Happy coding!

💻 Link to the accompanying notebook on Github

Further Reading