SQL – window functions

A window function performs the calculation of each row over a set of rows (a group / a partition / a window of rows) that they belong to.

For example, assume we have a table containing information about students in your school. The information stored includes major, class ID, gender, date of birth, score, etc. Now, we want to have the relative ranking of every student’s score in his/her class, what should we do?

This is a perfect chance for a window function to take action! What the window function does is: partition the students into class, then for each class, apply the RANK() function. Simple, isn’t it?

Apart from RANK(), there are also other window functions that are very useful. Let’s take a look at them together (, but later, after the following example).

The ROW_NUMBER() window function helps to assign a unique index for each row in a partition, follow a specified ordering scheme. The process is fully explained by the below visual.

List of window functions

ROW_NUMBER(): row-count within its partition (from 1 to n). Example.

RANK(): the rank of the current row in its partition, with gaps. E.g. RANK() on [-3, 2, 2, 6, 8, 8] gives [1, 2, 2, 4, 5, 5]. Example.

DENSE_RANK(): the rank of the current row in its partition, without gaps. E.g. RANK() on [-3, 2, 2, 6, 8, 8] gives [1, 2, 2, 3, 4, 4]. Example.

PERCENT_RANK(): equals to applying a RANK() and then normalize the values to the range [0, 1] (i.e. equal (rank – 1) / (number of rows in partition – 1)). Example.

CUME_DIST(): the cumulative distribution. It equals (the number of rows smaller than or equal to it, including itself) / (the total number of rows in its partition). Example.

NTILE(n_buckets): each partition is divided into a number of buckets, then all rows belonging to a bucket is marked by the index number of that bucket. If the number of rows is not divisible by the number of buckets, bucket sizes will have a difference of at most 1, while the bigger buckets precede the smaller ones. Example.

LAG(expression, offset, default value): return the expression evaluated at the row that is offset rows before the current row. If that row does not exist, the default value will be returned. If not specified, offset is 1 and the default value is NULL. Example.

LEAD(expression, offset, default value): return the expression evaluated at the row that is offset rows after the current row. If that row does not exist, the default value will be returned. If not specified, offset is 1 and the default value is NULL. Example.

FIRST_VALUE(expression): the expression evaluated at the first row of the partition (the window). Caution. Example.

LAST_VALUE(expression): the expression evaluated at the last row of the partition (the window). Caution. Example.

NTH_VALUE(express, n-th): the expression evaluated at the n-th row of the partition (the window). NULL is returned if no such row. Caution. Example.

Caution: FIRST_VALUE, LAST_VALUE, and NTH_VALUE, by default, only consider the window from the start of the partition to the last peer of the current row. This works well for FIRST_VALUE, however, it may induce some NULL for NTH_VALUE and some wrong values for LAST_VALUE. To solve the problem, add RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING into the OVER clause. See example and example.

Notes:

• Window functions are featured by the clause OVER.
• The PARTITION BY inside OVER is optional. If there is no PARTITION BY, the function is executed over the whole data (i.e. there is only 1 partition).

Aggregate functions as window functions

All the aggregate functions (SUM, COUNT, MIN, MAX, etc.) can be used as window functions as well. The keyword is `OVER`.

For example, we may use the ARRAY_AGG either by the syntax:

`SELECT ARRAY_AGG(expression_1) OVER (PARTITION BY expression_2);`

or:

`SELECT ARRAY_AGG(expression_1) OVER (PARTITION BY expression_2 ORDER BY expression_3);`

What is the difference? That is about the window. For the first syntax, the window for executing each row is the entire partition. For the second syntax, the window for executing each row includes rows from the start of the partition to the last peer of the current row.

Take a look at the 2 syntaxes in action:

```SELECT inventory_id AS inventory
, customer_id AS customer
, EXTRACT(MONTH FROM rental_date) AS month
, ARRAY_AGG(customer_id) OVER (PARTITION BY inventory_id) AS ARR
FROM rental
LIMIT 10;```
```SELECT inventory_id AS inventory
, customer_id AS customer
, EXTRACT(MONTH FROM rental_date) AS month
, ARRAY_AGG(customer_id) OVER (PARTITION BY inventory_id
ORDER BY EXTRACT(MONTH FROM rental_date))
AS ARR
FROM rental
LIMIT 10;```

Examples:

For illustrative purposes, we will be using the Pagila database, which is a port of the Sakila database (for MySQL) to Postgresql – the SQL version of this blog post. Pagila can be downloaded from here.

Example ROW_NUMBER:

```SELECT inventory_id
, rental_date
, ROW_NUMBER() OVER (PARTITION BY inventory_id ORDER BY rental_date)
FROM rental
LIMIT 10;```

Example RANK:

```SELECT staff_id
, payment_date
, RANK() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;```

Example DENSE_RANK:

```SELECT staff_id
, payment_date
, DENSE_RANK() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;```

Example PERCENT_RANK:

```SELECT staff_id
, payment_date
, PERCENT_RANK() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;```

Example CUME_DIST:

```SELECT staff_id
, payment_date
, CUME_DIST() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;```

Example NTILE:

```SELECT inventory_id
, rental_date
, NTILE(2) OVER (PARTITION BY inventory_id ORDER BY rental_date)
FROM rental
LIMIT 10;```

Example LAG:

```SELECT inventory_id
, rental_date
, LAG(rental_date, 1) OVER (PARTITION BY inventory_id ORDER BY rental_date) AS previous_rent
FROM rental
LIMIT 10;```

```SELECT inventory_id
, rental_date
, LEAD(rental_date, 1) OVER (PARTITION BY inventory_id ORDER BY rental_date) AS next_rent
FROM rental
LIMIT 10;```

Example FIRST_VALUE:

```SELECT inventory_id
, rental_date
, FIRST_VALUE(rental_date) OVER (PARTITION BY inventory_id ORDER BY rental_date) AS first_rent
FROM rental
LIMIT 10;```

Example LAST_VALUE:

```SELECT inventory_id
, rental_date
, LAST_VALUE(rental_date) OVER (PARTITION BY inventory_id ORDER BY rental_date
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS last_rent
FROM rental
LIMIT 10;```

Example NTH_VALUE:

```SELECT inventory_id
, rental_date
, NTH_VALUE(rental_date, 2) OVER (PARTITION BY inventory_id ORDER BY rental_date
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS second_rent
FROM rental
LIMIT 10;```

References: