SQL – window functions

A beautiful sight

A window function performs the calculation of each row over a set of rows (a group / a partition / a window of rows) that they belong to.

For example, assume we have a table containing information about students in your school. The information stored includes major, class ID, gender, date of birth, score, etc. Now, we want to have the relative ranking of every student’s score in his/her class, what should we do?

This is a perfect chance for a window function to take action! What the window function does is: partition the students into class, then for each class, apply the RANK() function. Simple, isn’t it?

Apart from RANK(), there are also other window functions that are very useful. Let’s take a look at them together (, but later, after the following example).

The ROW_NUMBER() window function helps to assign a unique index for each row in a partition, follow a specified ordering scheme. The process is fully explained by the below visual.

SQL's Row Number
Window function ROW_NUMBER

List of window functions

ROW_NUMBER(): row-count within its partition (from 1 to n). Example.

RANK(): the rank of the current row in its partition, with gaps. E.g. RANK() on [-3, 2, 2, 6, 8, 8] gives [1, 2, 2, 4, 5, 5]. Example.

DENSE_RANK(): the rank of the current row in its partition, without gaps. E.g. RANK() on [-3, 2, 2, 6, 8, 8] gives [1, 2, 2, 3, 4, 4]. Example.

PERCENT_RANK(): equals to applying a RANK() and then normalize the values to the range [0, 1] (i.e. equal (rank – 1) / (number of rows in partition – 1)). Example.

CUME_DIST(): the cumulative distribution. It equals (the number of rows smaller than or equal to it, including itself) / (the total number of rows in its partition). Example.

NTILE(n_buckets): each partition is divided into a number of buckets, then all rows belonging to a bucket is marked by the index number of that bucket. If the number of rows is not divisible by the number of buckets, bucket sizes will have a difference of at most 1, while the bigger buckets precede the smaller ones. Example.

LAG(expression, offset, default value): return the expression evaluated at the row that is offset rows before the current row. If that row does not exist, the default value will be returned. If not specified, offset is 1 and the default value is NULL. Example.

LEAD(expression, offset, default value): return the expression evaluated at the row that is offset rows after the current row. If that row does not exist, the default value will be returned. If not specified, offset is 1 and the default value is NULL. Example.

FIRST_VALUE(expression): the expression evaluated at the first row of the partition (the window). Caution. Example.

LAST_VALUE(expression): the expression evaluated at the last row of the partition (the window). Caution. Example.

NTH_VALUE(express, n-th): the expression evaluated at the n-th row of the partition (the window). NULL is returned if no such row. Caution. Example.

Caution: FIRST_VALUE, LAST_VALUE, and NTH_VALUE, by default, only consider the window from the start of the partition to the last peer of the current row. This works well for FIRST_VALUE, however, it may induce some NULL for NTH_VALUE and some wrong values for LAST_VALUE. To solve the problem, add RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING into the OVER clause. See example and example.

Notes:

  • Window functions are featured by the clause OVER.
  • The PARTITION BY inside OVER is optional. If there is no PARTITION BY, the function is executed over the whole data (i.e. there is only 1 partition).

Aggregate functions as window functions

All the aggregate functions (SUM, COUNT, MIN, MAX, etc.) can be used as window functions as well. The keyword is OVER.

For example, we may use the ARRAY_AGG either by the syntax:

SELECT ARRAY_AGG(expression_1) OVER (PARTITION BY expression_2);

or:

SELECT ARRAY_AGG(expression_1) OVER (PARTITION BY expression_2 ORDER BY expression_3);

What is the difference? That is about the window. For the first syntax, the window for executing each row is the entire partition. For the second syntax, the window for executing each row includes rows from the start of the partition to the last peer of the current row.

Take a look at the 2 syntaxes in action:

SELECT inventory_id AS inventory
     , customer_id AS customer
     , EXTRACT(MONTH FROM rental_date) AS month
     , ARRAY_AGG(customer_id) OVER (PARTITION BY inventory_id) AS ARR
FROM rental
LIMIT 10;
inventorycustomermontharr
12798{279,518,431}
15188{279,518,431}
14317{279,518,431}
23598{359,581,161,170,411}
25817{359,581,161,170,411}
21617{359,581,161,170,411}
21706{359,581,161,170,411}
24115{359,581,161,170,411}
35418{541,39}
3397{541,39}
SELECT inventory_id AS inventory
     , customer_id AS customer
     , EXTRACT(MONTH FROM rental_date) AS month
     , ARRAY_AGG(customer_id) OVER (PARTITION BY inventory_id 
                                    ORDER BY EXTRACT(MONTH FROM rental_date)) 
                              AS ARR
FROM rental
LIMIT 10;
inventorycustomermontharr
14317{431}
15188{431,518,279}
12798{431,518,279}
24115{411}
21706{411,170}
25817{411,170,581,161}
21617{411,170,581,161}
23598{411,170,581,161,359}
3397{39}
35418{39,541}

Examples:

For illustrative purposes, we will be using the Pagila database, which is a port of the Sakila database (for MySQL) to Postgresql – the SQL version of this blog post. Pagila can be downloaded from here.

Pagila Erd Simplified
Pagila’s simplified ERD

Example ROW_NUMBER:

SELECT inventory_id
     , rental_date
     , ROW_NUMBER() OVER (PARTITION BY inventory_id ORDER BY rental_date)
FROM rental
LIMIT 10;
inventory_idrental_daterow_number
12005-07-08 19:03:151
12005-08-02 20:13:102
12005-08-21 21:27:433
22005-05-30 20:21:071
22005-06-17 20:24:002
22005-07-07 10:41:313
22005-07-30 22:02:344
22005-08-23 01:01:015
32005-07-31 21:36:071
32005-08-22 23:56:372

Example RANK:

SELECT staff_id
     , payment_date
     , RANK() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;
staff_idpayment_daterank
12017-01-24 21:21:561
12017-01-24 21:21:561
12017-01-24 21:33:073
12017-01-24 21:33:073
12017-01-24 21:33:475
12017-01-24 21:33:475
12017-01-24 21:36:337
12017-01-24 21:36:337
12017-01-24 22:29:069
12017-01-24 22:29:069

Example DENSE_RANK:

SELECT staff_id
     , payment_date
     , DENSE_RANK() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;
staff_idpayment_datedense_rank
12017-01-24 21:21:561
12017-01-24 21:21:561
12017-01-24 21:33:072
12017-01-24 21:33:072
12017-01-24 21:33:473
12017-01-24 21:33:473
12017-01-24 21:36:334
12017-01-24 21:36:334
12017-01-24 22:29:065
12017-01-24 22:29:065

Example PERCENT_RANK:

SELECT staff_id
     , payment_date
     , PERCENT_RANK() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;
staff_idpayment_datepercent_rank
12017-01-24 21:21:560
12017-01-24 21:21:560
12017-01-24 21:33:071.2491412154144027E-4
12017-01-24 21:33:071.2491412154144027E-4
12017-01-24 21:33:472.4982824308288055E-4
12017-01-24 21:33:472.4982824308288055E-4
12017-01-24 21:36:333.7474236462432077E-4
12017-01-24 21:36:333.7474236462432077E-4
12017-01-24 22:29:064.996564861657611E-4
12017-01-24 22:29:064.996564861657611E-4

Example CUME_DIST:

SELECT staff_id
     , payment_date
     , CUME_DIST() OVER (PARTITION BY staff_id ORDER BY payment_date)
FROM payment
LIMIT 10;
staff_idpayment_datecume_dist
12017-01-24 21:21:561.2490632025980515E-4
12017-01-24 21:21:561.2490632025980515E-4
12017-01-24 21:33:072.498126405196103E-4
12017-01-24 21:33:072.498126405196103E-4
12017-01-24 21:33:473.7471896077941546E-4
12017-01-24 21:33:473.7471896077941546E-4
12017-01-24 21:36:334.996252810392206E-4
12017-01-24 21:36:334.996252810392206E-4
12017-01-24 22:29:066.245316012990258E-4
12017-01-24 22:29:066.245316012990258E-4

Example NTILE:

SELECT inventory_id
     , rental_date
     , NTILE(2) OVER (PARTITION BY inventory_id ORDER BY rental_date)
FROM rental
LIMIT 10;
inventory_idrental_datentile
12005-07-08 19:03:151
12005-08-02 20:13:101
12005-08-21 21:27:432
22005-05-30 20:21:071
22005-06-17 20:24:001
22005-07-07 10:41:311
22005-07-30 22:02:342
22005-08-23 01:01:012
32005-07-31 21:36:071
32005-08-22 23:56:372

Example LAG:

SELECT inventory_id
     , rental_date
     , LAG(rental_date, 1) OVER (PARTITION BY inventory_id ORDER BY rental_date) AS previous_rent
FROM rental
LIMIT 10;
inventory_idrental_dateprevious_rent
12005-07-08 19:03:15
12005-08-02 20:13:102005-07-08 19:03:15
12005-08-21 21:27:432005-08-02 20:13:10
22005-05-30 20:21:07
22005-06-17 20:24:002005-05-30 20:21:07
22005-07-07 10:41:312005-06-17 20:24:00
22005-07-30 22:02:342005-07-07 10:41:31
22005-08-23 01:01:012005-07-30 22:02:34
32005-07-31 21:36:07
32005-08-22 23:56:372005-07-31 21:36:07

Example LEAD:

SELECT inventory_id
     , rental_date
     , LEAD(rental_date, 1) OVER (PARTITION BY inventory_id ORDER BY rental_date) AS next_rent
FROM rental
LIMIT 10;
inventory_idrental_datenext_rent
12005-07-08 19:03:152005-08-02 20:13:10
12005-08-02 20:13:102005-08-21 21:27:43
12005-08-21 21:27:43
22005-05-30 20:21:072005-06-17 20:24:00
22005-06-17 20:24:002005-07-07 10:41:31
22005-07-07 10:41:312005-07-30 22:02:34
22005-07-30 22:02:342005-08-23 01:01:01
22005-08-23 01:01:01
32005-07-31 21:36:072005-08-22 23:56:37
32005-08-22 23:56:37

Example FIRST_VALUE:

SELECT inventory_id
     , rental_date
     , FIRST_VALUE(rental_date) OVER (PARTITION BY inventory_id ORDER BY rental_date) AS first_rent
FROM rental
LIMIT 10;
inventory_idrental_datefirst_rent
12005-07-08 19:03:152005-07-08 19:03:15
12005-08-02 20:13:102005-07-08 19:03:15
12005-08-21 21:27:432005-07-08 19:03:15
22005-05-30 20:21:072005-05-30 20:21:07
22005-06-17 20:24:002005-05-30 20:21:07
22005-07-07 10:41:312005-05-30 20:21:07
22005-07-30 22:02:342005-05-30 20:21:07
22005-08-23 01:01:012005-05-30 20:21:07
32005-07-31 21:36:072005-07-31 21:36:07
32005-08-22 23:56:372005-07-31 21:36:07

Example LAST_VALUE:

SELECT inventory_id
     , rental_date
     , LAST_VALUE(rental_date) OVER (PARTITION BY inventory_id ORDER BY rental_date
                                     RANGE BETWEEN UNBOUNDED PRECEDING
                                               AND UNBOUNDED FOLLOWING)      
                               AS last_rent
FROM rental
LIMIT 10;
inventory_idrental_datelast_rent
12005-07-08 19:03:152005-08-21 21:27:43
12005-08-02 20:13:102005-08-21 21:27:43
12005-08-21 21:27:432005-08-21 21:27:43
22005-05-30 20:21:072005-08-23 01:01:01
22005-06-17 20:24:002005-08-23 01:01:01
22005-07-07 10:41:312005-08-23 01:01:01
22005-07-30 22:02:342005-08-23 01:01:01
22005-08-23 01:01:012005-08-23 01:01:01
32005-07-31 21:36:072005-08-22 23:56:37
32005-08-22 23:56:372005-08-22 23:56:37

Example NTH_VALUE:

SELECT inventory_id
     , rental_date
     , NTH_VALUE(rental_date, 2) OVER (PARTITION BY inventory_id ORDER BY rental_date
                                       RANGE BETWEEN UNBOUNDED PRECEDING
                                                 AND UNBOUNDED FOLLOWING) 
                                 AS second_rent
FROM rental
LIMIT 10;
inventory_idrental_datesecond_rent
12005-07-08 19:03:152005-08-02 20:13:10
12005-08-02 20:13:102005-08-02 20:13:10
12005-08-21 21:27:432005-08-02 20:13:10
22005-05-30 20:21:072005-06-17 20:24:00
22005-06-17 20:24:002005-06-17 20:24:00
22005-07-07 10:41:312005-06-17 20:24:00
22005-07-30 22:02:342005-06-17 20:24:00
22005-08-23 01:01:012005-06-17 20:24:00
32005-07-31 21:36:072005-08-22 23:56:37
32005-08-22 23:56:372005-08-22 23:56:37

References:

  • Postgresql’s functions window: link
  • Postgresqltutorial’s ntile: link

Leave a Reply