SQL – combining data (UNION, JOIN, etc.)

A beautiful sight

In this blog post, we make it through the 2 means of combining data: row-wise (UNION, INTERSECT, etc.) and column-wise (JOIN, etc.)

Table of content
Row-wise combination
\mapsto UNION [ALL]
\mapsto INTERSECT [ALL]
\mapsto EXCEPT [ALL]
Column-wise combination
\mapsto [INNER] JOIN
\mapsto LEFT JOIN
\mapsto RIGHT JOIN
\mapsto FULL [OUTER] JOIN
\mapsto NATURAL [INNER, LEFT, RIGHT] JOIN
\mapsto CROSS JOIN
\mapsto self-JOIN


For illustrative purposes, we will be using the Pagila database, which is a port of the Sakila database (for MySQL) to Postgresql – the SQL version of this blog post. Pagila can be downloaded from here.

Pagila Erd Simplified
Pagila’s simplified ERD

Row-wise combination

Row-wise combination means we combine by appending rows of 2 or more query results together.

For example, we have the first query returns a data with r_1 rows, c columns and the second query returns a data with r_2 rows, c columns. Then, after a row-wise combination, the combined result may have (r_1 + r2) rows and c columns.

Prerequisites for a row-wise combination of 2 queries are:

  • The 2 inputs of the combination must have the same number of columns.
  • The corresponding columns must have compatible data types.

UNION [ALL]

SQL's Union Venn-diagram
UNION

The UNION operator combines data row-wise by appending all the rows together, however, for duplicated rows, only one of them is kept. The UNION ALL also appends all the rows but does not eliminate duplicated rows.

UNION vs UNION ALL
UNION vs UNION ALL

Example 1:

Get all actors whose first name is CHRISTIAN or whose last name is AKROYD. Duplicate rows should be eliminated.

  (
SELECT actor_id, first_name, last_name
FROM actor
WHERE first_name = 'CHRISTIAN'
  )
UNION
  (
SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name = 'AKROYD'
  );
actor_idfirst_namelast_name
92KIRSTENAKROYD
58CHRISTIANAKROYD
182DEBBIEAKROYD
61CHRISTIANNEESON
10CHRISTIANGABLE

Example 2:

Get all actors whose first name is CHRISTIAN or whose last name is AKROYD. Duplicate rows are kept.

  (
SELECT actor_id, first_name, last_name
FROM actor
WHERE first_name = 'CHRISTIAN'
  )
UNION ALL
  (
SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name = 'AKROYD'
  );
actor_idfirst_namelast_name
10CHRISTIANGABLE
58CHRISTIANAKROYD
61CHRISTIANNEESON
58CHRISTIANAKROYD
92KIRSTENAKROYD
182DEBBIEAKROYD

INTERSECT [ALL]

SQL's Intersect Venn-diagram
INTERSECT

The INTERSECT operator returns rows that appear in both inputs.

Giving an ALL after INTERSECT makes it return duplicated rows. Note that for duplicated rows, the number of duplicates appears in the result equals the minimum number of duplicates in the 2 inputs. (Look at the figure below, pay attention to the number of red rows.)

SQL's INTERSECT and INTERSECT ALL
INTERSECT vs INTERSECT ALL

Example 1:

get the customer-ids that satisfy all the below conditions:

  • customer-id < 200
  • these customers have a rental at 2017-02-14 15:16:03
  • these customers have ever made a payment of more than 10$
  • order by the customer-ids

Duplicated rows must be eliminated.

  (
SELECT customer_id
FROM rental
WHERE rental_date='2017-02-14 15:16:03' AND customer_id < 200
ORDER BY customer_id
  )
INTERSECT
  (
SELECT customer_id
FROM payment
WHERE amount > 10 AND customer_id < 200
  )
ORDER BY customer_id;
customer_id
21
29
33
87
114
163
168
178

Example 2:

The conditions are the same as in example 1, however, duplicated rows are kept.

  (
SELECT customer_id
FROM rental
WHERE rental_date='2017-02-14 15:16:03' AND customer_id < 200
ORDER BY customer_id
  )
INTERSECT ALL
  (
SELECT customer_id
FROM payment
WHERE amount > 10 AND customer_id < 200
  )
ORDER BY customer_id;
customer_id
21
29
33
87
114
163
163
168
178

EXCEPT [ALL]

SQL's Except Venn-diagram
EXCEPT

A EXCEPT B returns the elements appear in A but not appear in B. The process is:

  • Firstly, it finds all the common rows appear in both A and B. For duplicated rows, the minimum number of times appear in the 2 inputs is recorded. In other words, find the intersection of A and B.
  • Secondly, it subtracts these rows from A and takes the remaining.
  • With ALL, it returns the result. Without ALL, it eliminates duplicated rows and returns.
SQL's EXCEPT and EXCEPT ALL
EXCEPT vs EXCEPT ALL

Example 1:

get the customer-ids that satisfy:

  • customer-id < 20
  • these customers have a rental at 2017-02-14 15:16:03
  • have never had a payment of more than 10$
  • order by the customer-ids

Duplicated rows must be eliminated.

  (
SELECT customer_id
FROM rental
WHERE rental_date='2017-02-14 15:16:03' AND customer_id < 20
ORDER BY customer_id
  )
EXCEPT
  (
SELECT customer_id
FROM payment
WHERE amount > 10
  )
ORDER BY customer_id;
customer_id
5
9
11
14
15

Example 2:

The conditions are the same as in example 1, however, duplicated rows are kept.

  (
SELECT customer_id
FROM rental
WHERE rental_date='2017-02-14 15:16:03' AND customer_id < 20
ORDER BY customer_id
  )
EXCEPT ALL
  (
SELECT customer_id
FROM payment
WHERE amount > 10
  )
ORDER BY customer_id;
customer_id
5
9
11
14
15
15

Column-wise combination

Column-wise combining means we append columns of data together. We also refer to column-wise combination as a JOIN.

For example, we have the first query returns a data with r_1 rows, c_1 columns and the second query returns a data with r_2 rows, c_2 columns. Then after joining, the result may have (c_1 + c_2) columns, while the number of rows depends on the type of JOIN we use.

Let’s say we join 2 tables. If we do not state which columns are used to match each row of a table to one or more rows of the other table, every row from one table will be matched with every row from the other. That is, if the first table has r_1 rows and the second table has r_2 rows, the result will have (r_1 * r_2) rows.

Normally, we do state the matching condition, this can be done using the clause ON.

SQL's Join
JOIN

[INNER] JOIN

As the default type of join operation is the inner join, we can use JOIN and INNER JOIN interchangeably.

INNER JOIN means the result will consist of rows that appear in both input tables. This is opposite to outer join, in which rows that appear in only 1 of the 2 input tables are also valid. Look at the examples and the subsequent JOIN operations will help you be clear about these differences.

SQL's Inner Join
[INNER] JOIN

Example 1:

Show id, first name, last name and postal code of all customers whose first name is JAMIE.

SELECT customer_id AS id
     , first_name, last_name, postal_code
FROM customer JOIN address 
  ON customer.address_id=address.address_id
WHERE first_name='JAMIE';
idfirst_namelast_namepostal_code
146JAMIERICE90732
531JAMIEWAUGH88408

LEFT JOIN

LEFT JOIN is a type of outer join, which means not only the rows that appear in both inputs are included in the result. Instead, the rows that appear in the left table but not appear in the right table are also accepted. In this case, the missing values are set to NULL.

SQL's Left Join
LEFT JOIN

Example 1:

Show the language id, name of the language and the titles of films that use each language. Remember to list all the languages, even those that have no corresponding films. Order the result by language id, decreasingly. Show the first 10 rows.

SELECT language.language_id, name, film.title
FROM language LEFT JOIN film 
  ON language.language_id=film.language_id
ORDER BY language.language_id DESC
LIMIT 10;
language_idnametitle
6German
5French
4Mandarin
3Japanese
2Italian
1EnglishACADEMY DINOSAUR
1EnglishACE GOLDFINGER
1EnglishADAPTATION HOLES
1EnglishAFFAIR PREJUDICE
1EnglishAFRICAN EGG

(The blank values in column TITLE mean NULLs)

RIGHT JOIN

Similar to LEFT JOIN, RIGHT JOIN is also a type of outer join, in which rows that appear in the right input but not in the left input are also included in the result.

SQL's Right Join
RIGHT JOIN

Example 1:

Show the language id, name of the language and the titles of films that use each language. Remember to list all the languages, even those that have no corresponding films. Order the result by language id, decreasingly. Show the first 10 rows. (The same as example 1 for LEFT JOIN above.)

SELECT language.language_id, name, film.title
FROM film RIGHT JOIN language 
  ON language.language_id=film.language_id
ORDER BY language.language_id DESC
LIMIT 10;
language_idnametitle
6German
5French
4Mandarin
3Japanese
2Italian
1EnglishACADEMY DINOSAUR
1EnglishACE GOLDFINGER
1EnglishADAPTATION HOLES
1EnglishAFFAIR PREJUDICE
1EnglishAFRICAN EGG

(The blank values in column TITLE mean NULLs)

FULL [OUTER] JOIN

FULL JOIN and FULL OUTER JOIN are interchangeable, they both include in the result all rows that appear in at least one of the 2 input tables.

SQL's Full Join
FULL JOIN

Example 1:

We are curious about if there is any email belonging to both a customer and a staff. Show customer id and staff id for each email in the database.

SELECT customer_id, staff_id
FROM customer FULL JOIN staff 
  ON customer.email=staff.email;
customer_idstaff_id
1
2
502
70
35
316
445
224
377
141

(The blank values mean NULLs. There are a total of 601 rows returned.)

NATURAL [INNER, LEFT, RIGHT] JOIN

NATURAL JOIN relieves us from having to tell exactly which columns should be matched when joining. Instead, it implicitly takes all the common column names as matching criteria.

We can also specify which type of join (inner or left or right) to be executed. By default, NATURAL JOIN assumes an inner join.

SQL's Natural Join
NATURAL JOIN

Example 1:

Get all information from a natural (inner) join of the customer and rental tables.

These 2 tables have a common column: customer_id. Normally, this query will work:

SELECT *
FROM customer NATURAL JOIN rental;
customer_idlast_updatestore_id

However, the result shows no rows at all. This is because, in this Pagila database, every table has a column named last_update. Thus, when NATURAL JOIN joins these 2 tables, it requires both customer_id and last_update be equal for 2 rows to be matched, and of course, no pair of rows satisfies this criterion.

Lesson learned: using NATURAL JOIN requires carefulness.

To solve this issue, we should exclude last_update from the 2 tables, so that rows will be matched using only the customer_id.

SELECT * 
FROM (
  (
SELECT customer_id, first_name, last_name
     , email, address_id, activebool
     , create_date, active
FROM customer
  ) a 
NATURAL JOIN
  (
SELECT rental_id, rental_date, inventory_id, customer_id
     , return_date, staff_id
FROM rental
  ) b
     );
customer_idfirst_namelast_name
459TOMMYCOLLAZO
408MANUELMURRELL
333ANDREWPURDY
222DELORESHANSEN
549NELSONCHRISTENSON
269CASSANDRAWALTERS
239MINNIEROMERO
126ELLENSIMPSON
399DANNYISOM
142APRILBURNS

CROSS JOIN

CROSS JOIN matches every row from one table to each row from the other, thus, if at the beginning, the tables have r_1 and r_2 rows, the result will contain (r_1 * r_2) rows in total.

Note that we do not specify any matching criteria for CROSS JOIN.

SQL's Cross Join
CROSS JOIN

Example 1:

Show each possible pair of manager id and staff id.

SELECT manager_staff_id, staff_id
FROM staff CROSS JOIN store;
manager_staff_idstaff_id
11
12
21
22

Example 2:

The requirement is the same as in Example 1, however, we will call the cross join in a different way.

SELECT manager_staff_id, staff_id
FROM staff, store;
manager_staff_idstaff_id
11
12
21
22

self-JOIN

Self-join is nothing more than joining a table with itself. The little difference between self-join and joining 2 different tables is that in a self-join, we have to make an alias for each instance of the joining table.

A self-join may be either [INNER] JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN or even CROSS JOIN.

Example 1:

Show all pairs of customers who have the same email address.

SELECT *
FROM customer c1 JOIN customer c2 ON c1.email=c2.email 
                                 AND c1.customer_id <> c2.customer_id; 
customer_idstore_idfirst_name

(The result indicates that there is no pair of customers who share the same email.)

References:

Leave a Reply