Relational Algebra

Goals

Concepts

Language

SQL

Lesson

You've created a database schema and you've populated your database with data. Now you will need to get data back out. But instead of retrieving the data in the form you put it in, you will need to query the database to answer questions about data relationships.

The data statements or Data Manipulation Language (DML) part of SQL allow it to function as a query language. You will use SQL to query the database just as you use it to modify the database. For an SQL query to work, it conceptually performs certain operations called the relational algebra. As with operations on numbers with numeric algebra, for example 7 - 5, relational algebra has various operators that are performed on the relations themselves to produce some result. (As you'll see below, relational algebra has a subtraction or “difference” operation, too.)

You learned that SQL is declarative, indicating what result should be obtained, not how to obtain it. Yet the relational algebra is a procedural sequence of operations! The database is not guaranteed to perform relational algebra in any particular sequence, but learning the relational algebra will help you understand what conceptual steps the database engine may take to produce the result of the query you specify in SQL.

You can think of each of the relational algebra operations as a function that is performed on a one or more relations. If a relational operation acts on a single relation, it is referred to as unary operation and is like a function with a single parameter. Likewise an operation taking two relations as inputs is called a binary operation. Regardless of the number of inputs, the output of a relational algebra operations is always a relation. Thus you can “chain” operations into entire expressions, just as you do with numeric algebra, as in 10 + 7 - 5.

Π: project()

Projecting the name column in the Supplier relation.
Supplier
id name address
1 Acme Furniture Company 123 Some Street
2 Bozo Toy Company 321 Other Street
3 Top Toys Corporation 456 Far Avenue

The project() operation returns a “vertical” subset of a relation by including only certain attributes. The Supplier relation shown has three attributes: id, name, and address. Invoking the project(Supplier, name) operation would return a relation consisting of all tuples, but only including the indicated column.

SELECT

In SQL the part of the SELECT statement that indicates a projection is the clause immediately after the SELECT statement itself. You already know that to list the contents of the Supplier table, you can issue the statement SELECT * from Supplier. To indicate you want to project one or more columns, you can simply list those columns, separated by commas, in place of the wildcard * character.

Projecting the name column from the Supplier table using SQL.
SELECT name FROM Supplier;
name
Acme Furniture Company
Bozo Toy Company
Top Toys Corporation

DISTINCT

SQL as you know works in terms of tables, which have some important differences from relations. In the relational algebra, all operations produce relations, which means that results can never have duplicate tuples. But in SQL, results are tables which may have duplicate rows! For example, if you issue SELECT address FROM Supplier, if two suppliers have the same address, the resulting table will contain two rows with the same value. You can ask SQL to remove any duplicate rows by adding DISTINCT to your statement, as in SELECT DISTINCT address FROM Supplier.

Of course SQL SELECT doesn't always return duplicates. Remember that the Supplier table has a unique constraint for its name column, as it was defined using name VARCHAR(9999) NOT NULL UNIQUE. Therefore the result of SELECT name from Supplier can never have duplicates; adding DISTINCT would in theory make no difference at all in the result.

ρ: rename()

Seeing that the result is a different table than the one being queried, you are free to specify a different name for the resulting column. SQL allows you to do this using AS column where column is the new name of the column. You can pick and choose; you don't have to rename all the columns in your query. The keyword AS is technically optional, but still a good idea to include it to make your queries clear!

Projecting and renaming the name column from the Supplier table using SQL.
SELECT name AS who, address AS where FROM Supplier;
who where
Acme Furniture Company 123 Some Street
Bozo Toy Company 321 Other Street
Top Toys Corporation 456 Far Avenue

σ: select()

Restricting the Item relation by stock < 15.
Item
code name stock supplierId
A1 table 20 1
B2 game table 12 1
C3 chair 15 1
D4 balloon 99 2
E5 kite 5 2
F6 doll 10 2

The relational select() operation is a way to filter the tuples of some relation based upon some criteria. While the project() operation returns a “vertical” subset of a relation, the select() operation returns a “horizontal” subset by “selectively” including or excluding each tuple.

The criteria for including a tuple is a Boolean expression called a predicate. You've seen this word “predicate” several times, when providing a lambda for filtering Java streams, for example. Only if the predicate evaluates to true for a tuple, is it included in the resulting relation. For example, performing select() on the Item relation with the predicate stock < 15 will only return three tuples, those with stock of 12, 5, and 10.

WHERE

To indicate that you want to be “selective”, the SQL SELECT statement allows an optional WHERE clause that contains the predicate for a select() operation. The following statement uses projection to choose the name and stock columns, and uses selection to include only certain rows.

Selecting rows from the Item table using SQL.
SELECT name, stock FROM Item WHERE stock < 15;
name stock
game table 12
kite 5
doll 10

Predicates

Operators in SQL and Java.
SQL Java Description
= == equal
<> != not equal
< < less than
<= <= less than or equal
> > greater than
>= >= greater than or equal
AND && logical AND
OR || logical OR
NOT ! logical NOT

The relational algebra select() operation acts as a filter, and its predicate is some expression that yields a Boolean result. The predicate of the SELECT … WHERE clause must follow the rules SQL has for expressions. SQL supports the most common equality operators, as shown in the table. SQL also supports three logical operators: AND, OR, and NOT. You can group your predicate expressions using parenthesis to ensure order of precedence, as you can in Java.

Although the expression in the WHERE clause produces a Boolean value, it is important to keep in mind that this an SQL Boolean. You may remember that the SQL Boolean type has three possible values: TRUE, FALSE, and UNKNOWN. If SQL encounters any value that is UNKNOWN, the result of the SQL expression will be neither TRUE nor FALSE, but UNKNOWN!

NULL

The most common situation that produces UNKNOWN is a comparison with NULL. This course has stressed that NULL can be used for various reasons: it might mean “absent”, it might mean “not applicable”, or it might mean “not yet known”. SQL takes the view that NULL indicates that a value is not currently known; by that logic, the outcome of any comparison to NULL will not be known either. Therefore if you compare any value to NULL (even NULL itself!) the result will be UNKNOWN!

Item table with a color column allowing NULL.
Item
code name color stock supplierId
A1 table black 20 1
B2 game table yellow 12 1
C3 chair blue 15 1
D4 balloon NULL 99 2
E5 kite red 5 2
F6 doll NULL 10 2

Let us say for example that the Item table has a color column that allows NULL. Balloons come in a big bag of assorted colors; it cannot be known ahead of time which color a balloon will be, so the value in the  color column for “balloon” is NULL.

Suppose you wanted to query for the names of all items that do not have a color specified. If you were to use the command SELECT name FROM Item WHERE color = NULL, no results would be returned. This is because NULL = NULL does not evaluate to TRUE.

But even stranger, if you wanted to find all items that do have a color by issuing the command SELECT name FROM Item WHERE color <> NULL, still no results would be returned! It doesn't matter what the value is—even NULL itself—the value will not be considered equal to NULL. It doesn't matter what color you compare with NULL; the result will be UNKNOWN.

Lastly even if the query itself does not mention NULL, the result can be unexpected if the predicate column contains NULL. To find all items that are not blue, if you were to issue the command SELECT name FROM Item WHERE color <> 'blue', the balloon will still not be included in the result! The balloon's recorded color is NULL, and NULL <> 'blue' does not evaluate to TRUE. In SQL's reasoning it is unknown what color the balloon is, so it is therefore unknown whether the balloon is of the color you ask for.

IS [NOT] NULL

If you want to include a check for NULL in an expression, you must use the IS NULL predicate, along with its negative predicate, IS NOT NULL. Here are the correct SQL commands for the queries discussed above.

The names of all items that do not specify a color.
SELECT name FROM Item WHERE color IS NULL;
The names of all items that specify a color.
SELECT name FROM Item WHERE color IS NOT NULL;
The names of all items that are not blue.
SELECT name FROM Item WHERE color <> 'blue' OR IS NULL;
IS [NOT] DISTINCT FROM

Because NULL is “not equal to itself” but still “not distinct from itself”, SQL:1999 together with SQL:2003 added added an IS [NOT] DISTINCT FROM predicate. This allows more natural comparisons with NULL.

The names of all items that are not blue.
SELECT name FROM Item WHERE color IS DISTINCT FROM 'blue';
NOT

TODO; mention IS NOT NULL above

IN

When you have multiple potential values to check for, it becomes cumbersome to use a series of expressions combined by OR. For example, if you wanted to find all the items that are either red, green, or blue, you could add the clause WHERE color = 'red' OR color = 'green' OR color = 'blue'. SQL offers a predicate named IN that makes it easier to check of a value is one of several values without providing an explicit comparison for each one.

The predicate IN evaluates to TRUE of the given column or value is found in a set of values. You provide a comma-separate list of values within parentheses after IN as shown the example below, using the expanded Item table above. In a future lesson, you'll learn how to use IN with values retrieved from a completely different query.

Selecting rows from the Item table that match any of several colors.
SELECT name, color FROM Item WHERE color IN ('red', 'green', 'blue');
name color
chair blue
kite red
[NOT] BETWEEN

TODO

∪: Union

The union() operation comes straight out of set theory, which is appropriate because relations are sets of tuples. As with sets, union() produces a new relation with all the tuples from two input relations. As the result is itself a relation, it will have no duplicate tuples, even if the same tuple existed in both of the original relations. In terms of Java, this is essentially the same as making a copy of a Set<Tuple> and adding another to it using new HashSet<Tuple>(set1).add(set2).

UNION

The corresponding SQL operator is appropriately named UNION, and is placed between two result table to create a new result table. Unfortunately you cannot simply place the keywords UNION between two table names. You must actually perform two queries using SELECT and place the UNION keyword between them. For example you could query all items with stock under 10, separately query all items with stock over 20, and then combine the results into one table.

Producing the union of two result tables in SQL.
SELECT name, stock FROM Item WHERE stock < 10
UNION
SELECT name, stock FROM Item WHERE stock > 20;
name stock
balloon 99
kite 5

−: Difference

The difference() operation is similar to the subtraction operation in elementary algebra, and it even uses the minus sign as its symbol. The difference of two relations is the tuples of the first relation, minus any of those tuples that also appear in the second relation. In terms of Java, this is essentially the same as making a copy of Set<Tuple> and removing everything that appears in a separate set using new HashSet<Tuple>(set1).removeAll(set2).

EXCEPT

SQL indicates the difference() operation using the EXCEPT operator, and the order appears as if you were writing a subtraction operation in elementary algebra. As with UNION, you perform the operation on the result of two queries, not directly on two tables.

Finding the difference of two result tables in SQL.
SELECT name, stock FROM Item WHERE stock < 20
EXCEPT
SELECT name, stock FROM Item WHERE stock < 10;
name stock
game table 12
chair 15
doll 10

∩: Intersection

Just as AND is the complement to OR in logic, intersection() is the complement to union() in the relational algebra. While union() produces a relation with tuples appearing in either of the original relations, the intersection() operation produces a relation only with tuples that appear in both of the original relations. In terms of Java, this is essentially the same as making a copy of a Set<Tuple> and removing everything except the content that appears in another set using new HashSet<Tuple>(set1).retainAll(set2).

INTERSECT

SQL provides the INTERSECT operator to indicate the intersection() operation. The form parallels that of UNION.

Producing the intersection of two result tables in SQL.
SELECT name, stock FROM Item WHERE stock > 10
INTERSECT
SELECT name, stock FROM Item WHERE stock < 20;
name stock
game table 12
chair 15

×: Cartesian Product

Another relational operation that comes directly from set theory is the Cartesian product, or simply “product”, indicated by the × symbol of two crossing bars. You should already be familiar with this word and symbol from multiplication in elementary algebra. Multiplying two numbers, for example 2 × 3, produces the product 6.

The notions of multiplication and Cartesian product are intimately linked. Imagine one set containing two objects {A, B}, along with a set containing three objects {X, Y, Z}. To produce Cartesian product of these two sets, take each object from the first set and pair it with each object in the second set: {AX, AY, AZ, BX, BY, BZ}. You'll notice that the number of combinations or cardinality of the resulting set, 6, is the same as the product of the cardinality of each of the original sets 2 × 3.

Remember that a relation is a set of tuples. The product() relational operation produces a new relation by paring each of the tuples in the first relation with each of the tuples in the second relation. Remembering that the heading of a relation is a set of attributes, the heading of the resulting relation will be the union of the attributes of the original relations.

Consider an Item relation with the heading {itemName, stock} and the values {{"table", 20}, {"kite", 5}}; and a Supplier relation with the heading {supplierName, address} and the values {{"Acme", "Some"}, {"Bozo", "Other"}}. The product(Item, Supplier) (or in symbolic notation Item × Supplier) would produce a relation with the heading {itemName, stock, supplierName, address}. It would contain the tuples {{"table", 20, "Acme", "Some"}, {"table", 20, "Bozo", "Other"}, {"kite", 5, "Acme", "Some"}, {"kite", 5, "Bozo", "Other"}}.

FROM

Abridged Item and Supplier tables.
Item
code name stock supplierId
A1 table 20 1
C3 chair 15 1
E5 kite 5 2
Supplier
id name address
1 Acme Furniture Company 123 Some Street
2 Bozo Toy Company 321 Other Street
3 Top Toys Corporation 456 Far Avenue

SQL produces the Cartesian product of two tables automatically if more than one table appears in the FROM clause. Consider an abridged version of the Item and Supplier tables from previous lessons. You can produce the Cartesian product of these two tables by using FROM Item, Supplier as shown below.

Each row in the Item table, which has three rows, will be paired with each row in the Supplier table, which also has three rows. The resulting table will therefore have 3 × 3 = 9 rows. The number of columns will be the combined number of columns in the original tables: 4 + 3 = 7.

Producing the Cartesian product of the Item and Supplier tables using SQL.
SELECT * FROM Item, Supplier;
code name stock supplierId id name address
A1 table 20 1 1 Acme Furniture Company 123 Some Street
A1 table 20 1 2 Bozo Toy Company 321 Other Street
A1 table 20 1 3 Top Toys Corporation 456 Far Avenue
C3 chair 15 1 1 Acme Furniture Company 123 Some Street
C3 chair 15 1 2 Bozo Toy Company 321 Other Street
C3 chair 15 1 3 Top Toys Corporation 456 Far Avenue
E5 kite 5 2 1 Acme Furniture Company 123 Some Street
E5 kite 5 2 2 Bozo Toy Company 321 Other Street
E5 kite 5 2 3 Top Toys Corporation 456 Far Avenue

The two tables in this example already have a foreign key / primary key relationship, and so a Cartesian product of the two tables make little sense, because the resulting table contains every possible combination of Item and Supplier. The “chair” row in the original Item table has a supplierId of 1, which indicates Acme Furniture Company. The table the Cartesian product produces indeed contains a row pairing “chair” with Acme Furniture company, but it also has a row pairing “chair” with Bozo Toy Company, and another one pairing “chair” with Top Toys Company.

Restricting the Cartesian product of the Item and Supplier tables using SQL.
SELECT * FROM Item, Supplier
  WHERE Item.supplierId = Supplier.id;
code name stock supplierId id name address
A1 table 20 1 1 Acme Furniture Company 123 Some Street
C3 chair 15 1 1 Acme Furniture Company 123 Some Street
E5 kite 5 2 2 Bozo Toy Company 321 Other Street

It would be more useful in this case if you could filter the results to return only those pairs that belong together. That is easily done using relational select() operation with a predicate ensuring that the foreign key of one table matches the primary key of the other table. In SQL add a predicate using WHERE Item.supplierId = Supplier.id as shown.

Joins

The sequence of relational operations that is select(product()) is therefore a way to pair tuples from two relations based upon a foreign key link between them. This functionality is so essential to the relational model that the relational algebra has a special operation for it: the join() operation.

To perform a join in SQL, place the keyword JOIN between two table names instead of using a comma. Then after the listing the two tables, use the keyword ON to introduce the predicate.

Performing a join using SQL.
SELECT * FROM Item, Supplier
  WHERE Item.supplierId = Supplier.id;
SELECT * FROM Item JOIN Supplier
  ON Item.supplierId = Supplier.id;
code name stock supplierId id name address
A1 table 20 1 1 Acme Furniture Company 123 Some Street
C3 chair 15 1 1 Acme Furniture Company 123 Some Street
E5 kite 5 2 2 Bozo Toy Company 321 Other Street

Although JOIN includes a join predicate using ON you can still provide additional filter predicates unrelated to the join itself. You could for example query for all items, with their suppliers, that have a stock of more than 10 items.

Performing a JOIN in SQL with additional, non-join predicate.
SELECT Item.name as item, stock, Supplier.name as supplier
  FROM Item JOIN Supplier ON Item.supplierId = Supplier.id
  WHERE stock > 10;
item stock name
table 20 Acme Furniture Company
chair 15 Acme Furniture Company

Inner Join ⋈

An inner join is the type of join you just learned, equivalent to filtering the rows produced by cross join by using some predicate expression. This is the default type of join, represented in the relational algebra by the “bowtie” ⋈ symbol, and the type of join SQL performs if you only indicate the keyword JOIN in the query. SQL also allows you to explicitly indicate INNER JOIN, which means the same thing as JOIN by itself.

Performing an inner join using SQL.
SELECT * FROM Item INNER JOIN Supplier
  ON Item.supplierId = Supplier.id;
code name stock supplierId id name address
A1 table 20 1 1 Acme Furniture Company 123 Some Street
C3 chair 15 1 1 Acme Furniture Company 123 Some Street
E5 kite 5 2 2 Bozo Toy Company 321 Other Street

Outer Joins: Left ⟕, Right, ⟖ and Full ⟗

You might have noticed that the supplier Top Toys Company does not appear in the result table from the above inner join. This is because no a single item specified Top Toys Corporation as its supplier. An inner join will discard any rows from either table that do not match the join predicate, as you might expect from a selection on a Cartesian product.

Performing an outer join using SQL.
SELECT * FROM Item RIGHT OUTER JOIN Supplier
  ON Item.supplierId = Supplier.id;
code name stock supplierId id name address
A1 table 20 1 1 Acme Furniture Company 123 Some Street
C3 chair 15 1 1 Acme Furniture Company 123 Some Street
E5 kite 5 2 2 Bozo Toy Company 321 Other Street
NULL NULL NULL NULL 3 Top Toys Corporation 456 Far Avenue

An outer join will add back the remaining rows on one or both side of the query that did not match the join predicate. In SQL specifying LEFT OUTER JOIN will add back the unmatched rows from the table on the left side of the join, while indicating RIGHT OUTER JOIN will add back unmatched rows from the table on the right side. SQL also allows FULL OUTER JOIN, which brings back unmatched rows from each side—effectively the union of LEFT OUTER JOIN and RIGHT OUTER JOIN.

Because by definition the rows from the table on one side of the query did not match rows in the other side, the joined row will contain NULL for the values representing the unmatched side. In the example shown, a right outer join was used to include Top Toys Corporation even though no items are associated with that supplier. Because there is was no supplier match, all supplier values for that row are set to NULL.

TODO full outer join

Cross Join ×

The cross join is really just another word for the Cartesian product relational algebra operation, indicated appropriately by the crossing bars × symbol. Although you can perform a Cartesian product merely by listing multiple tables, it is better to use the newer join syntax, which explicitly indicates the type of join being performed. To indicate a Cartesian product using the join syntax, place the keywords CROSS JOIN between the table names.

Performing an SQL cross join of the Item and Supplier tables.
SELECT * FROM Item, Supplier;
SELECT * FROM Item CROSS JOIN Supplier;
code name stock supplierId id name address
A1 table 20 1 1 Acme Furniture Company 123 Some Street
A1 table 20 1 2 Bozo Toy Company 321 Other Street
A1 table 20 1 3 Top Toys Corporation 456 Far Avenue
C3 chair 15 1 1 Acme Furniture Company 123 Some Street
C3 chair 15 1 2 Bozo Toy Company 321 Other Street
C3 chair 15 1 3 Top Toys Corporation 456 Far Avenue
E5 kite 5 2 1 Acme Furniture Company 123 Some Street
E5 kite 5 2 2 Bozo Toy Company 321 Other Street
E5 kite 5 2 3 Top Toys Corporation 456 Far Avenue

TODO mention self-joins in a note

Review

Gotchas

In the Real World

Think About It

Self Evaluation

Task

Create the following queries and store them each in an .sql file in the data/ directory.

See Also

References

Resources

Acknowledgments