3.2. Relational algebra¶
In the last chapter, we introduced the relational model of the database, and defined the fundamental mathematical object in the model, the relation. In this chapter, we discuss relational algebra, which is the set of algebraic operations that can be performed on relations. Relational algebra can be viewed as one mechanism for expressing queries on data stored in relations, and an understanding of relational algebra is important in understanding how relational databases represent and optimize queries. We will cover only basic relational algebra, excluding later extensions such as those for group and aggregate operations and those for outer joins.
A related topic, which we do not cover in this book, is relational calculus. Relational calculus provides another mathematical expression of queries on relations, and is equivalent in expressiveness to relational algebra.
3.2.1. Unary operations¶
The unary operations in relational algebra act on one relation and result in another relation. The unary operations are selection, projection, and renaming, and their associated operators are typically written as the Greek letters which match the starting letters of the operation:
\(\sigma\) (sigma): selection
\(\pi\) (pi): projection
\(\rho\) (rho): renaming
We will explore each of these unary operators in application to the relation books shown below:
book_id 
author_id 
title 
year 

1 
3 
The House of the Spirits 
1982 
2 
1 
Invisible Man 
1952 
3 
6 
The Hobbit 
1937 
4 
2 
Unaccustomed Earth 
2008 
5 
6 
The Fellowship of the Ring 
1954 
6 
4 
House Made of Dawn 
1968 
7 
5 
A Wizard of Earthsea 
1968 
The books relation has primary key book_id, while author_id is a foreign key to another table we will use later in this chapter.
3.2.1.1. Selection¶
Selection applies a Boolean condition to the tuples in a relation. The result of a selection operation is a relation containing exactly those tuples for which the selection condition is true. For example, if we are interested in books published after 1960, we can write the selection operation to retrieve just those books as:
The operator is written with the Boolean condition as a subscript, and then the operand (the input relation) is given in parentheses. Note that the Boolean condition refers to an attribute of the books relation, comparing it to a constant value. The result of this operation is a relation with the same schema as books, but with no name:
book_id 
author_id 
title 
year 

1 
3 
The House of the Spirits 
1982 
4 
2 
Unaccustomed Earth 
2008 
6 
4 
House Made of Dawn 
1968 
7 
5 
A Wizard of Earthsea 
1968 
Simple Boolean expressions in the relational algebra usually involve comparisons of an attribute with a constant, using any comparison operator. More complex Boolean expressions can be constructed from simple expressions using AND, OR, and NOT. For instance, if we are interested in books published after 1960 as well as books by the author with author_id equal to 6, we could write:
Selection can result in a relation that has all of the tuples from the original (a relation equivalent to the original), some of the tuples from the original, or no tuples at all (an empty set). In the case of an empty relation, we still consider the relation to have the same schema as the original relation.
Since the result of a selection is a relation, we can apply another selection to the result. For example, we could find the books published after 1950, and then select from that result the books with author_id equal to 6:
This would give us a result with one tuple:
book_id 
author_id 
title 
year 

5 
6 
The Fellowship of the Ring 
1954 
This composition of selection operations is equivalent to a single selection operation using a conjunction (AND) of the selection conditions:
3.2.1.2. Projection¶
The projection operation creates a new relation which has a subset of the attributes of the input relation. We could use projection, for example, to get a set of tuples expressing just the title and publication year of our books. We write the projection operator with the list of attribute names in the subscript, followed by the operand in parentheses:
This result contains a tuple for each tuple in books, but the tuples in the result only have the attributes specified by the projection operation, thus the result relation has a different schema from the original:
title 
year 

The House of the Spirits 
1982 
Invisible Man 
1952 
The Hobbit 
1937 
Unaccustomed Earth 
2008 
The Fellowship of the Ring 
1954 
House Made of Dawn 
1968 
A Wizard of Earthsea 
1968 
At first glance, it might seem the result of a projection will always have the same number of tuples as the input relation, but this is not the case. Consider what happens if we project books onto the single attribute year. There are two tuples in books with the same year value of 1968. Since relations cannot contain duplicates, the result of our projection operation can contain only one tuple with year equal to 1968. Thus, the result has fewer tuples than the input relation:
year 

1982 
1952 
1937 
2008 
1954 
1968 
Since the result of projection is a relation, we can apply selection to the result:
Note the order of operations here: first, we supply books as an input to the projection operation; second, the result of the projection is given as the input to the selection operation.
Similarly, since the result of a selection is a relation, we can apply projection after selection. The above expression is equivalent to:
The result in both cases is:
title 
year 

House Made of Dawn 
1968 
A Wizard of Earthsea 
1968 
It is important to note, however, that you cannot always change the order of projection and selection for an equivalent result. Consider the following expressions:
In the first expression, we select the books which were published in 1968, and then project the resulting tuples onto the title attribute. This result is:
title 

House Made of Dawn 
A Wizard of Earthsea 
However, the second expression is not a correct expression. The projection occurs first, yielding a relation with just one attribute named title. The following selection is then incorrect, because it makes reference to an attribute, year, which does not exist in the input relation.
Projection can also be applied to the result of another projection; however, the result is equivalent to just performing the second projection. Compare:
Note that we cannot change the order of the two projection operations in the first expression above, as the expression would then be incorrect.
3.2.1.3. Renaming¶
The final unary operation allows for relations and their attributes to be renamed. As we will see, this operation is primarily useful in eliminating name conflicts in certain binary operations  that is, in expressions involving two relations in which the name of some attribute is the same in both relations. The general form of the renaming operator lets us provide new names for the relation and all of its attributes:
This results in a relation with the name mybooks with attributes b_id, a_id, title, and year. The tuples of the new relation have the same values as the tuples of the old relation, but the values are associated with the new attribute names.
As in this example, it is not necessary to alter the name of every attribute (we left unchanged the attribute names title and year), but some name must be provided for every attribute. A nonstandard alternative notation allows us to rename only the attributes we want to change:
We can optionally leave out either the relation name or the list of attributes. For example, the following expression is correct and results in a relation named books with attributes book_id, author_id, title, and publication_year:
3.2.2. Cross products and joins¶
We now turn our attention to operations which extend tuples in one relation with tuples from another relation. For this section, we will be using books and a second relation, authors:
author_id 
name 
birth 
death 

1 
Ralph Ellison 
19140301 
19940416 
2 
Jhumpa Lahiri 
19670711 

3 
Isabel Allende 
19420802 

4 
N. Scott Momaday 
19340227 

5 
Ursula K. Le Guin 
19291021 
20180122 
6 
J.R.R. Tolkien 
18920103 
19730902 
7 
Kazuo Ishiguro 
19541108 
The authors relation has a primary key of author_id. The books relation is related to authors via a foreign key on author_id.
3.2.2.1. Cross product¶
The cross product (or Cartesian product) of two relations A and B is a new relation containing all tuples that can be created by concatenating some tuple from B onto some tuple from A [1]. Here we are using the definition of tuple as an ordered list of values. The attributes of the new relation are normally the attributes of A and B concatenated. However, if there is a name collision, e.g., if both A and B have some attribute x, we will disambiguate the attributes in the new relation by prepending the relation names, that is, the cross product will have attributes A.x and B.x; we can avoid having to do this if we first apply renaming to one relation or the other.
The cross product operator is denoted \(\times\), and is written between its two operands. To start, consider two rather abstract relations S and T:
u 
v 

1 
one 
2 
two 
x 
y 
z 

green 
3.1415 
apple 
blue 
2.71828 
pear 
yellow 
1.618 
mango 
We write the cross product of S and T as:
which gives us the relation containing every pairing of a tuple from S with every tuple from T:
u 
v 
x 
y 
z 

1 
one 
green 
3.1415 
apple 
1 
one 
blue 
2.71828 
pear 
1 
one 
yellow 
1.618 
mango 
2 
two 
green 
3.1415 
apple 
2 
two 
blue 
2.71828 
pear 
2 
two 
yellow 
1.618 
mango 
From the definition, it is trivial to determine that the size of the cross product is the product of the sizes of the operands.
3.2.2.2. Join¶
The cross product is a fundamental operation in relational algebra, but not a generally useful one when we consider actual data. Consider the cross product of books and authors:
The full set of tuples in this relation is large (the number of books multiplied by the number of authors), so we only show a subset below:
book_id 
books.author_id 
title 
year 
authors.author_id 
name 
birth 
death 

1 
3 
The House of the Spirits 
1982 
1 
Ralph Ellison 
19140301 
19940416 
1 
3 
The House of the Spirits 
1982 
2 
Jhumpa Lahiri 
19670711 

1 
3 
The House of the Spirits 
1982 
3 
Isabel Allende 
19420802 

2 
1 
Invisible Man 
1952 
1 
Ralph Ellison 
19140301 
19940416 
2 
1 
Invisible Man 
1952 
2 
Jhumpa Lahiri 
19670711 

2 
1 
Invisible Man 
1952 
3 
Isabel Allende 
19420802 
The author of The House of the Spirits is Isabel Allende. What meaning, then, can we make of a tuple that pairs The House of the Spirits with the author Ralph Ellison (the author of Invisible Man)?
We are typically interested in pairing only certain tuples of a relation with certain tuples of another. In the above example, we are interested in tuples where the author_id attribute from books agrees with the author_id attribute from authors. This relationship is indicated not only by the names we have used for attributes, but also by the foreign key constraint on books and authors. To retain only the tuples with matching author_id values, we can apply a selection operation to the result of our cross product:
This yields a useful result:
book_id 
books.author_id 
title 
year 
authors.author_id 
name 
birth 
death 

1 
3 
The House of the Spirits 
1982 
3 
Isabel Allende 
19420802 

2 
1 
Invisible Man 
1952 
1 
Ralph Ellison 
19140301 
19940416 
3 
6 
The Hobbit 
1937 
6 
J.R.R. Tolkien 
18920103 
19730902 
4 
2 
Unaccustomed Earth 
2008 
2 
Jhumpa Lahiri 
19670711 

5 
6 
The Fellowship of the Ring 
1954 
6 
J.R.R. Tolkien 
18920103 
19730902 
6 
4 
House Made of Dawn 
1968 
4 
N. Scott Momaday 
19340227 

7 
5 
A Wizard of Earthsea 
1968 
5 
Ursula K. Le Guin 
19291021 
20180122 
Since this pattern of applying a selection after a cross product is so common, we have an operator that combines the two into an operation known as a join [2]. Using the join operator, the above expression becomes:
or, you can instead format the expression as:
Note that one tuple from authors does not contribute to the join. This tuple’s author_id matches none of the tuples in books, and thus no combined tuple using it can appear in the join result. We call this tuple a dangling tuple. Dangling tuples may be an indication of a problem in the data; in this example, it may suggest that we are missing information about books by one author.
3.2.2.3. Thetajoin and equijoin¶
While an equality condition is typically used in joins, more generally any condition of the following form can be used:
where A.x is an attribute from one relation, B.y is an attribute from the other relation, and \(\Theta\) is a comparison operator (such as =, <, etc.). A condition of this form is known as a theta condition, and a join using such a condition or a conjunction (AND) of such conditions is known as a thetajoin.
A thetajoin using only equality comparisons (as in our example above) is further known as an equijoin.
This terminology is not especially important in understanding the algebra, but is something you may encounter if you intend a deeper study of relational algebra.
3.2.2.4. Natural join¶
When we join books with authors we run into the issue that both relations contain an attribute named author_id. Since a relation cannot have more than one attribute with the same name, joining (or taking a cross product of) these two relations requires us to rename the attributes in some fashion. This can be done either by an explicit renaming operation prior to joining or by prepending the original relation name (as we did in our example). Because our join condition was equality on the author_id attributes, both the books.author_id and authors.author_id in the resulting relation always agree. This unnecessary redundancy can be removed using projection and renaming.
In this special situation in which we wish to join specifically by equating the attributes with the same names in both relations  subsequently removing the “duplicate” attributes  we can instead do a natural join. We can indicate a natural join using the join operator with no conditions [3]:
which yields the simplified relation:
book_id 
author_id 
title 
year 
name 
birth 
death 

1 
3 
The House of the Spirits 
1982 
Isabel Allende 
19420802 

2 
1 
Invisible Man 
1952 
Ralph Ellison 
19140301 
19940416 
3 
6 
The Hobbit 
1937 
J.R.R. Tolkien 
18920103 
19730902 
4 
2 
Unaccustomed Earth 
2008 
Jhumpa Lahiri 
19670711 

5 
6 
The Fellowship of the Ring 
1954 
J.R.R. Tolkien 
18920103 
19730902 
6 
4 
House Made of Dawn 
1968 
N. Scott Momaday 
19340227 

7 
5 
A Wizard of Earthsea 
1968 
Ursula K. Le Guin 
19291021 
20180122 
3.2.3. Set operations¶
Unsurprisingly, given that relations are sets, relational algebra includes the usual set operations  union, intersection, and set difference  with some restrictions. These binary operations are denoted by:
\(\cup\): union
\(\cap\): intersection
\(\): set difference
Given two relations A and B, the union \(\text{A} \cup \text{B}\) is the set of all tuples that exist in A, or exist in B, or both. The intersection \(\text{A} \cap \text{B}\) is the set of all tuples that exist in both A and B. Finally, the set difference \(\text{A}  \text{B}\) is the set of all tuples that exist in A but do not exist in B.
For example, let A and B be the relations below:
x 
y 

apple 
42 
orange 
19 
cherry 
77 
x 
y 

banana 
8 
apple 
42 
coconut 
17 
Then we have:
x 
y 

apple 
42 
orange 
19 
cherry 
77 
banana 
8 
coconut 
17 
x 
y 

apple 
42 
x 
y 

orange 
19 
cherry 
77 
x 
y 

banana 
8 
coconut 
17 
Note that union and intersection are commutative, but set difference is not.
The important restriction on set operations in relational algebra is that the relations must be compatible in terms of their schemas. The meaning of “compatible” varies, but for our purposes, assume we view the tuples in a relation as ordered lists, where each position in the list is associated with a particular attribute and type domain. Then, if we have two relations, we require that, for a given position in the tuples in either relation, the attribute and type domain are the same. For A and B shown above, we might assert that the first position corresponds to attribute x and contains character strings, while the second position (y) contains integers.
A looser requirement allows attribute names (but not type domains) to differ between relations. This requirement aligns less closely with the second definition of tuple given in the previous chapter, but it eliminates the occasional need for renaming operations prior to applying set operations. If the attribute names do not match in the two relations, we adopt the attribute names from the lefthand operand for the result relation.
While intersection is a useful operation, it is not strictly needed for the algebra, as the same result can be obtained using set difference:
3.2.4. Division¶
The operations described above are sufficient for most query needs. However, one other binary operation, division, is typically included in the basic relational algebra. To divide a relation P by another relation R, we write:
Division is the most difficult operation to describe; in a very loose sense it acts as a kind of inverse to a cross product. That is, if P, Q, and R are relations and
then it is true that
However, the reverse is not necessarily true. Rather, let P be some relation, with attributes x and y [4]. We require that R has attribute y. Then \(\text{P} \div \text{R}\) will contain the values of x which are paired (in P) with every value of y listed in R.
We will start with an abstract example. Let P be the relation pictured below:
x 
y 

1 
blue 
1 
green 
1 
yellow 
2 
blue 
2 
yellow 
3 
blue 
3 
green 
3 
yellow 
3 
red 
Let R be
y 

blue 
green 
yellow 
Then \(\text{Q} = \text{P} \div \text{R}\) is
x 

1 
3 
because only the values 1 and 3 are paired with blue, green, and yellow in P. The value 2 is not paired with green, so it does not appear in the quotient. The value 3 is also paired with red, but red is not in R and thus does not affect the result.
For a more tangible example, consider the following relation, named authors_awards:
author 
award 

Ralph Ellison 
National Book Award 
Jhumpa Lahiri 
Pulitzer Prize for Fiction 
N. Scott Momaday 
Pulitzer Prize for Fiction 
Ursula K. Le Guin 
Hugo Award 
Ursula K. Le Guin 
Nebula Award 
C. J. Cherryh 
Hugo Award 
Kazuo Ishiguro 
Booker Prize 
Kazuo Ishiguro 
Nobel Prize in Literature 
Michael Chabon 
Hugo Award 
Michael Chabon 
Nebula Award 
Michael Chabon 
Pulitzer Prize for Fiction 
and the relation science_fiction_awards:
award 

Hugo Award 
Nebula Award 
We might ask the question, “Which authors have received all of the science fiction book awards?” The answer is given by
author 

Ursula K. Le Guin 
Michael Chabon 
Like the join and set intersection operations, division can be accomplished using other relational algebra operations; however, the construction is fairly complex. If we have relation P with attributes x and y, and relation R with attribute y, then
By carefully applying the righthand side expression above to one of our examples, you can verify that the desired result is obtained, but the basic intuition is that we must first find the values of x in P which are not paired (in P) with one or more y values listed in R, and then subtract that list of x values from the list of all x values in P:
Create a relation containing every x value in P paired with every y value in R:
Subtract (using set difference) P from the cross product result above. These are the possible pairings of x (in P) and y (in R) that do not exist in P:
Project the last result onto attribute x. These are the x values that are not paired with some value from R:
Subtract the last result from the set of all x values in P for the final solution:
3.2.5. Queries¶
As we have seen, the operations of the relational algebra act on relations and result in relations, and thus we can apply relational operations sequentially to obtain a final desired result. With the operations we have discussed, we can express a very wide array of queries (questions to be answered by the data). We have seen examples of simple queries throughout this chapter, mostly involving one or two basic operations.
Even simple questions, however, can require the application of multiple operations. Consider the question, “What books by J.R.R. Tolkien were published after 1950?”. This is similar to a question we asked earlier, using the author ID value rather than the author’s name. With only the author’s name, we have to do a bit more work.
There are many ways to get to our desired result. One possible approach might begin with the conditions presented: the author is J.R.R. Tolkien, and the publication year is greater than 1950. Author names are in the authors relation, while publication years are in the books relation. So we might guess we need two selection operations, one on each relation:
and
This gives us two relations which are related by the author_id attribute present in both. So a natural join might be our next step:
Finally, we are only interested in the book titles (or possibly titles and publication years), so we finish with a projection operation:
This is only one of many possible expressions that yield identical results. Here are some equivalent expressions:
3.2.5.1. Operation sequences¶
As queries become more complex, expressions like the ones shown above can become quite long and difficult to understand. An alternative approach is to use intermediate variables to decompose and label the parts of our expression. The result is a more sequential view of the operations.
We will demonstrate this approach with one of the queries from the last section:
Using variables, we can write this as a sequence of operations:
with R holding our final result.
3.2.5.2. Expression trees¶
Another representation of relational algebra expressions is in the form of a tree. Expression trees are a useful visual representation of a query.
We will make use of them again in Chapter XXX, which is concerned with how database software considers different action plans for executing a query.
We will again demonstrate using the query:
The tree representation of this query looks like:
Operations start at the bottom of the tree, with the relations authors and books, and proceed upwards. We can apply either selection operation first, then the other; both must be applied before we can perform the join, and we finish with the projection.
Here is another example, corresponding to the expression:
The tree is:
❏
Notes