Appendix A: Example datasets used in this book

This appendix contains descriptions of the tables in the textbook database as well as SQL scripts and data files that can be used to recreate the book’s database in other database systems.

Dataset descriptions

In this section we provide descriptions of each dataset or group of related tables. The tables are described in detail in tabular format. Entity-relationship diagrams (ERDs) and logical models using crow’s foot notation (as described in Part 2) are also provided for some datasets. Finally, each dataset section contains an interactive query tool set up to retrieve the data from all of the tables in the dataset (note that the interactive query tool limits results to 100 rows, and some tables have more than that number of rows).

The simple books dataset

The simple books dataset consists of the tables simple_books and simple_authors. The simple_books table contains data about 12 books, each by a different author, while the simple_authors table contains data on each of the 12 authors appearing in simple_books.

This dataset is used in the early chapters of Part 1 of the text to introduce SQL, and is not intended as an example of good database design. The rows in simple_books are implicitly identified by unique book titles, and the rows in simple_authors are implicitly identified by author names, but neither table has a primary key. Similarly, the rows in simple_books are in one-to-one correspondence with the rows in simple_authors, however, there is no explicit relationship between the tables via a foreign key constraint.

The simple_books table records works of fiction, non-fiction, poetry, etc. by a single author.

column

type

description

author

character string

the book’s author

title

character string

the book’s title

publication_year

integer

year the book was first published

genre

character string

a genre describing the book

The simple_authors table records persons who have authored books.

column

type

description

name

character string

full name of the author

birth

date

birth date of the author, if known

death

date

death date of the author, if known

Use the query tool below to view the data.

To see the CREATE TABLE code for a table, you can use the query tool above to query the sqlite_master table, e.g.:

SELECT sql FROM sqlite_master WHERE name = 'simple_books';

The expanded books dataset

The expanded books dataset includes information on 200 books (in the books table) and their authors (76 authors in total in the authors table). An awards table lists some major awards given to authors for their body of work or for specific books. Two cross-reference tables associate authors with their awards (authors_awards) and specific books with their awards (books_awards). Finally, the editions table contains publication data related to just four books by author J.R.R. Tolkien. (Only four books were included to reduce the size of the database file, which must be downloaded and held in the web browser’s memory.) The publication data is particularly “dirty” in the sense that it contains numerous inaccuracies, omissions, and redundancies; see the section title Data collection notes, below.

This dataset is introduced in Chapter 1.4 and is used throughout much of the rest of Part 1 of the text. While the dataset represents a highly simplified model of books (e.g., books are assumed to always have a single author), the schema attempts to emulate best practices in database design while illustrating fundamental SQL database concepts. The design makes use of primary key constraints (using synthetic unique identifiers), foreign key constraints, and cross-reference tables implementing many-to-many relationships.

We provide below detailed descriptions of each table in the dataset, an entity-relationship diagram (ERD) modeling the data, and a logical model in crow’s foot notation showing the tables and their relationships.

The authors table records persons who have authored books. Every author corresponds to at least one book in the database.

column

type

description

author_id

integer

unique identifier for the author

name

character string

full name of the author

birth

date

birth date of the author, if known

death

date

death date of the author, if known

The books table records works of fiction, non-fiction, poetry, etc. by a single author. Each book corresponds to a single author from the authors table, and may correspond to many editions of the book listed in the editions table.

column

type

description

book_id

integer

unique identifier for the book

author_id

integer

author_id of book’s author from authors table

title

character string

the book’s title

publication_year

integer

year the book was first published

The editions table records specific publications of a book. Each edition corresponds to a single book from the books table. For space reasons, the editions table only includes data on four books by J.R.R. Tolkien.

column

type

description

edition_id

integer

unique identifier for the edition

book_id

integer

book_id of the book (from books table) published as this edition

publication_year

integer

year this edition was published

publisher

character string

name of the publisher

publisher_location

character string

city or other location(s) where the publisher is located

title

character string

title this edition was published under

pages

integer

number of pages in this edition

isbn10

character string

10-digit international standard book number

isbn13

character string

13-digit international standard book number

The awards table records various author and/or book awards.

column

type

description

award_id

integer

unique identifier for the award

name

character string

name of the award

sponsor

character string

name of the organization giving the award

criteria

character string

what the award is given for

The authors_awards table is a cross-reference table (explained in Chapter 1.4) relating authors and awards; each entry in the table records the giving of an award to an author (not for any particular book) in a particular year.

column

type

description

author_id

integer

author_id of the author receiving the award

award_id

integer

award_id of the award received

year

integer

year the award was given

The books_awards table is a cross-reference table relating books and awards; each entry in the table records the giving of an award to an author for a specific book in a particular year.

column

type

description

book_id

integer

book_id of the book for which the award was given

award_id

integer

award_id of the award given

year

integer

year the award was given

Here is the data model for the expanded books dataset, as an ERD:

A data model of the books dataset given as an entity-relationship diagram.

A logical model of the expanded books dataset is shown below. In this crow’s foot diagram, primary keys are shown underlined and in boldface, while foreign keys are italicized.

A crow's foot diagram showing the logical model of the books dataset.

Use the query tool below to view the expanded books data. Note that the query tool limits results to 100 rows, but the books and editions tables have more than 100 rows each.

The bookstore dataset

The bookstore dataset consists of the tables bookstore_inventory and bookstore_sales. These tables simulate a very simple database that a seller of used books might use. The bookstore_inventory table represents books that the bookseller is selling. The table actually includes some books that have already been sold - when a book is sold, an entry is added to the companion table, bookstore_sales. The bookseller thus has to look at both tables to determine which books are actually available to be sold. The bookseller may choose to periodically remove old records from the tables.

While students should feel free to modify any data within the textbook database (secure in the knowledge that any changes will be reverted when they reload the page in their browser), the bookstore dataset was specifically designed with data modification in mind. To avoid complications in introducing data modification queries, there are no foreign key constraints on the tables. However, the tables as provided have a one-to-one relationship: each record in bookstore_sales matches exactly record in bookstore_inventory, while each record in bookstore_inventory matches at most one record in bookstore_sales.

This dataset is used briefly to illustrate one-to-one relationships in Chapter 1.4. In Chapter 1.5, the tables are used to demonstrate data modification queries. The tables are used in examples of grouping and aggregation in Chapter 1.9.

The bookstore_inventory table contains information on new and used books for sale.

column

type

description

stock_number

integer

unique key identifying a particular copy of a book

author

character string

the author of the book

title

character string

the title of the book

condition

character string

the condition of the book (new, good, fair, etc.)

price

fixed-point

the price of the book, in some unit of currency

The bookstore_sales table gives information about the sales of books from bookstore_inventory.

column

type

description

receipt_number

integer

unique key identifying this sale

stock_number

integer

key of the book copy (from bookstore_inventory) that was sold

date_sold

date

the date on which the books was sold

payment

character string

the method of payment used in the sale (case, credit card, etc.)

Use the query tool below to view the data.

Other tables in the database

In addition to the above, the database contains some very small tables which are used to illustrate various points in the text: fruit_stand, s, s2, s3, t, t2, and t3. As these tables are give in full in the text, we do not further explain them here. However, you can use the interactive query tool below to view the data.

Data collection notes

The selection of books in the simple and expanded books datasets was driven by several interests. First, the textbook’s author is a fan of science fiction and fantasy - these genres are thus perhaps overrepresented in the dataset. Second, an attempt was made to include a diverse set of authors, in hopes of providing some common points of reference for students everywhere. (There is nevertheless a strong bias towards English-language literature, reflecting the context of the textbook’s author.) Lastly, the selection favored books by authors who have received literary awards. To keep things manageable, the number of books was capped at 200.

Authors were handpicked, based on the textbook author’s knowledge and on internet searches of major literary awards. Wikipedia was extremely helpful in collecting data such as authors’ birth and death dates. (Note that the dataset was finalized in 2022; it is very possible that some authors in the dataset have died since the data was entered. This inaccuracy is regrettable but unavoidable under the circumstances.)

Some authors are represented by a single work (typically an award winning or at least widely acclaimed book), while in other cases many books by the author were included. There was no consistent decision process used to decide how many books to include by each author. Many choices were made in support of specific examples in the text. Most of the data regarding individual books was hand collected from Wikipedia and other freely available sources on the internet. Some publication year values were extracted from the freely available Open Library dataset.

Data on author and book awards was hand collected from Wikipedia and the websites representing the awards. Data on the many editions of works comes entirely from the Open Library.

To the best of the author’s knowledge, all of the data in the simple and expanded books datasets is in the public domain.

The bookstore dataset uses books from the expanded books dataset, but is otherwise entirely fabricated, as are the fruit stand and other abstract example tables.

Getting the data

Part 1 of this book includes interactive elements allowing the reader to work directly with a relational database. This functionality lets students immediately try example code on a real database system. As the database available on each page is actually a copy (in memory) of a fixed database, changes to the database do not persist over time - refreshing the browser window will return the database to the same initial state each time. This is useful in that students can safely experiment with destructive SQL commands, knowing that no changes are permanent. On the other hand, it means that students cannot use the system for longer term projects.

The database system used in this textbook is SQLite. While SQLite is a powerful and popular relational database system, it lacks some features of the client-server database systems commonly used in industry. It also differs in significant ways from the SQL standard (notably with its use of dynamic typing).

For these reasons, users of this textbook may wish to set up their own database system. Many different database systems are available, each with their own system requirements and installation procedures. There are likewise many ways to access and query each database system. Instructions for setting up and accessing different systems are therefore out of the scope of this textbook. However, in the interest of providing a transition from the textbook’s database to the users’ systems of choice, we provide scripts and data files below, which can be used to recreate the book’s database on the database system of your choice.

SQLite

The SQLite database engine is unique (in the set of systems supported by this textbook) in that it works with databases stored entirely in a single file. SQLite databases can be shared by simply copying the files containing them. We provide below the database file used by this textbook. Additionally, we provide a SQL script (a text file in UTF-8 Unicode format) with the SQL commands necessary to re-create the database from scratch. Use the former if you simply want a copy of the textbook’s database for your own use; use the latter if you want to create the textbook’s database tables within an existing SQLite database. The SQL script contains only CREATE TABLE and INSERT statements, and therefore should not replace existing tables within the database.

This SQL script was verified to work correctly using version 3.39.2 of the sqlite3.exe program on Windows 10 and with version 3.31.1 of the sqlite3 program running on linux (Linux Mint 20.3 with kernel version 5.15.0-41).

PostgreSQL

The SQL script below can be used to create the equivalent of the textbook’s tables in a PostgreSQL database. The SQL script contains only CREATE TABLE and INSERT statements, and therefore should not replace existing tables within the database.

This SQL script was verified to work correctly using version 12.5 of the psql.exe program on Windows 10 and with version 12.11 of the psql program running on linux (Linux Mint 20.3 with kernel version 5.15.0-41), loading into a PostgreSQL version 12.8 instance running on linux (Linux Mint 20 with kernel version 5.4.0-86). Note: on Windows 10, you may need to first issue the command \encoding utf8 (at the psql command line) before loading or querying the data. This setting may not be sufficient to ensure all characters can be viewed correctly when returned by a query, but the data can be loaded correctly.

Notable differences from the textbook:

  • As described in the text, SQLite does not use a standard SQL approach to automatically generate sequential ID values. The AUTOINCREMENT option used in the SQLite database (in bookstore_sales and bookstore_inventory) is not available in PostgreSQL, but the standard SQL GENERATED BY DEFAULT AS IDENTITY option is. Accordingly, the PostgreSQL script uses the standard approach. The two options behave slightly differently.

MySQL

The SQL script below can be used to create the equivalent of the textbook’s tables in a MySQL database. The SQL script contains only CREATE TABLE and INSERT statements, and therefore should not replace existing tables within the database.

This SQL script was verified to work correctly using version 8.0.29 of the MySQL Shell (mysqlsh.exe) program on Windows 10 and with version 8.0.29 of the mysql program running on linux (Linux Mint 20.3 with kernel version 5.15.0-41), loading into a MySQL version 8.0.26 instance running on linux (Linux Mint 20 with kernel version 5.4.0-86).

Notable differences from the textbook:

  • The bookstore_sales table in the textbook database has a DEFAULT clause to set the date_sold column to the current date when no value is provided for the column. MySQL does not permit default setting for columns of type DATE, but does allow it for the TIMESTAMP type. Accordingly, the date_sold column in the MySQL script is of type TIMESTAMP (and thus includes time as well as date).

  • The MySQL AUTO_INCREMENT option is very similar to the AUTOINCREMENT option in SQLite (used in bookstore_sales and bookstore_inventory), but may have slightly different behavior.

  • The bookstore_inventory table contains a column named condition. This is a reserved keyword in MySQL, which means that queries such as “SELECT DISTINCT condition FROM bookstore_inventory” will fail unless you put backticks around the word “condition”. (The backtick character looks like an apostrophe, but slanting in the opposite direction.)

Oracle

The SQL script below can be used to create the equivalent of the textbook’s tables in an Oracle database. The SQL script contains only CREATE TABLE and INSERT statements and statements setting temporary session variables, and therefore should not replace existing tables within the database.

This SQL script was verified to work correctly using Oracle’s SQLcl utility (release 22.2) running on linux (Linux Mint 20.3 with kernel version 5.15.0-41) with OpenJDK version 11.0.15, loading into an Oracle Database XE 18c instance running on linux (openSUSE Leap 15.2 with kernel version 5.3.18). On Windows 10 with SQLcl (release 22.2), the script ran without reporting errors, but some character values were loaded incorrectly. Note: if you run this script with SQLcl or SQL*Plus, you must uncomment the command SET DEFINE OFF at the top of the script. Otherwise, the program will interpret any & characters to imply a variable substitution sequence, which will halt the script and prevent the data from loading correctly.

Notable differences from the textbook:

  • As described in the text, SQLite does not use a standard SQL approach to automatically generate sequential ID values. The AUTOINCREMENT option used in the SQLite database (in bookstore_sales and bookstore_inventory) is not available in Oracle, but the standard SQL GENERATED BY DEFAULT AS IDENTITY option is. Accordingly, the Oracle script uses the standard approach. The two options behave slightly differently.

SQL Server

The SQL script below can be used to create the equivalent of the textbook’s tables in a Microsoft SQL Server database. The SQL script primarily contains CREATE TABLE and INSERT statements, and should not replace existing tables within the database. The USE statement at the top of the script assumes that data will be loaded into an existing database named “textbook”. This statement is needed for use with the sqlcmd utility and possibly other software, and should be edited to indicate the correct database as needed. The statement may not be needed in other client software programs.

This SQL script was verified to work correctly using the sqlcmd program (version 17.10.0001.1) running on linux (Linux Mint 20.3 with kernel version 5.15.0-41), loading into a SQL Server 2019 instance running on linux (Linux Mint 20 with kernel version 5.4.0-86). On Windows 10 with version 15.0.2000.5 of sqlcmd.exe, the script ran without reporting errors, but some character values were loaded incorrectly. Note that a collation supporting UTF8 must be used (this can be set on the database using an ALTER DATABASE statement); the test system used the “Latin1_General_100_CI_AS_SC_UTF8” collation.

Notable differences from the textbook:

  • SQL Server generates sequential integer values for columns with the IDENTITY property, which differs in behavior compared to both the standard SQL GENERATED BY… and SQLite’s AUTO_INCREMENT.

Raw data files

If you wish to use a database other than one of those listed above, you can likely adapt one of the above scripts for use with your database, using the find/replace function of a text editor and some trial and error. Alternatively, you may create the desired tables manually, and load the data from the data files in the zip archive linked below - most database systems provide mechanisms to load data from formatted files. The data files are in comma-separated value (CSV) format, and are encoded in UTF-8. Each file includes a header row with labels matching the column names from the textbook database. If your database system does not support this file format, you may be able to open the files with a spreadsheet program and then export a format that is supported by your system.


You have attempted of activities on this page