Relational Databases

Goals

Concepts

Preparation

  1. Download and install pgModeler.
    1. Use pgModeler version 0.9.1-alpha1 or later.

Lesson

Persisting data is essential in almost every application. Some applications create simple data that must be remembered, such as settings changes and log information. Other data programs need to query or look up data that is already saved somewhere based upon some selection criteria.

For small data sets, the object serialization techniques you've learned in previous lessons may be sufficient. But as soon as the amount of data grows to a moderate level, serializing individual objects becomes difficult, especial when it comes to managing the relationships among the objects. A database typically stores general data, encodes relationships among the data, and allows queries to find data later.

Relational Databases

Some of the simplest databases are stored in so-called flat files, with rows of text data separated by commas or tabs. In flat-file databases, the consumer application usually must “connect the dots”, as it were, and not only determine what the data means but also how pieces of the data relates. If a library is available for taking care of some of this work of reading and connecting the information so that the application doesn't have to, it is called a database system.

A database system represents entities, which are analogous to objects with unique identifiers. A Vehicle is one example of an entity; each vehicle (entity instance) might be identified by a vehicle identification number (VIN). An entity may have various attributes, such as year of production. Any one Vehicle might be related to its Manufacturer, the particular entity that produced it.

Early database systems conceptually arranged entities into trees of hierarchies. Later systems improved these data models by allowing general graph or “network” data models to indicate more complex relationships among entities. But these databases were still inflexible and were hard to query.

Databases were revolutionized in 1970 when E. F. Codd, a computer scientist at IBM, proposed A relational model of data for large shared data banks. Based upon a rigorous mathematical foundation, Codd's proposal resulted in the development of the relational database management system (RDBMS), which remains the most popular database system in use today. An RDBMS allows entities to be represented in general ways, and then queried in an almost unlimited fashion.

Relations

The fundamental modeling unit of an RDBMS is a relation, a mathematical concept that arises from arranging data into rows and columns. Each relation represents an entity. Each row represents an instance of that entity, with each column representing an attribute of that the entity.

Initial “Item Stock” table from Indirection lesson.
Item Stock
Item Stock Supplier
table 20 Acme Furniture Company, 123 Some Street
chair 15 Acme Furniture Company, 123 Some Street
balloon 99 Bozo Toy Company, 321 Other Street
kite 5 Bozo Toy Company, 321 Other Street
doll 10 Bozo Toy Company, 321 Other Street

Think back to the very first lesson of this course, which explained the concept of indirection. The store owner in that lesson needed to keep track of the stock of toys and furniture, along with their suppliers. The store owner, perhaps using only a pad of paper and a pencil, may have referred to this as a “list”. As you completed the task for that lesson, you may have thought of the information as being arranged in a “table”. In the relational model, the correct terminology for this arrangement of data is a relation.

But a relation is not just any tabular representation of information. A relation has certain formal pieces that must adhere to certain rules for the relational model to work. You may hear people refer to “tables”, “rows”, and “columns” when discussing relational database management systems, but relations are not general tables. The rows and columns of a relation have special rules; ignoring them will cause problems in your model and provide erroneous results in your queries. In casual conversation you may use the words “tables”, “rows”, and “columns” to simplify communication, but don't forget that there is formal distinction between these items in common use and the formal definition of a relation.

Attributes

Each “column” in a relation represents an attribute of the relation. Each attribute is identified in a special “row” called the relation heading. Unlike general table columns, attributes follow certain rules:

Thus the Item_Stock relation above has three attributes: Item, Stock, and Supplier. Each will have a type, and they may appear in any order.

Tuples

Each “row” in a relation is referred to as a tuple, and represents an instance of the entity. Unlike general table rows, tuples follow certain rules:

The Item_Stock relation above contains five tuples, which may appear in any order.

Sets

The relational database model is based on set theory. This point is essential to understanding programming relational databases. Storing, querying, and updating data in a relational database is based on operations or functions performed on sets of data.

Thus the definition of some relation such as Item_Stock is a set of attributes. The actual relation (the data that Item_Stock contains at any particular time) is a set of tuples. Programming a relational database is essentially indicating what operations to perform on these sets of data.

Semantics

The relational model is more than just a way to organize data. The “tables”, “rows”, and “columns” mean something. As you might have guessed, each relation represents a type of entity such as a vehicle. The heading indicating what attributes a Vehicle is allowed to have.

Each tuple represents a instance of an entity, as if you created a vehicle using new Vehicle() in Java. The value in each “table cell” specifies a value of an entity's attribute. Seen in this way, each tuple is a set of logical propositions about an entity instance, asserting for example “such and such Vehicle has the color blue” and “such and such vehicle has four wheels”.

Keys

“Such and such” a vague way to identify a vehicle. A common aspect of historical database systems has been the storage of data about an entity in a record, essentially an array of variables potentially of different types. But different systems traditionally used different ways to link the records together. Part of the “secret sauce” of relational databases is that the relational model itself provides a way to identify a tuple, which is analogous to a record in previous database models. The term “record” is still used in casual speech when referring to the information in a tuple.

Any attribute or collection of attributes that uniquely identifies a tuple within a relation is called a key. In a relational database need to select one of these keys (sometimes called candidate keys) to serve as the primary key, or official unique identifier for the entity.

Constraints

A primary key is an example of a constraint, which is a rule restricting the values placed in the database. A primary key constraint, for example, merely tells the database that a particular attribute serves as the official tuple identifier. But if there are other attributes that must have different values for each entity, you can add a unique constraint that prevents duplicates.

The benefit of a constraint is that it is maintained by the database system, without any extra work on the part of the application. If your program attempts to add two tuples with attribute values that violate a unique constraint, for example, the attempt will fail with an error. Constraints thus ensure integrity of the database; any data stored is guaranteed to be bound by all constraints.

Databases

While relations and their subcomponents form the heart of a relational database, an RDBMS organizes relations into larger structures still. Although the terminology varies among database products, the following represents the general arrangement of logical structures:

schema
A schema groups a set of related tables. The word “schema” can also refer to the definition of the relations; see Schema Design below.
catalog
All the available schemas are then grouped into a catalog. In many RDBMS products, a catalog is simply called the “database”. Some products use the word “catalog” to refer to special system schemas holding available types and other metadata.
cluster
Multiple catalogs or databases can in turn be grouped into even larger arrangements called “clusters”. In some products a “cluster” may be synonymous with “database server”, the installation of the product on a particular machine. Historically the word cluster referred to the physical grouping of some database elements into a single file for faster access.

Schema Design

In colloquial use the word “schema” can also mean a set of rules defining relational structures such as relations, attributes, and keys. Designing the schema is similar to designing the classes and methods of an object-oriented program. As with classes, the design of tables is derived from the domain model that defines the conceptual view of the program. (Here the “domain” of the program's domain model, which refers to the world-view the program works with, is different from the “domain” of a relation attribute, which refers to the attribute's value type.)

In creating your program's domain model you used UML to create a “sketch” of of the central objects in the domain, and used the same modeling language to produce a “blueprint” of how those objects could be turned into Java classes. In a similar way you will take the original domain model and create a rough view of of the tables, attributes, and relationships in your database schema. But the initial plan for a database schema is almost never sufficient; you will need to refine your schema so that it represents the data in the most efficient way for the relational paradigm.

Entity Relationship Diagram

While a UML class diagram is the diagram of choice for illustrating the domain model as well as the blueprint of actual classes, the database world mostly uses the Entity-Relationship (ER) diagram. Charles Bachman was one of the first to use diagrams to represent data in the 1960s. Building on Bachman's notation, Peter Chen in the 1970s developed the ER approach, in which rectangle represent entities and lines connecting the rectangles indicate relationships.

Chen's original notation for describing relationships included a diamond to describe the relationship. Because the diamond takes up extra space in a diagram, some authors simplify the notation to a single line, with the name of the relationship above the line. Nowadays the most popular ER diagram notation is the information engineering (IE) or “crow's foot” notation, named for the diagram's approach for representing relationship multiplicity, as explained in Cardinality below.

Cardinality

UML class diagrams come with a notation for indicating the “multiplicity” of a relationship, that is the minimum and maximum of each entity type that can appear in the relationship. For example in UML 1..* indicates “one or more”. The multiplicity consists of two parts: the minimum number (here 1 meaning “one”), and the maximum number (here * meaning unlimited or “many”).

The maximum number of entities that can appear on one side of a relationship is called the cardinality. Referring to an entity on the left of the relationship line, Crow's foot notation indicates a “one” cardinality with a bar |, and a  “many” cardinality with a greater-than symbol. Together with the relationship line, the “many” symbol > appears somewhat like a bird's foot, which is how the notation got it's nickname, “crow's foot”.

There are thus three fundamental combinations of modalities in ER diagrams, just as in UML diagrams. A fourth type, a many-to-one (M:1) relationship, is simply the inverse of a one-to-many (1:N) relationship.

one-to-one (1:1)
|    |
one-to-many (1:N)
|    <
many-to-many (M:N)
>    <

There relationship representing “manufactures” between a Company and a Vehicle, for example, would have a cardinality of one-to-many or Companymanufactures  <Vehicle. Each company may manufacture many vehicles, but each vehicle is only manufactured by only one company.

Modality

While cardinality refers to the maximum number of a type of entity participating in a relationship, modality refers to the minimum number. In simplest terms this means that the entity is “optional”, denoted by a circle , or “required”, again donated by a bar |. The modality, however, is shown on the inside of the relationship line, between the cardinality symbols (that is, not next to the entity it is associated with).

The cardinality of the “manufactures” relationship would be represented as Company|| manufactures ○<Vehicle. A vehicle cannot be manufactured without a company, which is “required”, but a company may exists that has not manufactured any vehicles at all, which would indicate a modality of “optional”.

Entities

As discussed above, a relation represents a class of entities, with each tuple representing an instance of the entity. For much of your domain model, creating your schema will be as simple as creating an entity in the diagram, representing a relation, for each of your domain model classes in your class diagram.

Keys

A more difficult task is finding an attribute to serve as the primary key of an entity. A primary key must be unique and must not change. Some entities may already have an attribute, a natural key, that uniquely identifies instances. For example the automotive industry generates a vehicle identification number (VIN) that uniquely identifies each vehicle produced. A VIN could be used as the primary key of a Vehicle relation in a database for a used-car sales company.

If no suitable natural key is available, you will need to create a surrogate key, an artificial identifier that brings uniqueness to the tuple. Many databases provide a facility to atomically generate an arbitrary counter when a new record is added. Another option is to use a universally unique identifier (UUID), which practically guarantees uniqueness even though they are independently generated.

Relationships

Foreign Keys
“Item Stock” and “Suppliers” tables from Indirection lesson.
Item Stock
Item Stock Supplier
table 20 ACME
chair 15 ACME
balloon 99 BOZO
kite 5 BOZO
doll 10 BOZO
Suppliers
ID Name Address
ACME Acme Furniture Company 123 Some Street
BOZO Bozo Toy Company 321 Other Street
TOP Top Toys Corporation 456 Far Avenue
These informal tables are not relations! “Item Stock” has no primary key, for example.
Department store domain model.
Department store domain model.

In a relational database, a relationship is most always modeled using a foreign key. A foreign key is merely an attribute in one relation that references the primary key of another table. Defining a foreign key constraint allows the database to ensure that for every foreign key in one relation, there exists a matching primary key int the referenced relation; this maintains the referential integrity of your database.

Using a single code to reference rows in another table is exactly the approach you intuitively used in the initial lesson to separate “Item Stock” and “Suppliers” into separate lists, reproduced in the figure. In the “Item Stock” table, the “Supplier” column served as an informal foreign key to the supplier in the “Suppliers” table. There the “ID” column services as a primary key to identify each supplier.

Association Entities

Many-to-many relationships pose an interesting problem in the relational model. Imagine that the store owner has placed each item in one or more categories. That is, each category, such as “furniture”, can contain many items such as “table” and “chair”. Additionally, some items may fall into multiple categories, such as a game table that is both “furniture” and a “toy”. There is therefore a many-to-many relationship between the Item and Category entities.

Department store domain model with category.
Department store domain model with categories.

Simply adding a foreign key to the Item and/or Category entities is insufficient to represent a many-to-many relationship. The solution is to add yet another layer of indirection by creating an entity that represents the relationship of a particular item being in a particular category. This type of entity is called an association entity. It goes by many names including “junction table”.

Furniture schema.
Department store domain model with categories.
Generalization
Vehicle class hierarchy with properties.
Vehicle class hierarchy with properties.

Mapping an inheritance hierarchy of super classes and subclasses (what UML refers to as “generalization” in class diagrams) to relations requires additional thought. In your object-oriented domain model, generalization indicates that some some class contains common properties that apply to several child classes. The child subclasses likely contain their own specialized properties in addition to those of the super class. Although some database products have added object-oriented features, the relational model does not include the concept of entities that are specializations of other entities, so you will have to use special techniques to represent this in your schema.

To use the examples from the lesson on inheritance, an AbstractVehicle base class might have a color property which would apply to all vehicles. You might have created AbstractWheeledVehicle to serve as a base class to all vehicles with wheels; it might add a wheelCount property. The concrete class Truck might indicate how much of a load it can carry, in some unit such as kilograms. A separate Boat class would inherit the color property from AbstractVehicle, while adding an addition property hullCount to indicate the number of main body portions the boat has.

There are three common approaches for representing inheritance in a relational database schema, each with benefits and trade-offs. You will need to evaluate the approach that best fits your design and use cases. This introduction uses the terminology from Martin Fowler's Patterns of Enterprise Application Architecture; more information and a reference site can be found at the end of this lesson.

single table inheritance
You could create a single relation to represent the entire hierarchy of vehicles. It would need some attribute representing which concrete class was instantiated. The relation heading would include all the possible attributes in the entire hierarchy: in this case color, wheelCount, maxLoad, and hullCount. Each tuple would only supply values for those attributes relating to its indicated type. This technique can result in a large, confusing table with many NULL for many attributes.
concrete table inheritance
Instead of creating a single relation for all classes, you could create a different relation for each concrete class, with each relation including all the attributes related to that class. In this example, you would create a Truck relation with color, wheelCount, and maxLoad attributes; and a Boat relation with color and hullCount attributes. This approach results in duplicate attributes across relations. The database can't enforce relationships in this pattern; you might wind up with two vehicles with conflicting primary keys because they were in different relations.
class table inheritance
Finally you could create a separate relation to represent each entity in the hierarchy—even the abstract classes. The “root” relation would ensure the vehicle identifier is unique across all vehicle types. Each superclass relation would contain attributes relating to itself and all its subclasses. Finally each relation representing a subclass would contain a foreign key indicating the tuple in the super class relation containing its super class properties. You might choose to use the same primary key values across all tables, with these same keys serving both as primary keys and as foreign keys for the subclass relations. Class table inheritance mostly closely reflects the original object-oriented domain model from a relational perspective. This arrangement makes the schema and consequently the application logic more complicated, requiring more complex queries to retrieve the data for a single object instance.

Normalization

The relational model is flexible and permits data to be stuffed into relations in many forms. But not every schema follows relational principles. The relational model has a rigorous mathematical foundation, and it in order to efficiently represent and query information, the schema design should follow certain rules. Breaking these rules may result in duplicated and inconsistent data, as well as make it hard to retrieve data once the data is stored.

Normalization is the process of changing a schema to follow relational principles, preventing data duplication and allowing for flexible querying. Normalization could be considered a type of refactoring of a schema, except that it follows a more formal set of rules. The primary goal of normalization is simple: each tuple in a relation should represent a single instance of a “thing”, identifiable by its primary key. More formally the concept is known as functional dependence (FD), and ensures that the primary key functions like a key in a map, “determining” the values of the other attributes of a tuple.

Normal forms.
5 Fifth Normal Form
4 Fourth Normal Form
3
Boyce-Codd Normal Form
Third Normal Form
2 Second Normal Form
1 First Normal Form

There are several ways that a schema could have broken this principle, which is why normalization sets out a series of normal forms that a schema can be in. The normal forms are arranged in a sequence; modifying a schema so that it is one normal form (or verifying that it is already in that normal form) guarantees that it adheres to that form and the normal forms before it. There are at least seven normal forms, and academics are coming up with more all the time.

First Normal Form

Expanded “Item Stock” table from Indirection lesson.
Item Stock
Item Code Item Stock Supplier Categories
100 table 20 Acme Furniture Company, 123 Some Street furniture
200 game table 12 Acme Furniture Company, 123 Some Street furniture, toy
300 chair 15 Acme Furniture Company, 123 Some Street furniture
400 balloon 99 Bozo Toy Company, 321 Other Street toy
500 kite 5 Bozo Toy Company, 321 Other Street toy
600 doll 10 Bozo Toy Company, 321 Other Street toy

The most basic form of normalization, First Normal Form (1NF), requires that each attribute value be atomic, that is, it can't be broken down into smaller pieces. Take another look at the “Item Stock” table the store owner was using in the lesson on indirection. The store owner has decided to note the categories of each item. There are two ways in which this table is not in 1NF:

You already saw how Category really should be a separate entity. Having a many-to-many relationship with the items, an additional ItemCategory association entity should be created as well.

Second Normal Form

“Item Category” table in only 1NF.
Item Category
Item Code Category Code Category Name
100 AAA furniture
200 AAA furniture
200 BBB toy
300 AAA furniture
400 BBB toy
500 BBB toy
600 BBB toy

The Second Normal Form (2NF) requires that, in addition to being in 1NF, if there is a composite primary key, the other attributes must be functionally dependent on the entire key. Suppose that the store owner created only an Item Category table that included the item code, the category code, and the category name. This indeed allows a many-to-many relationship; note that item with code 200 (the game table) has both AAA (furniture) and BBB (toy) categories. Similarly there can be many items for one category.

The problem is that the Category Name is functionally dependent on the Category Code, which is only part of the composite key! This results in a lot of duplicated data, and can produce anomalies as explained below. The solution is to recognize that Category should be a separate entity; the Category Name should be moved to a separate Category table, and the Item Category should consist of a composite key to join the Item Stock and Category tables, as illustrated above in the section on Association Entities.

Third Normal Form and Boyce-Codd Normal Form

Suppose further than in addition to splitting out the categories, the owner created Supplier ID, Supplier Name, and Supplier Address columns in the original Item Stock table. As required by 2NF, the Supplier Name and Supplier Address are functionally dependent on the entire primary key, the Item Name. That is, any item code will completely determine the supplier ID, the supplier name, and the supplier address.

Expanded “Item Stock” table in only 1NF.
Item Stock
Item Code Item Stock Supplier ID Supplier Name Supplier Address
100 table 20 ACME Acme Furniture Company 123 Some Street
200 game table 12 ACME Acme Furniture Company 123 Some Street
300 chair 15 ACME Acme Furniture Company 123 Some Street
400 balloon 99 BOZO Bozo Toy Company 321 Other Street
500 kite 5 BOZO Bozo Toy Company 321 Other Street
600 doll 10 BOZO Bozo Toy Company 321 Other Street

But the presence of so much duplicated data should be a hint that there is a problem. Duplicated data not only takes up extra space, it may also lead to an anomaly, in which changes to the data produce other, unintended results. This table exhibits three common anomalies of a relational database that has not been normalized. These anomalies correspond to “create”, “update”, and “delete” of the CRUD operations you are familiar with.

insertion anomaly
It is impossible to add a supplier without there being some item associated with the supplier.
update anomaly
Because the supplier address is duplicated, updating the supplier address for one item, such as “chair”, results in a discrepancy with the same supplier's address as listed for the other items, such “table”.
deletion anomaly
Removing an item that is no longer sold would also remove the supplier, if that supplier didn't supply any other items.

The problem with the new Item Stock table is that the Supplier Name and Supplier Address attributes are functionally dependent on the full Item Code key, they exhibit a transitive functional dependency. Rather than being directly dependent on Item Code, they are each actually functionally dependent on Supplier ID, which in turn is dependent on Item Code.

The Third Normal Form (3NF) requries that, in addition to being in 2NF, there must be no transitive functional dependencies. The solution as before is to recognize that Supplier is a separate entity with Supplier ID, Supplier Name, and Supplier Address attributes. This in fact is exactly what the store owner did in the lesson on indirection, explained in the above section, Foreign Keys.

Products

The relational model has provided popular and useful for decades, so there are many RDBMS products available. Here are some of the significant products you should be aware of, along with any modeling software tailored specifically to that product. The popularity of a particular product does not always reflect how closely it follows the relational model or related industry standards.

Oracle Database
One of the oldest relational database products, Oracle is very popular for corporate enterprise deployments. Oracle provides the free SQL Developer Data Modeler software, written in Java, for modeling relational database for the Oracle platform. If you plan on actually connecting to an Oracle database, you should probably opt instead for the full Oracle SQL Developer software, also free, which provides live data query and manipulation capabilities in additional to all SQL Developer Data Modeler functionality.
SQL Server
SQL Server is Oracle's main enterprise competition from Microsoft Corporation. For database design Microsoft provides SQL Server Data Tools for Visual Studio as a component of its larger development environment.
PostgreSQL
With roots going back over three decades, PosgreSQL is a robust, enterprise-class, open-source database system. The open-source pgModeler software is an attractive package for database design tuned specifically for PostgreSQL. This course will primarily use PostgreSQL. You can compile the pgModeler source for free, or download a precompiled demo version supporting a limited number of tables. A full, unlimited version of the software is available for free to official students of this GlobalMentor course; contact your instructor for details.
MySQL
MySQL is another open-source database. Oracle Corporation, which now owns MySQL, makes the product available in a free community addition along with various commercial distributions. The MySQL Workbench software is a colorful, user-friendly tool for designing databases to run on MySQL. It is available in a free community edition, either bundled with MySQL or as a separate installation. You are required to create and sign in with a free Oracle account before you can download any MySQL software.
SQLite
SQLite is a public-domain database with a small footprint. It is made to be embedded in the application, and it claims to have the most installations of any database system. Although not a full modeling tool, the open-source DB Browser for SQLite provides a visual interface for designing and editing SQLite database files. The SQLite database system does not ensure that attribute values are of the same type as the the attribute domain; it should only be chosen if its ability to be embedded outweighs its lack of domain integrity.
H2
The H2 Database Engine is a small, fast, open-source, pure-Java database that is highly compliant with the most recent database industry standards. It can be used in both embedded and client-server mode, and even has an option for storing the entire database in memory. There is no data modeling software specifically targeted for H2, but many modeling tools allow connections to H2 databases.

Review

Gotchas

In the Real World

Think About It

Self Evaluation

Task

Enhance your Booker domain model to keep track of the name, address, and web site of each publisher.

Use pgModeler to design a relational database schema for your Booker domain model. Represent the schema with an ER diagram using crow's foot notation.

See Also

References

Resources

Acknowledgments