Primary and Foreign Keys

02 Feb 1996 updated



Why primary keys are superfluous

Why do we need primary keys? What part do they play in relational theory, entity-relationship modeling, and relational database management systems? Of course everyone knows that primary keys are not only important but absolutely necessary. In the acknowledged bible of relational theory, {Codd90} says so on page 35 (RS-8):

"For each and every base R-table, the DBMS must require that one and only one primary key be declared."

Unfortunately the good doctor does not directly explain why a primary key is required. Let's explore this a bit. We know that any primary key must also be a candidate key (see Relational keys) so it is not enough to say that the primary key uniquely identifies every row - so does every candidate key.

Consider this little example above - a table of the states in the USA. The first column is the full name of the state - which is non-null, unique, and therefore a candidate key. The second column is the two letter state code assigned by the US Postal Service many years ago as an artificial identifier for address abbreviation and mail handling. Therefore it is also non-null, unique, and a candidate key. The third column illustrates the concept of a system assigned row ID number, something which many database designers use frequently. Being machine assigned, it is also non-null, unique, and a candidate key.

Clearly, each of the three columns in this table, or any two of them, or all three together, constitute a candidate key and could be used as the primary key. (Date states that a primary key must also be "minimal" but not everyone agrees) Which candidate to choose as the primary key is purely artificial - a decision made by the designers for their own good reasons. {Date95} says on page 115 "If the set of candidate keys actually does include more than one member, then the choice of which is to be primary is essentially arbitrary." But if there is no basis in "Codd's law" for choosing a primary key, what then is the need to have one?

One could argue that a primary key is necessary as the target of any foreign keys which refer to it from other tables. A foreign key is a set of one or more columns in any table (not necessarily a candidate key, let alone the primary key, of that table) which may hold the value(s) found in the primary key column(s) of some other table. So we must have a primary key to match the foreign key.

Or must we? The only purpose of the primary key in the primary key/foreign key pair is to provide an unambiguous join - to maintain referential integrity with respect to the "foreign" table which holds the referenced primary key. This insures that the value to which the foreign key refers will always be valid (or null, if allowed).

But can't we join unambiguously to any candidate key? Our simple table of states has three candidate keys. Why is the first (or second) any better than the third for maintaining referential integrity? Page 118 of {Date95} says "A foreign key value represents a reference to the tuple containing the matching candidate key value (the referenced tuple or target tuple)." So a primary key may not be necessary for referential integrity. Consider the following:

In fact, subtly contradicting himself, on page 41 (RS-14) {Codd90} says:

"One if the greatest advantages of the relational approach us that it supports joins of all kinds, whether based on keys or not."

and {Date95} makes the point very clear on page 117:

"Before going any further, we must point out that once again we find ourselves in an area where there is a certain amount of controversy. The relational model has historically required that foreign keys match, very specifically, primary keys, not just candidate keys [...] And previous editions of this book, as well as other publications by this writer [...], have supported this position. Our current position, by contrast, resembles our current position on primary keys in general: Requiring foreign keys to match primary keys specifically is probably a good idea in many cases - even in most cases - but it cannot be justified in all cases, unequivocally."

Some of you are probably saying impatiently "That sounds clever in a theoretical sort of way but RDBMS products really do need primary keys to enforce referential integrity." No, they don't. Let's check it out.

In general, RDBMS vendors implement either declarative or procedural RI. The declarative method (e.g., Oracle 7) employs a DDL statement which declares a foreign key constraint naming the FK columns and refering to the target table which holds the primary key. In some products, the exact PK columns can be named but in others the PK is plucked from the system catalog based on prior declaration. Either way, the exact implementation of the FK/PK validation is "under the covers" so we seem stuck with the RDBMS vendors implicit requirement of a PK to which the FK can refer.

In other RDBMS flavors (e.g., Sybase pre System 10), referential integrity is supported procedurally - through procedural code fired in triggers. Here lies a universal approach (to the extent that most databases now support triggers) to freeing ourselves from the tyranny of primary keys. If you inspect, for example, the INSERT trigger code used to maintain RI, you will find no magic. It contains a standard SELECT statement which compares the value(s) in the FK column(s) to those in some specifically named column(s) in the "primary key" table. It is not necessary that those target columns had been previously declared as "primary keys".

We concur with and defer to C. J. Date: on page 116 of {Date95} he says "Our current position on such matters is as follows: Choosing one candidate key (in those cases where there is a choice) as the primary might be a good idea in many cases - even in most cases - but it cannot be justified in all cases, unequivocally."


Where to go from here:


Copyright © 1996 Applied Information Science International