Celko on SQL: Natural, Artificial and Surrogate Keys Explained - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Software // Information Management
10:44 PM

Celko on SQL: Natural, Artificial and Surrogate Keys Explained

SQL programming guru Joe Celko offers a classification scheme and advice on using the right keys.

Joe Celko
Joe Celko
There is no such thing as a "universal, one-size-fits-all" key. Just as no two sets of entities are the same, the attributes that make them unique have to be found in the reality of the data. I invented a classification scheme years ago that seems to be catching on (see table below). You can decide on the kind of key you want to use based on the nature of your particular situation.

1. A natural key is a subset of attributes which occur in the data model and act as a unique identifier. They are seen by the user. You can go to the external reality and verify them. You would also like to have some validation rule, such as look-ups and check digits in a UPC code (See "QUOTE" and reread the parts about validation and verification if you have questions).

2. An artificial key is an extra attribute added to the table which is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself. It's up to the DBA to maintain a trusted source for them. For example, a user can assign the open codes in the UPC scheme to products. In the grocery business, this might be where you find bread baked in the store, rather than bread bought from a national bakery chain. The check digits still work the same way to validate the barcodes, but you have to verify them inside your own enterprise.

Another technique is to take a natural key that is made up of many known attributes and put them into a formula to generate a key -- a hashing function. There are several downsides to this approach. The hashing function has to be perfect (that means no collisions or duplicate values from different inputs), and you still have to guarantee the uniqueness of the subset of attributes that make up the natural key. You will sometimes see an artificial key called a "surrogate key," but this is wrong according the definition of a surrogate given by [relational database pioneer] Dr. Edgar F. Codd.

If you have to construct a key yourself, it takes time to design them, to invent a validation rule, set up audit trails, etc. It's not easy work.

3. An "exposed physical locator" is not based on attributes in the data model and is exposed to the user. There is no reasonable way to predict it or verify it, since it usually comes from the physical state of the hardware at the time of data insertion. The system obtains a value thru some physical process in the storage hardware totally unrelated to the logical data model – for example, IDENTITY columns or other proprietary, non-relational auto-numbering devices.

Technically, these are not really keys at all, since they are attributes of the physical storage and are not even part of the logical data model. But they are handy for lazy, non-RDBMS programmers who do not want to research or think! This is the worst way to program in SQL.

4. A surrogate key is system-generated to replace the actual key behind the covers where the user never sees it. It's based on attributes in the table. Example: Teradata hashing algorithms, indexes, pointer chains, ADABASE numbers, etc.

The fact that you can never see it or use it for DELETE and UPDATE or create it for INSERT is vital. If users can get to the surrogate key, they will screw up the data integrity by getting the real keys and these physical locators out of synch. The system must maintain them.

In "ACM Transactions on Database Systems," Dr. Codd wrote that "…database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them..."

This means that a surrogate ought to act like an index, hash table, bit vector or whatever; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as permanent surrogates for entities:

1. The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

2. Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

3. It may be necessary to carry information about an Entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree)."

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution is to introduce entity domains that contain system-assigned surrogates.

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
1 of 2
Comment  | 
Print  | 
More Insights
InformationWeek Is Getting an Upgrade!

Find out more about our plans to improve the look, functionality, and performance of the InformationWeek site in the coming months.

How SolarWinds Changed Cybersecurity Leadership's Priorities
Jessica Davis, Senior Editor, Enterprise Apps,  5/26/2021
How CIOs Can Advance Company Sustainability Goals
Lisa Morgan, Freelance Writer,  5/26/2021
IT Skills: Top 10 Programming Languages for 2021
Cynthia Harvey, Freelance Journalist, InformationWeek,  5/21/2021
White Papers
Register for InformationWeek Newsletters
Current Issue
Planning Your Digital Transformation Roadmap
Download this report to learn about the latest technologies and best practices or ensuring a successful transition from outdated business transformation tactics.
Flash Poll