Primary keys only need to be unique; there are many characteristics they often have but don't need to have
There is often a misunderstanding about primary keys in database tables. A primary key identifies a row uniquely, so the only property it has to satisfy is that it must be unique.
Primary keys don't have to be numbers. But if they are, here are a few properties these numbers don't have to have:
- Numerical PKs don't indicate ordering, i.e. which row was inserted first.
- Looking at the highest PK value doesn't indicate the number of rows created (cardinality).
- They are not guaranteed not to have "holes" in the numerical sequence: e.g. rows 11 and 13 exist, but row 12 doesn't.
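As a tiny illustration (in Java, purely as an example), random UUIDs make perfectly legal primary key values: they are unique, yet they carry no ordering, no information about the row count, and certainly no gap-free sequence.

```java
import java.util.UUID;

public class UniqueOnly {
    public static void main(String[] args) {
        // Random UUIDs satisfy the only real requirement: uniqueness.
        // They indicate nothing about insertion order or cardinality,
        // and there is no meaningful notion of "holes" between them.
        for (int i = 0; i < 3; i++) {
            System.out.println(UUID.randomUUID());
        }
    }
}
```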
In addition to PKs not having to have these properties, there are good reasons why auto-generated PKs actually don't have them in practice.
Many types of systems (for example websites, or the central transaction database of a supermarket) measure their performance not by the duration of an individual transaction (0.10s vs 0.08s makes little difference) but by concurrency, which determines throughput: can you handle 100 or 1,000 transactions per second?
If all system-wide data and locks are stored in the database, then all other parts of the system (web servers, application servers, cash tills) can be independent of one another. If the performance of the system is limited by one of these non-database parts, one can improve the throughput of the system (although not the duration of individual transactions) by increasing the concurrency, i.e. simply adding more of these non-database parts in parallel. (Surprisingly, this is indeed the case in one system I'm building, which simulates many different mobile phone tariffs for a particular phone-usage scenario: the slowest part by far is the application server.)
Increasing the concurrency within the database, on the other hand, is not such an easy task. One can't just split the data up into different databases: unique constraints must be enforced across all the data, referential integrity (foreign keys) must be enforced across all the data, and locks must be system-wide (e.g. "does user X have enough money? First lock, then check, then take the money away, then unlock").
But there are techniques available:
- Is the limiting factor the database CPU? If so, more cores can be added. If that limit is reached, an expensive cluster database can be used. But, in my opinion, the database should be storing and retrieving data, not doing CPU-intensive processing. If the processing is done by the application servers, then they can easily and cheaply be scaled, as stated above.
- Is the limiting factor the disk? One can add more disks, which can physically read and write at the same time. This is really the (concurrency-reducing) contention point in most databases; and because the database is the contention point of the whole system, and concurrency is the objective of scalability, disk contention is the main focus of my work when I do performance consultancy.
So, given that the main contention point is the database, and that multiple parallel CPUs and multiple parallel disks are in use, it would be unfortunate to make e.g. one disk wait for another to finish. It should be possible for both disks to be in use at the same physical moment in time.
To achieve this, one needs to make sure that there is no reason why one transaction (e.g. a purchase by one user) needs to wait on another transaction (e.g. a different purchase by a different user). One needs to use locking where appropriate, but no more. For example, lock the user's balance for the duration of a transaction, so that if two items are charged to the same account at the same time one doesn't get "check balance for transaction A -> OK; check balance for transaction B -> OK; subtract A; subtract B" and find the user had enough money for either A or B but not for A+B together. But two transactions for two different users can be totally independent.
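To make that concrete, here is a minimal JDBC sketch of such a purchase. The table name `account` and the columns `user_id` and `balance` are assumptions for illustration, not taken from any particular system. The `SELECT ... FOR UPDATE` locks only that one user's balance row, so purchases by different users proceed in parallel, while a second purchase by the same user waits until this transaction commits or rolls back.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PurchaseDao {

    /**
     * Charges one purchase to one user's account inside a single transaction.
     * Only that user's balance row is locked, and only for the duration of
     * the transaction.
     */
    public static void charge(Connection db, long userId, BigDecimal amount)
            throws SQLException {
        db.setAutoCommit(false);
        try {
            BigDecimal balance;
            // Lock the row and read the current balance.
            try (PreparedStatement lock = db.prepareStatement(
                    "SELECT balance FROM account WHERE user_id = ? FOR UPDATE")) {
                lock.setLong(1, userId);
                try (ResultSet rs = lock.executeQuery()) {
                    if (!rs.next()) throw new SQLException("No such user: " + userId);
                    balance = rs.getBigDecimal(1);
                }
            }
            if (balance.compareTo(amount) < 0) {
                throw new SQLException("Insufficient funds for user " + userId);
            }
            // Debit the account; no other transaction can have checked the
            // balance of this user in the meantime, because the row is locked.
            try (PreparedStatement debit = db.prepareStatement(
                    "UPDATE account SET balance = balance - ? WHERE user_id = ?")) {
                debit.setBigDecimal(1, amount);
                debit.setLong(2, userId);
                debit.executeUpdate();
            }
            db.commit();            // releases the row lock
        } catch (SQLException e) {
            db.rollback();          // also releases the row lock
            throw e;
        }
    }
}
```

The lock is held exactly as long as the transaction lasts and no longer, and it covers exactly one user's row and no more, which is precisely the "where appropriate, but no more" rule above.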
Imagine if primary keys had to indicate ordering or cardinality, or weren't allowed to have holes. This would have the following consequences:
- If one transaction (e.g. a purchase) needed to create a new row, it would need a new primary key value. It would acquire that primary key value, do some things with it, create the row, maybe do other things such as write some logs, and then "commit" the transaction. If, during this time, a second transaction (e.g. a purchase by a different user) wanted to execute the same purchase procedure, it would also need to acquire a new primary key value. It wouldn't be able to get this value until the database knew whether the first purchase would "commit" or "rollback". So the execution of one transaction would have to wait on the execution of another: one disk would sit idle while the second disk was at work.
- It is sometimes necessary to retrieve a primary key value before creating a new row, and only the database knows the current highest value in the table. Such a "retrieve primary key" request would have to go to the database and wait for the reply. However, if the only requirement is uniqueness, the first such request can retrieve e.g. 20 PK values from the database, and the subsequent 19 requests can be served locally from the rest of that block (see the sketch after this list). If the client crashes, those PK values simply remain unused, leaving a hole.
- What is the "ordering" of two transactions which are written to separate disks, processed by separate CPUs, and which happen at exactly the same time?
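Here is a minimal sketch of the "fetch 20 keys at once" idea from the second point above. The `KeySource` interface and the block size of 20 are assumptions made for the example; this is essentially the "hi/lo" or sequence-cache technique used by some ORMs and databases.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hands out primary key values in blocks of 20, so only every 20th key
 * requires a round trip to the database. The values are unique, but nothing
 * more: if the client stops, the rest of its block is never used, leaving a
 * hole in the sequence.
 */
public class KeyAllocator {
    private static final int BLOCK_SIZE = 20;

    /** Abstraction over the database call that reserves the next block. */
    public interface KeySource {
        long reserveBlock(int size);   // returns the first value of a fresh block
    }

    private final KeySource db;
    private long next = 0;
    private long limit = 0;            // exclusive upper bound of the current block

    public KeyAllocator(KeySource db) {
        this.db = db;
    }

    public synchronized long nextKey() {
        if (next >= limit) {           // block exhausted: one database round trip
            next = db.reserveBlock(BLOCK_SIZE);
            limit = next + BLOCK_SIZE;
        }
        return next++;
    }

    // Tiny demo with an in-memory stand-in for the database sequence.
    public static void main(String[] args) {
        AtomicLong sequence = new AtomicLong(1);
        KeyAllocator keys = new KeyAllocator(size -> sequence.getAndAdd(size));
        for (int i = 0; i < 5; i++) {
            System.out.println(keys.nextKey());  // only the first call hits the "database"
        }
    }
}
```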
So primary keys are just arbitrary values whose only guarantee is uniqueness. And that's good: if they guaranteed more than that, there would be much more inter-transaction contention, one couldn't utilize one's multi-server, multi-disk, multi-CPU environment, and the speed of the whole system would be limited by the speed of the slowest transaction.