COUNT(*)
vs COUNT(pk_col)
A while back I was doing some performance tuning on MySQL 5 for a customer. A SELECT was counting the number of rows in the table. I always use COUNT(*) for that but I know a lot of people, including the customer, use COUNT(pk_col). The query was taking a long time (a few minutes). I analyzed that the problem came from the usage of COUNT(pk_col) instead of COUNT(*). With COUNT(*) it was instantaneous.
I didn't know that there was a difference between the two. There is no difference in their semantics, therefore it didn't occur to me that there might be a difference in the way they were executed.
Just to recap the SQL syntax:
- COUNT(DISTINCT col) counts the number of distinct values that are in "col"
- COUNT(col) counts the number of rows where col is not null.
- COUNT(*) counts the number of rows in a table
It follows that for columns which don't have any "null" values in them, COUNT(col) must be the same as COUNT(*). For any column marked "not null", such as a primary key, this is always the case. And thus it follows that COUNT(pk_col) must always deliver an identical result to COUNT(*).
However, the database in question was executing a COUNT(pk_col) query and a COUNT(*) query differently.
- For the COUNT(*) queries it was simply counting the number of rows in the table (taking constant time and not requiring reading the rows from disk)
- For the COUNT(pk_col) it was going through all the rows, presumably checking if the pk_col was in fact null, and counting the number of rows where it wasn't, thus requiring a lot of disk access.
Alas I didn't write down the output from the EXPLAIN statement and I've just tried it on some other databases (using MySQL 4.1 and 5.0) and was unable to reproduce this behaviour. So this is a bit of a useless blog entry! Nevertheless – you must believe me – it did happen!
So the conclusion is, one should always use COUNT(*) and never COUNT(pk_col).