On “we don’t need these users—let’s move them to an archive table!”
For one of the customers I currently work for, when we first designed the platform in Q1/2000, there was the "account" table, there we stored our users. There were always various pressures to move "inactive" users to a separate "archive" table. I was always against this decision.
In Q4/2005, during a period of my absence, it was decided to implement this decision. A bunch of users were to be deleted, but "not quite", in case we needed their data again. Their data was to be moved from the "account" table to an "account_archive" table.
This was really the worst decision ever made. I said that before, and now I see the consequences. I want anyone who considers such an operation good, to understand the consequences. So I list them here.
- More and more, bosses and business people require we do operations on "all" users, which includes the "account_archive" table. This generally involves a "union" of both tables.
- Now I have to create a real-time data interface to a slave system. This also including archived users. That means I have a "who has changed" table (input queue for the process exporting changed users to the slave system). This table references account_ids, but I can't create an FK from this table to "account", as sometimes an "account_id" references "account" and sometimes "account_archive".
- There are classes which model a User, and this uses the "account" table as the underlying table. This enables me to build logic functions on the User class, and this has been done. However, at the time the class was built, there was only "account", so I can't use this class to model users who are stored in the "account_archive" table. (I'm not going to extend the User object to include the "account_archive" table, that will make this critical code too complex)
- Now I have to allow users to "unsubscribe" from a newsletter, and "archive" people can receive newsletters, if they elected to receive them while they were active. Again, I can't use the User objects to do that. So I have to just program in plain SQL in an fcgi (or create a second class MaybeArchivedUser to model a user which could be in either table, and then duplicate some instance methods – that's what I chose to do).
- It was suggested "maybe we archived the wrong users". But it's nearly impossible to re-create them as the schema is different, and some information has not been kept on in "account_archive". Their nicknames, which are unique amongst active users (but not amongst archive users) might have been reused in the meantime.
It would have been necessary to decide on one of the two courses of action:
- We actually will never need these users again: we delete them
- We might or do need them in the future. In which case we set a special "status" in the "account" table. They can't log in. But we can build User objects. We can re-enable them if necessary. We can even let them log in to some mini-platform where they can do a few things such as delete themselves or request their reactivation.
It has been said that removing users from account "increases performance". However:
- It's more probable that two accounts will be read from the same disk block, after a defragmentation has occurred (did any defragmentation run? I don't think so)
- If there are half the number of users, that's one less binary-index level. If there are 2M users that's 21 index branches. If 1M users that's 20 index branches. Hardly a big saving.
- Although backup (and recovery) no doubt became quicker
- This hardly makes up for the other disadvantages
Splitting a table up into two tables, for "performance", or whatever reason, is never a good thing to do. Add a status flag.