Store data files in one big directory
A server filesystem can store millions of files in one directory. Unless you need to store more files than that, there is no need to artificially introduce an extra layer of directories just to keep the number of files per directory down. We kept millions of files in one directory at uboot.
Often one needs to store lots of files in a filesystem. For example, at uboot.com each user has a "nickpage", and each of these nickpages requires a file, and there are about 4M users.
It is tempting to use a hierarchical directory structure for this: for example, take the last two digits of the nickpage-ID and create a directory for each value, then in each of these directories create directories for the 3rd/4th last digits of the nickpage-ID, and store the nickpage file in there. The 4,000,001st nickpage might then be stored in a file like /var/nickpages/01/00/4000001.xml.
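For illustration, a rough sketch of how such a hierarchical path might be derived from the ID (the base directory and padding details are just an example):

    import java.io.File;

    // Sketch of the hierarchical scheme described above; illustrative only.
    public class HierarchicalPath {
        public static File nickpageFile(long id) {
            String digits = String.format("%04d", id);                                    // at least 4 digits
            String last2 = digits.substring(digits.length() - 2);                         // e.g. "01"
            String next2 = digits.substring(digits.length() - 4, digits.length() - 2);    // e.g. "00"
            return new File("/var/nickpages/" + last2 + "/" + next2, id + ".xml");
        }

        public static void main(String[] args) {
            System.out.println(nickpageFile(4000001));    // /var/nickpages/01/00/4000001.xml
        }
    }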
But this decision, seemingly so obvious, contains an implicit assumption which is wrong: that one cannot store more than a few hundred files in a directory.
In 2000 a consultant for Tru64 UNIX told us that one should store no more than one million files in a directory. We stored (I think) about 0.5M files per directory, and it worked fine. We had over 1M page impressions per day. Modern-day hardware (CPUs, disks) is much faster.
- Tru64 used a tree structure for the mapping between filenames and the information about each file, similar to a normal index on a database table. (And database tables clearly support more than a few hundred rows!)
- Solaris ZFS uses a hash structure to identify files. This is even quicker in some respects (the tree doesn't have to be traversed) and requires fewer locks (operations on two files do not share any parent tree nodes), although no doubt re-hashing has to be done if the directory grows beyond a certain size (in contrast to the tree approach).
I think one needs to do the following things with a directory containing data files:
- Find a particular file. In this case it is easier, in terms of programming, and faster for the OS, to deal with a flat structure than with one containing lots of intermediate directories.
- Do an operation on all files, for example search for particular content within all files. In that case the time is spent looking into the files; the structure of the files in the directory does not matter.
Having worked with directory structures both in terms of programming and in terms of operations and live bug-fixing, I can say that it really is simpler to have simple directory structures, and it really does work in production. Being able to vi id.xml or new File(directory, id+".xml") is simpler.
Using intermediate directories is really just doing in programming what the OS does for you anyway.
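For illustration, a minimal sketch of both operations from the list above against a flat directory (the directory path, method names and search string are just examples):

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    // Sketch of the two operations against a flat directory; illustrative only.
    public class FlatNickpages {
        static final File DIR = new File("/var/nickpages");

        // Find a particular file: one line, no bucket directories to compute.
        static File nickpageFile(long id) {
            return new File(DIR, id + ".xml");
        }

        // Do an operation on all files: the time is spent reading the files
        // themselves, not traversing any directory structure.
        static void printFilesContaining(String needle) throws IOException {
            for (File f : DIR.listFiles((dir, name) -> name.endsWith(".xml"))) {
                String content = new String(Files.readAllBytes(f.toPath()));
                if (content.contains(needle)) System.out.println(f.getName());
            }
        }
    }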
P.S. I like padding IDs with zeros, for example 0004000001.xml; this means that files are always listed in numerical order if you sort them alphabetically. Although I assert this is something one rarely wants to do: it takes a long time if you store files flat, and isn't possible at all if you store files hierarchically.
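A minimal sketch of that padding, assuming ten-digit IDs:

    // Pad the ID to a fixed width so that alphabetical order equals numerical order.
    public class PaddedFilename {
        public static String filename(long id) {
            return String.format("%010d", id) + ".xml";   // 4000001 -> "0004000001.xml"
        }
    }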