Unicode characters in file names
It's amazing that all of the following works:
- Using my terminal program (PuTTY on Windows, set to UTF-8) connected to a Linux computer (terminal settings set to UTF-8), I created a file whose name had various Unicode characters
- The shell allowed me to type those characters (for example, as an argument to vi)
- The standard programs such as "ls", "cat", "vi" seemed to be able to handle these file names
- I checked the file into the Subversion version control system – it worked [1]
- On Windows (XP, NTFS) I checked the file out using TortoiseSVN, and that worked too.
- Windows Explorer showed the file having the Unicode characters in its file name.
- I opened the file in Windows Notepad, which displayed the name correctly in the title bar
That means, for my uses, I can absolutely use Unicode characters in file names. That's a cool situation.
(In this particular situation, the user of my program chooses a "report" from a drop-down of possible report types. Each report has a directory on the disk, with some files in a standard layout inside the directory. There is no additional data and the file names do not need to be localized, so rather than creating an extra config file, which could get out of sync with the directories on the disk, it is much more convenient and normalized to simply scan the directory from the program. The reports are created on Windows, my program runs on Linux, and the communication between the two is the Subversion VCS.)
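As a rough illustration of the "just scan the directory" approach, here is a minimal sketch in Python (not the language my program is written in). The function name list_report_types, the report names and the temporary directory are all made up for the example; the only assumption carried over from above is that every sub-directory counts as one report type.

```python
import os
import tempfile


def list_report_types(reports_root):
    """Return the sorted names of all sub-directories of reports_root.

    Hypothetical helper: each sub-directory is treated as one report type,
    so the drop-down can never disagree with what is on disk.
    """
    with os.scandir(reports_root) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


# Demo with a throwaway directory; the report names are invented.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "Umsatz"))
    os.mkdir(os.path.join(root, "\u00dcbersicht"))  # name contains a non-ASCII character
    open(os.path.join(root, "notes.txt"), "w").close()  # plain file, should be ignored
    report_types = list_report_types(root)

print(report_types)
```

Passing a str path to os.scandir gives the names back as decoded text rather than raw bytes, which is exactly the step that matters once the names contain non-ASCII characters.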
There was one slight problem, which I didn't notice at first: Perl can't read the Unicode file names correctly on Linux. I didn't notice it because, as is often the case with character set problems, there were two errors which cancelled one another out and made it look as if everything had worked. Perl read the file name as a sequence of bytes, so it treated each two-byte UTF-8 character as two separate characters. By default it then wrote each of those "characters" out as a single Latin-1 byte, even though the terminal was UTF-8, so the terminal received the original two bytes and displayed them as the single original character in the file name. In such situations, I find the only way to debug and test is to output the length in characters as a number, because a number cannot be affected by such cancelling-out errors.
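The cancellation is not specific to Perl, and it can be reproduced byte by byte. The following is a minimal sketch in Python rather than Perl, using a made-up file name; it imitates the two errors described above and then shows how the length check exposes them.

```python
# Hypothetical file name: 5 characters, the first one non-ASCII.
name = "\u00e9.txt"

# What is stored on disk: UTF-8 bytes. The first character takes two bytes.
raw = name.encode("utf-8")

# Error 1: every byte is taken to be one (Latin-1) character.
misread = raw.decode("latin-1")

# Error 2: the string is written out as Latin-1, although the terminal expects UTF-8.
written = misread.encode("latin-1")

# The terminal interprets what it receives as UTF-8.
shown = written.decode("utf-8")

# The two errors cancel out: the terminal shows the right name.
print(shown == name)

# The length in characters does not cancel out: the misread name is one too long.
print(len(misread), len(name))

# Decoding the bytes as UTF-8 in the first place gives the correct length.
print(len(raw.decode("utf-8")))
```

Comparing what appears on the terminal tells you nothing here, because the wrong string and the right string produce identical bytes on output. Only the character count differs.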
[1] If you're experiencing problems on the Unix command line while using a UTF-8 terminal, try the command "export LANG=en_US.UTF-8"!