How to change, e.g., a Java package name across all revisions while preserving Git history
Recently a customer requested that a piece of software should "move companies" from company A to company B, henceforth appearing as if it had always been created by company B, and never appearing to have been produced by company A at all. (Would it surprise you if I said this isn't the first time I've received such a request?)
The simplest way would have been to rename everything in the source code, copy it to a new directory, `git init`, check it in, and throw away the old repository.
I'm a big fan of history in version control systems. Is there a way to change the company's name while preserving history?
After a while of intense Gitting, this is what I came up with. I publish it here in the hope that it will be useful to someone :-)
```
git filter-branch -f --prune-empty --tree-filter '
    find . \
        -name .git -prune -o \
        -exec sh -c "file {} | grep -q text" \; \
        -exec sed -i "" \
            -e "s/Old Company/New Company/g" \
            -e "s/com.oldcompany/com.newcompany/g" \
            {} \; \
    && mv src/main/java/com/oldcompany src/main/java/com/newcompany
'
```
WARNING: This command was written on a Mac. Take out the `""` after the `-i` option if you're on Linux.
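If the same script has to run on both systems, one workaround is to wrap the incompatible flag in a small function; a minimal sketch (the `sed_inplace` name and the `/tmp` path are made up for illustration):

```shell
# Pick the right in-place flag: BSD sed (macOS) needs an explicit empty
# backup extension, GNU sed (Linux) must have none.
if [ "$(uname)" = "Darwin" ]; then
    sed_inplace() { sed -i "" "$@"; }
else
    sed_inplace() { sed -i "$@"; }
fi

printf 'Old Company\n' > /tmp/sed-demo.txt
sed_inplace -e "s/Old Company/New Company/g" /tmp/sed-demo.txt
cat /tmp/sed-demo.txt   # New Company
```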
This is what's going on:
- `git filter-branch --tree-filter "cmd"` does the following: it goes through your repository one revision at a time. For each revision, it checks that revision out into a brand-new clean directory and runs your `cmd` on that directory. Whatever that directory then contains becomes the new revision. So afterwards you have all the revisions you had before, but each one has been processed/changed.
- Any files it finds are in the new revision. So you can delete files, rename files, change files, and so on; there is no need to `git add` or `git rm` any files.
- `.gitignore` is not respected, but this isn't usually a problem: ignored files weren't checked in in the first place, and thus won't be checked out to the temporary directory.
- Git, as ever, only deals with files, so any empty directories simply get ignored.
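To watch the rewrite happen, here is a sketch on a toy two-commit repository (the paths and the substitution are made up; `FILTER_BRANCH_SQUELCH_WARNING=1` silences the deprecation notice newer Git versions print; drop the GNU-style `sed -i` for `sed -i ""` on a Mac):

```shell
set -e
rm -rf /tmp/tf-demo && mkdir /tmp/tf-demo && cd /tmp/tf-demo
git init -q
git config user.email demo@example.com
git config user.name demo

echo "Old Company" > NOTICE.txt
git add NOTICE.txt && git commit -qm "first"
echo "more text" >> NOTICE.txt
git commit -qam "second"

# Rewrite every revision: the tree-filter runs once per commit in a
# scratch checkout; whatever files remain become the new commit.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --tree-filter \
    'sed -i -e "s/Old Company/New Company/g" NOTICE.txt' HEAD

# Even the first commit's copy of the file has been rewritten:
git show HEAD~1:NOTICE.txt   # New Company
```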
- The `sed` command processes the files; `-i ""` means "in place", with no backup file. (You don't want to write a backup file, as any backup files would get added to the new revision.) The Mac requires the argument after `sed -i` to be the backup extension; Linux requires the extension not to be a separate argument but part of the `-i` option itself. So there's no way to create a command which works on both systems. Slow clap to UNIX.
- Each `-e` option adds a substitution command to run over the file. You can add as many as you like.
- You don't want to go around sedding binary files and corrupting them, so `-exec sh -c "file {} | grep -q text" \;` makes sure that each file is plain text. (The manual page for `file` confirms the output always includes the text `text` somewhere, if it's not a binary file.) This is perhaps the coolest part, as the `file` command is being passed as a string to the `find` command, and the whole `find` command is being passed as a string to the `--tree-filter` option. On one hand it's cool that such a thing is possible. On the other hand, passing commands to other commands as strings isn't a very scalable approach: the `file` command here has a "depth" of 3 (a command called by a command called by a command), but imagine a depth of 100. Imagine if we programmed JavaScript callbacks by passing the function to be executed as a string, escaping `"` as `\"`, then as `\\\"` at the next level, and so on. How far would we get? Not far, I assert. The UNIX philosophy, which Git embraces, is powerful, but I can't help thinking there must be a better syntax for software composition than treating commands as strings. (For example, the syntax used by any other programming language.)
- It is useful to process all text files, not just a whitelist of extensions such as *.java. I was surprised at the number of build files, Vagrantfiles, Dockerfiles, .classpath IDE files, documentation .md files, etc. which referenced the name of the company.
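The `file`-based test can be tried in isolation; a small sketch (the file names are made up):

```shell
# One text file and one binary file.
printf 'hello world\n' > /tmp/sample.txt
printf '\000\001\002\003' > /tmp/sample.bin

# The same check the find command uses: does file's description
# of the file contain the word "text"?
is_text() { file "$1" | grep -q text; }

is_text /tmp/sample.txt && echo "sample.txt is text"
is_text /tmp/sample.bin || echo "sample.bin is binary"
```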
- Have you used Word for documentation, and thus can't process those files automatically? Build a time machine, go back to the point in time when you thought that would work out well, punch yourself in the face, and tell yourself to use LyX or (for short documents only) Markdown. (You'll also thank yourself when you create a feature branch and want to update your doc to document the feature. My current project has a 100+ page LyX doc and it works and merges perfectly. LyX also doesn't cost anything and works on all platforms. Perhaps the coolness of LyX deserves a further blog post, as opposed to languishing in obscurity in these brackets here.)
- The regular expression is `s/old/new/g`, meaning replace `old` with `new` on any line. (That's an approximation; they're regular expressions, so they're more powerful than that.)
- The default in regular expressions is to process every line, but replace only once per line. This is perhaps not the best default? How often do you want the left-most match replaced but none of the matches to its right? The `g` modifier at the end of the regex replaces not only on every line, but every occurrence within every line.
- The `{}` characters are replaced by the current file being processed by `find`. Thankfully this also happens in the middle of command strings; the Mac man page ominously mentions that this is "in contrast to other versions of find". But the Linux man page says the same (albeit without the ominous warning).
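The difference the `g` modifier makes is easy to see on a single line of input:

```shell
# Without g: only the first match on the line is replaced.
echo "aaa" | sed -e "s/a/b/"    # baa
# With g: every match on the line is replaced.
echo "aaa" | sed -e "s/a/b/g"   # bbb
```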
- The `&&` between the commands means the process should abort in case of an error. One might foolishly imagine this would be the default way of joining statements, instead of `;`. I can't imagine why you'd ever want "ignore errors and carry on anyway", let alone have it be the default. Can you imagine how bad it would be if a Java exception caused the rest of the code to execute anyway? Related: don't do `CREATE TABLE b AS SELECT * FROM a; DROP TABLE a;` You'll have an error in the create statement, the drop statement will run anyway, you'll lose all your data and have to restore from backup, and then get a bit stressed when it turns out your ops partner just took a copy of the hard disk's files and not a consistent DB dump despite your explicit instructions, the customer will call you every 5 minutes for a status update, and so on.
- If you have multiple branches you have to add them all at the end of the command. In my case I only have `master`, so that's the default.
- If you have a single branch and mess up, you can `git reset --hard origin/master`. If you have multiple branches you have to do that once per branch. Or maybe the best is just to work in a clean clone and delete it when things go wrong. (Things will go wrong..)
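The difference between joining commands with `;` and with `&&` is quick to demonstrate (the echoed strings are made up):

```shell
# ';' ignores the failure of the first command.
false ; echo "runs anyway"
# '&&' stops the chain as soon as a command fails.
false && echo "never reached"
echo "done"
```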
Good luck with your filtering. What could go wrong?