Moving is a Pain

They say moving is one of the higher-stress activities we engage in. It’s certainly been that way for papyri.info, which is moving (along with me and Ryan) to DC3. Let me tell you about how I spent the latter part of my second week in DC3:

On Wednesday morning, I’d planned to get the last of papyri.info’s subsystems back online, namely the automated sync between the Papyrological Editor’s main Git repository and our Github repo. [Note: the domain switchover hasn’t happened yet, so the new system is temporarily at http://dc3-01.lib.duke.edu/.] We get changes to content in the Papyrological Navigator in two ways: first as edits and new submissions by community members via the Editor, which are voted on and then committed to the “canonical” repo; second as fixes pushed directly into our Github repo. The latter are usually things like patches for encoding errors or large-scale markup updates that affect many documents—they generally aren’t scholarly/editorial changes, but rather technical upgrades or fixes. The syncing subsystem does an hourly merge of updates from the canonical and github repos, and then publishes those changes into the PN. I’d waited until everything else seemed to be in good working order before turning it on.

Before deploying the sync application, I began by doing a manual merge. The PN data is in a repo named idp.data, so to do the merge, I first ran a git pull github master (master being the name of the branch), then a git pull canonical master. “Git pull” fetches changes from a remote repository and merges them into the current repo. Then I ran a git push github master, which pushed the merged changes back to Github, and finally a git push canonical master. This last triggered an automatic garbage collection routine. That is not an unusual occurrence, as Git likes to periodically compact the objects in a repo into archives called “packs”. This keeps repo size down and allows for faster access to objects. Git objects are mainly commits, blobs (files), and trees (directories). So I waited for the gc to complete, and at the end I got an error message. Something had gone wrong. I switched to the canonical repo and was dismayed to find that it was completely broken. Git repos keep their packs in the objects/pack/ directory. There were no packs in the canonical repo. The gc run had nuked it! It was completely useless in this state, and I didn’t have a backup of it in its most recent form. Fortunately, because I’d just pulled from it to idp.data, that repo had a copy of its state, and I could just make a clone of it and make a couple of configuration changes to fix it, but that didn’t change the fact that a runaway Git process had just destroyed it. What if that happened elsewhere? Automated garbage collections are triggered whenever Git feels a repo needs to be compacted.

I made a backup copy of my repaired canonical.git repo and ran a git gc --no-prune, manually invoking the process that had killed the old copy. Sure enough, git did a whole bunch of work (I could see using top that it was using a lot of CPU and memory) and then finished by deleting everything in the pack/ directory. Ugh. I took the editor offline for the rest of the day lest it clobber any user repos by accident (and hoped that hadn’t happened already). Then I started the process of figuring out what went wrong.
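A check like the one I did by eye (are there still packs in objects/pack/ after the gc?) can be sketched in a few lines. This is a minimal sketch, not part of the editor: the function names list_packs and packs_survived are made up for illustration, and it assumes a bare repo whose objects/ directory sits directly under the path you pass in.

```python
import os
import tempfile


def list_packs(git_dir):
    """Return the sorted names of the .pack files under <git_dir>/objects/pack/."""
    pack_dir = os.path.join(git_dir, "objects", "pack")
    if not os.path.isdir(pack_dir):
        return []
    return sorted(name for name in os.listdir(pack_dir) if name.endswith(".pack"))


def packs_survived(before, after):
    """Return False only in the bad case: there were packs before and none after."""
    return not (bool(before) and not after)


# Intended use (hypothetical path, shown as comments so nothing runs here):
#   before = list_packs("/path/to/canonical.git")
#   ... run the gc on a backup copy ...
#   after = list_packs("/path/to/canonical.git")
#   if not packs_survived(before, after): restore the backup
```

It only looks at file names, so it says nothing about whether the surviving packs are valid, just whether they vanished.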

My first thought was that maybe there was something wrong with our install of Git. This would be pretty unlikely, as it was set up in a standard way, but it was clear that Git itself was doing the damage, so I had to start somewhere. Git is not a monolithic executable; it’s actually a collection of programs written in C, Perl, and shell script. When I called git gc, the git executable launched the git-gc executable (a C program), which launched a git-repack shell script, which in turn called a git-pack-objects executable (another C program). After poking around in the Git source code (which is hosted on Github), I was able to see what was going wrong: git-repack expects git-pack-objects to return a list of new pack files. It then moves those into place and deletes the old ones. In our case, git-pack-objects was returning nothing (not an expected result) and then git-repack was deleting all the old pack files, and with them most of the repo’s history. This did indeed look like a bug in Git! Running strace on git-pack-objects showed that it was doing lots and lots of file accesses and then exiting normally, albeit without creating a pack file at the end. I was able to disable the automatic cleanup in git-repack by modifying the shell script, so we could bring the editor back online and not have people’s work destroyed by an automatic gc.
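The failure mode is easier to see in miniature. The sketch below is not Git's code and not the change I made to the shell script; it is a made-up Python function (swap_packs, with a hypothetical staging directory) that mimics the "move new packs into place, delete the old ones" step and adds the guard that was missing in our case: if the packing step hands back nothing, nothing gets deleted.

```python
import os
import shutil
import tempfile


def swap_packs(pack_dir, new_pack_names, staging_dir):
    """Move freshly written packs into pack_dir, then delete the old ones.

    Refuses to delete anything when the packing step produced no packs.
    """
    if not new_pack_names:
        raise RuntimeError("no new packs were produced; keeping the old ones")
    # Check every new pack really exists before touching anything.
    for name in new_pack_names:
        if not os.path.isfile(os.path.join(staging_dir, name)):
            raise RuntimeError("missing staged pack: " + name)
    old_names = os.listdir(pack_dir)
    for name in new_pack_names:
        shutil.move(os.path.join(staging_dir, name), os.path.join(pack_dir, name))
    # Only now is it safe to remove what the new packs replace.
    for name in old_names:
        if name not in new_pack_names:
            os.remove(os.path.join(pack_dir, name))
    return sorted(os.listdir(pack_dir))
```

Called with an empty list, it raises and leaves the old packs alone, which is the behaviour I wanted from the repack step that afternoon.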

Oh, did I mention that Josh was going to be demoing it at a conference on Thursday and Friday? No? Bad enough that papyrologists all over the world were waiting for me to fix this, but now I had an opportunity to make DC3 look bad in week 2 of its existence!

Ryan remembered that something like this had happened before, in the early days when SoSOL was being developed at UKY. It had something to do with the way our repos are configured and the way they were stored on the file system. In order to save space, certain user accounts are linked as “alternates” to the canonical repo. That is, they may contain data the repo needs to maintain its history. When the canonical repo does a garbage collection, it has to go look through all of these alternate repos for objects, which is why it’s such an intensive operation: it does a lot of file I/O when it runs. At NYU, all of our data was located in a /data/papyri.info directory, and various parts of the application are configured to look there for files. At Duke, systems are configured so that extra storage is mounted at /srv. So when I was setting up the application, I just made /data a symlink to /srv/data, so we wouldn’t have to change any configuration or look for places where “/data/papyri.info” might be hard-coded. Maybe the symlink was causing the problem, so I changed all the alternates configuration files to point at the absolute, non-symlinked locations of the alternate repos and tried again. Still no joy.
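For anyone who hasn't met alternates: Git records them in a plain text file, objects/info/alternates, with one path to another object directory per line, and relative paths are taken relative to the objects directory. Here is a rough sketch of reading that file. The function name read_alternates is made up, it assumes a bare repo laid out as <git_dir>/objects, and it only skips blank lines rather than trying to mirror every detail of Git's own parser.

```python
import os
import tempfile


def read_alternates(git_dir):
    """Return the alternate object directories listed for the repo at git_dir."""
    objects_dir = os.path.join(git_dir, "objects")
    alt_file = os.path.join(objects_dir, "info", "alternates")
    if not os.path.exists(alt_file):
        return []
    alternates = []
    with open(alt_file) as handle:
        for line in handle:
            entry = line.strip()
            if not entry:
                continue  # ignore blank lines
            if not os.path.isabs(entry):
                # Relative entries are resolved against the objects directory.
                entry = os.path.join(objects_dir, entry)
            alternates.append(os.path.normpath(entry))
    return alternates
```

Note that normpath only tidies up the string. It does not follow symlinks, so a path under /data and one under /srv/data still come out looking like two different places.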

My breakthrough came when I was able to replicate the problem on my own laptop. It seems that if git-pack-objects is executed in our repos without an environment variable named GIT_DIR (that is, the location of the repo git is supposed to be working on) properly set, it will do a lot of work, but exit without creating a new pack file. A bit more investigating and trying things out revealed that it worked if you’d cd’ed into the “real” directory (under /srv/data/papyri.info/…), but not if you’d gone to /data/papyri.info. I didn’t dig deep enough into the git-pack-objects code to know precisely where the bug is, but now I knew what was happening and how to fix it. I changed a couple of configuration files in the editor, rebuilt, and redeployed it. The fix took a minute or two, and the problem was all down to what directory I was working in!
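One way to make sure a child Git process always gets a GIT_DIR, and gets the "real" path rather than the symlinked one, is to build its environment explicitly. This is a sketch of the idea rather than what the editor actually does: git_env is a made-up helper, and resolving the path with os.path.realpath first is my assumption about what makes the difference.

```python
import os
import tempfile


def git_env(repo_path, base_env=None):
    """Return a copy of the environment with GIT_DIR set to the resolved repo path."""
    env = dict(os.environ if base_env is None else base_env)
    # realpath follows symlinks, so /data/... would become /srv/data/... here.
    env["GIT_DIR"] = os.path.realpath(repo_path)
    return env


# Intended use (hypothetical path, shown as comments so nothing runs here):
#   import subprocess
#   subprocess.run(["git", "gc"], env=git_env("/data/papyri.info/canonical.git"))
```

The caller's environment is copied rather than modified, so nothing else in the process sees the change.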

Bug-fixing is often like this. It involves a combination of trying different approaches to make things work, working to isolate the source of the problem, working to understand what’s actually happening, and finding a fix. It’s not uncommon for there to be time pressure as it’s happening. Therefore, you’re more focussed on getting things working again than finding out precisely what’s going on. Just doing things a different way may work. Maybe you scrap one program (or version of a program) in favor of another. Or maybe you have to patch an existing program. Or perhaps you have to dig down to the root cause to understand exactly what’s happening. It is a humbling and often frightening experience, because you start out with no clue where to begin, and you may flail around cluelessly for a long time, and you probably have impatient, important people looking over your shoulder (perhaps literally). It is crucial to frame the things you do at this stage as experiments. Make backup copies, try things, roll them back to their original state when they fail. Don’t leave the cruft of your failed tests lying around to make things more confusing later on. Remember what you’ve tried and how. Stay calm.

A bug-hunting investigation is different from a scholarly investigation in that finding root causes is not your prime motivator. Making things work is. But you do have working hypotheses as you go, and you try to validate or invalidate those. You have to be careful to rely on the evidence and not to cling to hypotheses just because you like them nor to discard them just because you don’t. This week, my list of hypotheses was something like:

1. Our Git install is broken, or partially broken.
2. We’re hitting some low-level quirk of the OS or, worse, the Virtual Machine it runs on.
3. Having the /data directory be a symlink is breaking everything.
4. Git-pack-objects is broken.
5. The GIT_DIR environment variable is getting clobbered, or is somehow unavailable to the child process git-pack-objects.

In this case, I’ve read enough of the Git source code and experimented enough to understand what is happening, but not precisely why. Knowing why could in theory lead to submitting a patch to Git, but I almost certainly won’t have time to dig that far. I do have a final hypothesis that I’m unlikely to have the chance to prove:

There’s nothing at all wrong with Git. Our chain of alternates is circular, since user repos all have canonical as an alternate, and the symlink thing means that git-pack-objects doesn’t realize the canonical alternate it finds in each user repo is in fact the same as its working repo. Therefore it thinks all the objects it might want to pack are already present somewhere else and it doesn’t bother to do anything.

I think this is almost certainly right, and a fix to the Git code would at most entail making it a bit more careful about checking directory identity.
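"Checking directory identity" is cheap to do outside Git, too. The sketch below has nothing to do with how git-pack-objects compares paths (I haven't read that far); it just shows, with a made-up helper name, that two path strings can differ while the operating system agrees they are the same directory.

```python
import os
import tempfile


def is_same_object_store(path_a, path_b):
    """Return True when both paths lead to the same directory on disk."""
    try:
        # samefile compares device and inode, so it sees through symlinks.
        return os.path.samefile(path_a, path_b)
    except OSError:
        # One of the paths does not exist, so they cannot be the same place.
        return False
```

If my hypothesis is right, a comparison of this kind between the working repo's objects directory and each alternate would have flagged the canonical repo showing up as its own alternate by way of /data.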

But in the end, knowing the what is good enough. I can work around it by just not using that symlink anymore, or by making certain that the GIT_DIR variable is always set. I am paranoid though, so the symlink will be removed forever, just in case. After all, the Application Developer’s Motto is “Nuke the site from orbit. It’s the only way to be sure.”
