E-mail recovery

What I did this weekend.

I set up an IMAP to GMail system for someone, just like I use. This was working pretty well until he started running out of drive space. E-mail is one of those things that just accumulates, and if you send and receive large attachments it accumulates quite quickly. In our system this is exacerbated by the fact that a GMail labels are treated as IMAP folders; a message will appear in Thunderbird mail files for the labels it has, as well as in Inbox/Sent, and All Mail.

To free up some space he deleted 12,000 older emails from Thunderbird. Unfortunately it was set with something ominously called “auto-expunge”! The result was that when he expunged the items locally, they disappeared from GMail too! This was my fault for not explaining that that’s how it worked (or not anticipating this risk and disabling auto-expunge). It was therefore up to me to make an honest attempt at recovering them.

Of course, since he was trying to reclaim space he’d immediately emptied the Trash folder. So they weren’t there. Nor was there any trace in any of GMail’s folders, Trash included.

There was still hope: Thunderbird stores mail in mailbox files. When a message is deleted, it is marked deleted, but not immediately removed. In fact, deleted messages aren’t removed until the folder is compressed, when the remaining messages are copied to a new mailbox file which replaces the old, bloated one.

Sure enough, the All Mail file was still a couple of GB. Compressing it (after making a backup of it!) reduced it to 200 MB. Counting the From headers showed about 14000 messages in the original and 2000 after compression.

The first step was resetting the flag that indicates a deleted item. I’m not the first person to try to do this. There are detailed instructions for several methods. I picked the method of editing the X-Mozilla-Status headers. I could open a small mailbox in a text editor and use search-and-replace. But for a file this size I needed a different tool. Unix’s sed command seemed to be the trick; I’m an absolute novice with it but managed to come up with:

sed -e 's/X-Mozilla-Status: 1001/X-Mozilla-Status: 0000/g' All\ Mail > All\ Mail2

(I first needed to verify that all the X-Mozilla-Status lines indicated the same status, which they did, luckily. Otherwise I’d have had to put some more complicated regexps in that command!)

Checking the file in Thunderbird shows 14000 messages. But we can’t just replace it into the Mail folders, for two reasons: new mail has arrived in the meantime; and, in any case, I’m not sure manually restoring it in Thunderbird means it will propagate back into GMail too. I wanted to load it into GMail (having done that before with my own mail).

The next problem is that several of the messages in the recovered file weren’t deleted, and still exist in the Mail folders. I needed to extract only the messages that weren’t in the Mail any more. The algorithm for doing this is essentially to pull out the Message-ID header of each side and match them up. Again this is hard to do with such large files. And mail parsing is complicated; processing that much mail we’re bound to encounter several edge cases that only a mature mail parsing library will handle.

Last year I was working on a project to load mail into a database, using Python’s email module. That module could probably be used to implement the algorithm directly; but I had already written a program called import_mail that would import the mail into a reasonably well-normalised database. Once it was there, the mail could be checked and processed using SQL.

I ran the script. The process died with a Python MemoryError exception: it couldn’t handle files that big! I wrote a small script to split the file into even chunks, using the From header. (This is a special header that indicates the start of each message in the file.)

Once the mail was loaded, another problem was apparent. The script had generated new Message-ID headers for several messages, because it didn’t recognise existing ones. What’s more, it had saved these into the original file, which was unexpected! So unexpected, in fact, that it didn’t occur to me to make a backup of the input file. (I wrote that script; it was a long time ago but I think the justification is that the next time a mailbox is imported, any generated headers should be reused so that duplicates aren’t created.)

It turns out it did not replace the original Message-IDs; it just appended a new header to each message. I’m not sure why it didn’t recognise them the first time. It’s something to do with how message headers are supposed to be case-insensitive, I think.

So, once the spurious headers were removed, I could try to find the overlap between the two sets of mail. This was promising: most of the smaller file’s messages were in the recovered file, confirming my guess of what was going on. Several new messages were present in the smaller file; these could be ignored. And some messages had no original ID at all! Short of comparing on other fields (which I’m far too lazy to do), there is no way to match these. So if a message with no ID is present on both sides, a duplicate will appear. But that’s only a serious problem if you have a large number of duplicates.

The messages whose IDs appeared in the set from the smaller file could be deleted. But the remaining messages had duplicates; that’s because some of them had labels and appeared in multiple Thunderbird folders — and that meant they appeared multiple times in All Mail. The duplicates were checked for identical content, and all but one of each set deleted.

This left the promised 12,000 recovered messages in the database. I had a corresponding program called export_mail. In the absence of the kind of database hackery I was doing today, this could recreate any mailbox file that had been loaded into the database, almost identically (the only discrepancies being hidden content sandwiched between message parts). I ran it and it generated a several GB file, as expected.

The idea was to upload it with the GMail Loader, which I’d used last time. This can load a file into either Inbox or Sent. The export_mail program was therefore rerun, filtering on the sender of each message. For good measure I also split the received messages file; loading a 1.5 GB file in one go might strain the system a bit.

As of now the mail is not loaded yet, as I don’t have the password to the user’s account. I will update this post when we have confirmation that this recovery plan has worked!

This entry was posted in Programming and tagged , , , , . Bookmark the permalink.

Leave a comment