A certain firewall I know habitually chops off the last few bytes of some responses. It’s usually noticeable when large binaries are downloaded. Installers may fail to run (fortunately most of them contain an integrity check), and ZIP files may refuse to open. In the old DOS days, one could attempt a repair of such a ZIP file with a program called PKZIPFIX. Most modern archive managers do not include repair functionality. (It should be noted that repair can take two approaches: single-bit repairs using redundant recovery information; and salvaging of the unaffected files in an archive. Some archive formats, such as RAR, can carry recovery information, but only the latter approach is applicable to ZIP files.)
Fortunately, the last few bytes of a ZIP file aren’t actually all that important. A ZIP file consists of two parts:
- A sequence of individually compressed files, each with its own header.
- A “central directory”, repeating all the header information in a compact format.
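This two-part layout is easy to see in practice: local file headers (signature `PK\x03\x04`) and compressed data come first, the central directory entries (signature `PK\x01\x02`) cluster at the end, followed by the end-of-central-directory record (`PK\x05\x06`). These signatures come from the ZIP specification; the archive below is built in memory purely for illustration:

```python
import io
import zipfile

# Build a small in-memory ZIP file to inspect.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('a.txt', 'hello ' * 100)
    zf.writestr('b.txt', 'world ' * 100)
data = buf.getvalue()

# Standard ZIP signatures, per the format specification:
local_header = data.find(b'PK\x03\x04')  # first local file header
central_dir = data.find(b'PK\x01\x02')   # first central directory entry
eocd = data.find(b'PK\x05\x06')          # end-of-central-directory record

# The file data comes first; the central directory sits at the end.
print(local_header, central_dir, eocd)
```

The first local header sits at offset 0, and both directory structures are found after all the file data.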
In the case of my broken download, a 30 MB archive containing 3000 files has about 1% overhead for the central directory. So up to 300k can be chopped off without affecting the actual file data. (And, in theory, the ZIP file could have been another 1% smaller.)
The purpose of the central directory is to provide an efficient way to list the contents of the archive and to extract any individual file. Modern archive managers also treat it as the first thing to read when a ZIP file is opened; if it’s not intact, the archive is deemed unrecognisable.
Even when the central directory is corrupted, all the file data is still there. The only problem is that the programs for extracting it rely on the central directory, rather than the file headers, to locate it.
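This is easy to demonstrate: chop a few bytes off the end of a valid archive, and the standard library refuses to open it, even though every byte of compressed file data is untouched. A quick sketch (the archive is built in memory purely for illustration):

```python
import io
import zipfile

# Build a small valid archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('doc.txt', 'some file contents ' * 50)
data = buf.getvalue()

# Truncate the last 10 bytes, as the faulty firewall might.
truncated = data[:-10]

# Only the end-of-central-directory record is damaged, but the
# archive is rejected outright.
try:
    zipfile.ZipFile(io.BytesIO(truncated))
    broken = False
except zipfile.BadZipFile:
    broken = True

print(broken)
```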
A look at the code for Python’s zipfile module confirms that it uses the central directory to build its list of files in the archive. However, it also contains useful snippets of code for parsing the file headers individually. With access to zipfile’s internal structures, and a bit of copying and pasting, we can write a simple loop that processes the files in order, without reading the central directory.
```python
while True:
    # Read and parse a file header (copied from zipfile)
    fheader = f.read(zipfile.sizeFileHeader)
    fheader = struct.unpack(zipfile.structFileHeader, fheader)
    fname = f.read(fheader[zipfile._FH_FILENAME_LENGTH])
    if fheader[zipfile._FH_EXTRA_FIELD_LENGTH]:
        f.read(fheader[zipfile._FH_EXTRA_FIELD_LENGTH])
    print 'Found %s' % fname

    # Fake a zipinfo record
    zi = zipfile.ZipInfo()
    zi.compress_size = fheader[zipfile._FH_COMPRESSED_SIZE]

    # Read the file contents and save to disk
    zef = zipfile.ZipExtFile(f, zi)
    data = zef.read()
    write_data(fname, data)
```
This is a bit hacky, so a few sanity checks are in order: check that the uncompressed data has the length recorded in the header, and compute its CRC32 and compare that with the header too. We also want to confirm that all the files were recovered, so there is a successful termination condition: if we read the start of the central directory, we know all files have been processed. If we reach the end of the archive without encountering it, then some files at the end of the archive were probably lost.
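The loop above is Python 2 and leans on zipfile’s internals. As a rough, self-contained Python 3 sketch of the same idea with the sanity checks added, one can parse the local headers directly with struct and zlib instead (the `recover` name and the struct-based parsing are my own, not part of the original program):

```python
import binascii
import struct
import zlib

# Local file header layout from the ZIP specification (30 fixed bytes).
LOCAL_HEADER = struct.Struct('<4sHHHHHIIIHH')

def recover(f):
    """Walk local file headers in order, yielding (name, data) pairs.

    Returns normally when the central directory signature is reached
    (all files recovered); raises if a size or CRC check fails or if
    sync is lost.  Assumes no data descriptors (header flag bit 3),
    the common case for archives written in one pass.
    """
    while True:
        raw = f.read(LOCAL_HEADER.size)
        sig = raw[:4]
        if sig == b'PK\x01\x02':   # start of central directory:
            return                 # every file was recovered
        if sig != b'PK\x03\x04':
            raise ValueError('lost sync: some files are missing')
        (_, _, flags, method, _, _, crc, csize, usize,
         namelen, extralen) = LOCAL_HEADER.unpack(raw)
        name = f.read(namelen).decode('utf-8')
        f.read(extralen)           # skip the extra field
        blob = f.read(csize)
        if method == 8:            # deflate, raw stream (no zlib header)
            data = zlib.decompress(blob, -15)
        else:                      # method 0: stored
            data = blob
        # Sanity checks: length and CRC32 must match the header.
        if len(data) != usize or binascii.crc32(data) != crc:
            raise ValueError('corrupt data in %s' % name)
        yield name, data

# Demo: build an archive, truncate the end, and recover it.
import io, zipfile
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('x.txt', 'abc' * 1000)
broken = io.BytesIO(buf.getvalue()[:-15])  # EOCD record destroyed
files = dict(recover(broken))
print(sorted(files))
```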
The full (but still extremely hacky!) program is in my project repository on Google Code.
Of course, my program only fixes ZIP files where the damage is confined to the central directory. My understanding of tools like PKZIPFIX is that they scan the entire archive for plausible file headers (those starting with the signature PK\x03\x04). That is more exhaustive than my approach, but it would not be too hard to implement.
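A minimal version of that scan might look like the sketch below: search the raw bytes for every occurrence of the local-header signature, rather than trusting each header’s recorded sizes to find the next one. (Each candidate offset would still need validating, since the four signature bytes can in principle occur by chance inside compressed data.)

```python
import io
import zipfile

# Build a small archive to scan (stands in for a damaged download).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('one.txt', 'first file ' * 40)
    zf.writestr('two.txt', 'second file ' * 40)
data = buf.getvalue()

# Find every occurrence of the local-file-header signature.
offsets = []
pos = data.find(b'PK\x03\x04')
while pos != -1:
    offsets.append(pos)
    pos = data.find(b'PK\x03\x04', pos + 1)

print(offsets)  # one candidate header per file
```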