DiffPak v0.1 by Bartosz Zaborowski

Download sources

About:

A differential compresor for huge files. Unlike other tools, e.g. xdelta3, it searches for matching data through the whole source file, even if it weights several gigabytes, while using much less memory (with default configuration approx 25x less than the size of the source file). Output files are not compressed, so you can use any compressor you like with great results.
It is quite fast for very similar files (about the speed of hdd, however it reads input files twice) and much slower for different data (some 10-100x slower).

IMPORTANT: with files larger than 1GB it works only on 64-bit systems due to the use of mmap for input files.

Algorithm:

Algorithm is simple. It splits the source file into blocks (2KB by default). It memoizes hashes (truncated md5) of every block of source file, then tries to match blocks of data from dest file with those hashes. If it finds a match - comparison goes in the file-file level until it cannot be done and output is a largest common block starting from a hash-matched block. If it doesn't find - it moves one byte and retries hash-matching, until a match is found. Hence, a smallest noncommon block folowed by a hash-matched block can be 1 byte long. I think this is very similar algorithm to a usual repetition matching.

Instalation:

make && sudo make install

Optional CXXFLAG -D BSIZE=n changes the default block size. Higher block size will increase speed in most of the cases and decrease memory usage always, but will cause slightly bigger output files. Lower block size will cause a bit better results at the cost of much higher memory usage and sometimes slower operation.

Example:
CXXFLAGS="-D BSIZE=4096" make && sudo make install

Default is 2048b.

License:

GNU GPL version 3 or later (http://gnu.org/licenses/gpl.html)

Feel free to contact me at bzaborow [the at sign] tlen.pl