Zlib has been the gold standard for lossless compression almost since its introduction in 1995. Much of this success is probably due to its open source status and wide portability: from Palm Pilots to IBM mainframes, you’ll find a zlib version for every platform. Furthermore, unlike the patented LZW algorithm, which can expand some inputs substantially, zlib’s compression method essentially never expands the data. In the worst case, incompressible input grows by only a small, bounded overhead.
You may wonder whether you still need compression nowadays with 300 GB hard drives everywhere. Consider that compression is about more than just saving some space on the hard drive. Applications that use the Internet, such as instant messaging and real-time gaming, benefit from compression immediately because you can never have too much bandwidth. You also can keep some types of sequentially accessed data in memory as compressed items, thereby saving another precious resource. Lastly, any decent compression system includes a Cyclic Redundancy Check (CRC32), which provides yet another layer of defense against data communication or I/O errors.
How Shall I Compress Thee? Let Me Count the Ways
Although at the bottom-most layer zlib is implemented as a straightforward C-style interface, many people have improved zlib’s accessibility over the years by offering alternative interfaces. If you use Visual Basic, you might find it easier to use the zlib ActiveX (OCX) interface developed by Mark Nelson. C++ purists may want to try the Gzstream Library, which mimics the familiar iostream interfaces. Going further afield, ports to Perl, Python, and Tcl, along with several Java implementations, are available, all of which are linked from the zlib home page.
For simplicity’s sake, this tutorial covers only the basic C-style interface. All the concepts inherent there will be relevant to most other bindings.
Zlib Utility Functions
You can write three different styles of applications using zlib: utility I/O functions, in-memory compression, and basic functions. This article examines only the utility I/O functions in depth because of their inherent simplicity. The utility functions are the highest level of the API. Their purpose is to behave the same as the regular file I/O functions except that they do compression behind the scenes. They mimic the C-style FILE objects by providing a replacement “gz” function for each traditional stdio.h function. For example, gzopen() replaces fopen(), gzwrite() replaces fwrite(), and so on. As such, you should expect gzwrite() to return the number of uncompressed bytes written as opposed to the number of actual compressed bytes written.
The basic functions allow fine-grained control over memory allocation, as well as enabling you to examine Adler32 and CRC32 checksums on a block-by-block basis, set a compression level on a scale from 0 (no compression) to 9 (maximum compression), and control the compression dictionaries. As mentioned previously, these options are outside the scope of this article.
The remainder of this article presents the shortest possible program that compresses a file into a second file, and then it demonstrates uncompressing that second file into a third output file (which hopefully matches the original).
To get started, you’ll need a copy of zlib 1.2.3 from the zlib home page. The sample program to follow uses the Win32 precompiled binaries for simplicity. If you’re working on a decently configured Linux system, you probably already have the header at /usr/include/zlib.h.
```c
 1  #include <stdio.h>
 2  #include "/zlib/include/zlib.h"
 3
 4  // Demonstration of zlib utility functions
 5
 6  unsigned long file_size(char *filename)
 7  {
 8     FILE *pFile = fopen(filename, "rb");
 9     if (!pFile) return 0;                 // treat an unreadable file as empty
10     fseek (pFile, 0, SEEK_END);
11     unsigned long size = ftell(pFile);
12     fclose (pFile);
13     return size;
14  }
15
16  int decompress_one_file(char *infilename, char *outfilename)
17  {
18     gzFile infile = gzopen(infilename, "rb");
19     FILE *outfile = fopen(outfilename, "wb");
20     if (!infile || !outfile) return -1;
21
22     char buffer[128];
23     int num_read = 0;
24     while ((num_read = gzread(infile, buffer, sizeof(buffer))) > 0) {
25        fwrite(buffer, 1, num_read, outfile);
26     }
27
28     gzclose(infile);
29     fclose(outfile);
30     return 0;                             // report success to the caller
31  }
32
33  int compress_one_file(char *infilename, char *outfilename)
34  {
35     FILE *infile = fopen(infilename, "rb");
36     gzFile outfile = gzopen(outfilename, "wb");
37     if (!infile || !outfile) return -1;
38
39     char inbuffer[128];
40     int num_read = 0;
41     unsigned long total_read = 0, total_wrote = 0;
42     while ((num_read = fread(inbuffer, 1, sizeof(inbuffer), infile)) > 0) {
43        total_read += num_read;
44        gzwrite(outfile, inbuffer, num_read);
45     }
46     fclose(infile);
47     gzclose(outfile);
48     total_wrote = file_size(outfilename);   // compressed size on disk
49     printf("Read %lu bytes, Wrote %lu bytes, Compression factor %4.2f%%\n",
50        total_read, total_wrote, (1.0 - total_wrote*1.0/total_read)*100.0);
51     return 0;
52  }
53
54
55  int main(int argc, char **argv)
56  {
57     if (argc < 4) {
58        fprintf(stderr, "usage: zlibtest infile compressedfile outfile\n");
59        return 1;
60     }
61     if (compress_one_file(argv[1], argv[2]) != 0) return 1;
62     if (decompress_one_file(argv[2], argv[3]) != 0) return 1;
63     return 0;
64  }
```
The sample program takes three arguments. The first argument is the original uncompressed file; the second is the new compressed file; and the third (just for demonstration purposes) is the uncompressed version of the file you just compressed. For example, you might run it as follows:
zlibtest verybigfile.dat verybigfile.z verybigfile2.dat
You can then compare the first and third files to convince yourself that the compress/uncompress step has not hurt your data. The compress_one_file function (lines 33-52) uses a rather small buffer (128 bytes). In practice, you could easily have fit the whole file in memory, but the point is simply to show how compression works piecewise.
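If you would rather do that comparison in code than with an external tool, a byte-by-byte check along the following lines should suffice. The function name files_match is an assumption for illustration and is not part of the sample program.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the two files have identical contents,
   0 if they differ, and -1 if either file cannot be opened. */
int files_match(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    int ca, cb, result;

    if (!a || !b) {
        if (a) fclose(a);
        if (b) fclose(b);
        return -1;
    }

    /* Read both files in lockstep until the bytes differ or both hit EOF. */
    do {
        ca = fgetc(a);
        cb = fgetc(b);
    } while (ca == cb && ca != EOF);

    result = (ca == cb);   /* equal only if both reached EOF together */
    fclose(a);
    fclose(b);
    return result;
}
```

With the file names from the example run, files_match("verybigfile.dat", "verybigfile2.dat") should return 1 when the round trip preserved every byte.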
To compile this program, you type in the following:
cl /TP zlibtest.c /link /DEBUG d:\zlib\lib\zdll.lib
Note that zlib1.dll will need to be accessible in your path or the current directory.
If you give the program a whirl on the Constitution of the United States, the results are pretty good:
zlibtest constitution.txt constitution.z constitution2.txt
Read 28365 bytes, Wrote 9196 bytes, Compression factor 67.58%
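The compression factor reported here is simply the fraction of space saved, expressed as a percentage. As a small worked sketch (the function name compression_factor is an assumption for illustration), the arithmetic looks like this:

```c
/* Hypothetical helper: percentage of space saved,
   computed as (1 - compressed/original) * 100. */
double compression_factor(unsigned long original, unsigned long compressed)
{
    if (original == 0) return 0.0;   /* avoid dividing by zero on empty input */
    return (1.0 - (double)compressed / (double)original) * 100.0;
}
```

For the run above, compression_factor(28365, 9196) works out to (1 - 9196/28365) * 100, which rounds to the 67.58 shown in the program output.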
If you compare this result with InfoZIP run at “-9” maximal compression, you see that 68 percent is in fact about the best you can hope for on this text file.
Is It Safe Now?
You may have heard of zlib security vulnerabilities in the past. Because an application is only as secure as its weakest linked-in library, should you worry? No. All known security issues have been fixed as of the latest zlib version, 1.2.3. In July 2005, zlib 1.2.2 had a decompression buffer overflow vulnerability. Going back further to March 2005, zlib 1.2.0 had a denial of service (DoS) vulnerability. The previous year, a double free bug cropped up. Given that there are more than a thousand diverse applications using zlib (a few examples off the top of my head are the OpenSSH project, libPNG, and InfoZIP, a popular free pkZIP replacement), even by conservative estimates you could argue it is probably one of the most thoroughly tested libraries ever. Certainly, none of the developers of these widely used products, each of which has been distributed for more than a decade, has given up on zlib.
Compression Is Still Relevant
Until every filesystem and network pipe has transparent and seamless compression built in below the transport layers, you have plenty of good reasons to use compression in your apps: it saves time, space, and therefore money. Zlib has stood the test of time, yet it remains one of the easiest third-party tools to integrate into your application suite. If you haven’t thought about compression lately, maybe it’s time to give zlib a shot.
About the Author
Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisory board member for the Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest, and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.