Reading a ZIP file in Java
Java provides the facility to read raw data compressed with
the DEFLATE algorithm using
an Inflater or InflaterInputStream.
For many applications, a more useful facility is Java's API for
reading from (and writing to) unencrypted ZIP files. (Such support is inevitable, since a jar
(Java archive) file is essentially a ZIP file.) The ZIP file format packages a number
of files together into a single archive; individual subfiles within the archive are
typically compressed using the DEFLATE algorithm. Thus before reading data from the archive,
we need to specify which subfile we want to read.
Note that to read an encrypted ZIP file in Java, you'll need
Arcmexer or some other third party library.
But sticking with Java's built-in support for now,
let's consider the case where we want to read from a single file within
the ZIP archive [1].
The basic pattern for doing so is as follows (for clarity, we'll
ignore exception handling code):
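ZipFile zf = new ZipFile(file);
try {
    // getInputStream() takes a ZipEntry, so look the entry up by name first
    ZipEntry entry = zf.getEntry("file.txt");   // null if there is no such entry
    InputStream in = zf.getInputStream(entry);
    // ... read from 'in' as normal
} finally {
    zf.close();
}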
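Filled in as a complete method, the pattern above might look something like the following sketch. ZipFile.getInputStream() takes a ZipEntry rather than a name, so the entry is looked up with getEntry() first. The class and method names (ZipEntryReader, readEntryAsText) are made up for illustration, the text is assumed to be UTF-8, and readAllBytes() assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, for illustration only
class ZipEntryReader {

    /** Returns the text of the named entry, or null if the archive has no such entry. */
    static String readEntryAsText(Path zipPath, String entryName) throws IOException {
        // try-with-resources closes the ZipFile even if reading fails
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zf.getEntry(entryName);
            if (entry == null) {
                return null; // no subfile with that name
            }
            try (InputStream in = zf.getInputStream(entry)) {
                // Assumption: the subfile contains UTF-8 text
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

Here try-with-resources plays the role of the try/finally block: the ZipFile is closed whether or not the read succeeds.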
We can successively call getInputStream()
on any number of subfiles in the archive and read the corresponding data.
If you really desire, you can also hold open and call read methods concurrently on
different InputStreams from the same ZipFile, but the actual
reads are synchronized on the ZipFile object, so there'll only be one
actual read per zip file in progress at any one time. Given the nature of
what ZipFile does, that kind of makes sense.
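As a rough sketch of reading several subfiles from one open ZipFile, one after another, something like the following could be used. The names MultiEntryReader and readEntries are hypothetical, and readAllBytes() assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, for illustration only
class MultiEntryReader {

    /** Reads each named entry in turn; names not present in the archive are skipped. */
    static Map<String, byte[]> readEntries(Path zipPath, List<String> names) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            for (String name : names) {
                ZipEntry entry = zf.getEntry(name);
                if (entry == null) {
                    continue; // no such subfile in this archive
                }
                // A fresh stream per subfile, all from the same open ZipFile
                try (InputStream in = zf.getInputStream(entry)) {
                    result.put(name, in.readAllBytes());
                }
            }
        }
        return result;
    }
}
```

Opening the ZipFile once and reusing it for every subfile avoids reopening the archive for each read.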
Buffering and stream closing
In general, the flavour of InputStream returned by
ZipFile.getInputStream() can be treated as any old InputStream.
A couple of subtleties are:
- there is no need to call close() on the individual
InputStreams for subfiles, though you should close the ZipFile
as above;
- the single-byte read() method has particularly poor performance;
if you need single-byte reads on an InputStream from a zip file,
wrap it in a BufferedInputStream.
Of course, you should generally avoid unbuffered single-byte reads and writes;
I make the point simply because you might have expected the single-byte read to be reading
straight from a buffer, given the decompression process [2].
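To illustrate the buffering advice, here is a minimal sketch that counts newline bytes in a subfile using single-byte reads, with the stream wrapped in a BufferedInputStream so that each read() is served from the buffer. The names BufferedZipRead and countNewlines are invented for this example.

```java
import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, for illustration only
class BufferedZipRead {

    /** Counts '\n' bytes in the named entry using buffered single-byte reads. */
    static long countNewlines(Path zipPath, String entryName) throws IOException {
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zf.getEntry(entryName);
            if (entry == null) {
                throw new FileNotFoundException("No such entry: " + entryName);
            }
            // The wrapper means read() pulls from an in-memory buffer
            // rather than hitting the zip stream one byte at a time
            try (InputStream in = new BufferedInputStream(zf.getInputStream(entry))) {
                long count = 0;
                int b;
                while ((b = in.read()) != -1) {
                    if (b == '\n') {
                        count++;
                    }
                }
                return count;
            }
        }
    }
}
```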
Enumerating entries and metadata
The example above assumes that you want to read from a known file in the
zip archive. But what happens if you want to read from 'all' of the files,
or files matching a certain filter etc? On the next page, we look at
how to enumerate zip entries and
read their metadata via the ZipFile class.
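As a brief taste of what filtering might look like, the sketch below uses ZipFile.entries() to collect the names of all non-directory entries ending in a given suffix. The names ZipLister and listNames are hypothetical, and matching on a suffix is just one possible filter.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, for illustration only
class ZipLister {

    /** Returns the names of all non-directory entries whose name ends with the suffix. */
    static List<String> listNames(Path zipPath, String suffix) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().endsWith(suffix)) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```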
Problems with ZIP files
There are at least a couple of problems with zip files
that you should be aware of. We'll look at how these relate to Java specifically.
Notes
1. A slight anomaly is that, with ZipFile, the zip data must physically be in
a file; we can't open a ZipFile on an arbitrary input stream.
2. In the current implementation, what actually happens is much worse than
reading straight from a buffer: first,
a single-element byte array is constructed on each call to read(),
and then this is passed to an individual native method call each time.
Performance-wise, that's not good.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.