ZipFile behaves unexpectedly since JDK 1.2 - Diagnosis and Solutions

Introduction

With Java 1.2 - 1.4 you can't get the contents of a ZipEntry whose name contains 'umlauts' (precisely: any char > 127).
It mostly works with 1.1.8.

Symptoms

If you let ZipFile enumerate the entries, you get all the names, possibly wrongly decoded, but you get them. If you try to get an InputStream for such an entry (with chars > 127 in its name), ZipFile returns null. There is no way around it.
When I write about chars in names, I mostly mean non-ASCII chars such as ö ß ë.
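To check whether a given archive is affected, one can enumerate the entries and record those for which no stream comes back. The following is only a sketch in current Java syntax (generics and try-with-resources did not exist in JDK 1.2 - 1.4); the class name UnreadableEntryCheck and the method findUnreadable are made up for this illustration. On a JDK without the problem the returned list should simply be empty.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the JDK or of jazzlib.
class UnreadableEntryCheck {

    // Returns the names of all entries for which ZipFile hands out no stream.
    static List<String> findUnreadable(File archive) throws IOException {
        List<String> unreadable = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                InputStream in = zip.getInputStream(entry);
                if (in == null) {
                    // The symptom described above: the entry is listed but cannot be opened.
                    unreadable.add(entry.getName());
                } else {
                    in.close();
                }
            }
        }
        return unreadable;
    }

    public static void main(String[] args) throws IOException {
        for (String name : findUnreadable(new File(args[0]))) {
            System.out.println("no stream for entry: " + name);
        }
    }
}
```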

Diagnosis

My first (wrong) idea, some years ago, was that Sun had more than one way of encoding filenames in zips.
After investigating the native code I found that it interprets the names in the zip as UTF-8 encoded. The bytes are transformed into a String regardless of whether they are valid UTF-8 or not. When you request an InputStream for an entry, the wrongly decoded Java String is transformed back to UTF-8 bytes. The native code searches for an entry whose name matches these bytes. They differ from the bytes in the zip because the transformations
   byte(utf-8)->String   and   String->byte(utf-8)
are not reversible for invalid UTF-8.
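The round trip can be tried out in isolation. The sketch below uses the standard Java charset decoder, which replaces invalid input with U+FFFD; the native JDK code may map invalid bytes differently, so this only illustrates the principle, not the exact bytes the JDK produces. The class name Utf8RoundTripDemo is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: shows that decode-then-encode loses invalid UTF-8.
class Utf8RoundTripDemo {

    // Decode the bytes as UTF-8, then encode the resulting String as UTF-8 again.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ö.txt" as an ISO 8859-1 zip tool would store it: not valid UTF-8.
        byte[] latin1Name = "\u00f6.txt".getBytes(StandardCharsets.ISO_8859_1);
        // The same name stored as real UTF-8.
        byte[] utf8Name = "\u00f6.txt".getBytes(StandardCharsets.UTF_8);

        System.out.println("latin1 survives: "
                + Arrays.equals(latin1Name, roundTrip(latin1Name)));
        System.out.println("utf8 survives:   "
                + Arrays.equals(utf8Name, roundTrip(utf8Name)));
    }
}
```

Only the valid UTF-8 name comes back byte for byte, which is why a lookup by re-encoded name cannot find the ISO 8859-1 entry.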

An extra difficulty (which I saw on Windows): if a zip has names whose chars are encoded in ISO 8859-1 (e.g. Info-ZIP creates those) and others encoded as UTF-8, those ZipEntries may end up with the same (Java String) name, so you can get a stream for only one of them.

Solutions

I have three, and the first is the best.
The others are real hacks. I describe them (the second one has been working for years) because one may need those ideas in other places.

Solution 1

Use pure-Java zip classes, e.g. from jazzlib.sourceforge.net. They are free for use in every product, including non-GPL ones. Be aware of some bugs, and of differences from the JDK classes. Some differences: entries are enumerated in an arbitrary order; the ZipEntry class handles some of the extra info.
This is the best solution. You can interpret the name bytes as you like.
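Once the raw name bytes are in your hands, one possible policy is: accept them as UTF-8 if they are strictly valid, otherwise fall back to a single-byte charset. The sketch below assumes ISO 8859-1 as the fallback (other tools may have written Cp437 or something else), and the class and method names are invented for this example; they are not jazzlib API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for turning raw zip name bytes into a String.
class ZipNameDecoder {

    static String decodeName(byte[] nameBytes) {
        // A strict decoder reports invalid input instead of silently replacing it.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(nameBytes)).toString();
        } catch (CharacterCodingException notUtf8) {
            // Assumption: anything that is not valid UTF-8 is ISO 8859-1.
            return new String(nameBytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

With this policy the byte 0xF6 (ISO 8859-1) and the bytes 0xC3 0xB6 (UTF-8) both decode to "ö", so two entries can still collide on the String name. Code that has to tell them apart should key its lookups on the bytes, not on the decoded String.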

Performance

I tested only decompression, which took around 2.5 times as long as with the native implementation (JRE 1.4.1/Win and jazzlib 0.0.6). Get the test code here. I found jazzlib compiled natively with gcj to be another 2.5 times slower than jazzlib on JRE 1.4.1.

Mixed use

You can use the jazzlib classes to read the zip structure and the native zlib from the JDK to decompress. This way you get the best performance and the best-tested code.

Here are the changes needed to use the native zlib. Just insert some import statements and change the deprecated API call that converts the name (byte[]) to a String. Put the three files 'ZipConstants.java ZipEntry.java ZipFile.java' in one folder (you may change the package names) and add these imports:

//ZipEntry.java
import java.util.zip.ZipOutputStream;

//ZipFile.java
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;
import java.util.zip.ZipOutputStream;

The description of java.util.zip.Inflater notes that a dummy byte is needed at the end of the input stream, because of a difference between gzip and pkzip. I tried the modified jazzlib ZipFile without it, but you may want to take a closer look. You can use the inner class PKZipInputStream from the file ZipPatch.java of the hacking solution 3 below.
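One way to supply such a dummy byte is to append it to the compressed data before handing the stream to the Inflater. The following sketch is not the PKZipInputStream from ZipPatch.java; it is a separate illustration with invented names (RawInflateDemo, deflateRaw, inflateRaw) that compresses some bytes as raw deflate data, as stored in a zip entry, and inflates them again with one zero byte appended.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class for raw ("nowrap") deflate data.
class RawInflateDemo {

    // Compress without zlib header and checksum, like the data of a zip entry.
    static byte[] deflateRaw(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
            dos.write(data);
        } finally {
            deflater.end();
        }
        return out.toByteArray();
    }

    // Inflate raw deflate data, with one dummy byte appended to the input.
    static byte[] inflateRaw(byte[] compressed) throws IOException {
        InputStream padded = new SequenceInputStream(
                new ByteArrayInputStream(compressed),
                new ByteArrayInputStream(new byte[] {0}));   // the dummy byte
        Inflater inflater = new Inflater(true);              // true = nowrap
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(padded, inflater)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

The extra byte is never interpreted as data; the Inflater stops at the end of the deflate stream and leaves it unread.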

Solution 2

When opening a zip, put all filenames into a hash map whose value is the 'index' of the native zip entry (look into the sources). Do this in the constructor. Now just override ZipFile.getEntry() and look up the entry in the hash map there. Done! Oops, too much is private.
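Stripped of the JDK internals, the idea is just a name-to-index table that is filled once and consulted on every lookup. The sketch below shows it with a hypothetical class EntryIndex that takes the names in archive order; it is written in current Java syntax and is not the patch itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone version of the name -> index table.
class EntryIndex {
    private final Map<String, Integer> indexByName = new HashMap<>();

    // Fill the table once, in the order the entries appear in the archive.
    EntryIndex(String[] namesInArchiveOrder) {
        for (int i = 0; i < namesInArchiveOrder.length; i++) {
            // A later entry with the same name replaces the earlier one.
            indexByName.put(namesInArchiveOrder[i], i);
        }
    }

    // Returns the entry index for a name, or -1 if there is no such entry.
    int indexOf(String name) {
        Integer index = indexByName.get(name);
        return index == null ? -1 : index;
    }
}
```

Because the lookup compares Java Strings that came out of the enumeration, it never has to convert a name back to bytes, which is the step that fails in the native lookup.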

Recipe

Hacking ZipFile. You have to patch java.util.zip.ZipFile. The changes follow below. After compiling, put the resulting jar on the boot class path, e.g.

'java ... -Xbootclasspath/p:zippatch1.3.jar ...'

A special classloader would be nice, but loading java.* packages that way is forbidden.

Side effects

Performance may decrease with big archives. I am not aware of any bug it causes, but you never know.

Patch

Here are the three places to change. I tested it with 1.2 up to 1.4.

1. Add a hashtable/map (HashMap isn't synchronized)
public class ZipFile implements ZipConstants {
    private java.util.Hashtable entries;

2. Insert this block at the end of the constructor
    public ZipFile(File file, int mode) throws IOException { // JDK 1.3, 1.4
    public ZipFile(String name) throws IOException {         // JDK 1.2 
...	

	entries = new java.util.Hashtable((int)(this.total*1.1), 0.95f);
	for (int i = 0;i < total;i++){
		long jzentry = getNextEntry(jzfile, i);
		if (jzentry == 0) throw new InternalError("jzentry == 0");

		ZipEntry ze = new ZipEntry(jzentry);
		entries.put(ze.name, new Integer(i));
		freeEntry(jzfile, jzentry);            // not in jdk1.2.2
		}


3. Replace method getEntry(long, String)
//    private static native long getEntry(long jzfile, String name);
    private long getEntry(long jzfile, String name){
	Integer i = (Integer)entries.get(name);
	if (i==null) return 0L;

	long jzentry = getNextEntry(jzfile, i.intValue());
//as you like:	if (jzentry == 0) {throw new InternalError("jzentry == 0");}
	return jzentry;
	}

Solution 3

This is nearly the same as solution 2, but instead of changing ZipFile we extend it and get access to the private methods/fields/classes with reflection. So we can use one class for all JDKs and don't need the boot class path. It may not work if security restrictions are involved, so it surely doesn't work in an applet. The code is in ZipPatch.java.
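The reflection part can be sketched without touching the JDK. In the example below SealedArchive is a hypothetical stand-in for java.util.zip.ZipFile, with a private field and a private method, and ReflectingArchive is a subclass that reaches both. The names are invented for this illustration and are not taken from ZipPatch.java; on recent JDKs, access to the internals of java.* classes is restricted further, so treat this as a demonstration of the mechanism only.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Hypothetical stand-in for a JDK class that keeps everything private.
class SealedArchive {
    private long handle = 42L;

    private long find(String name) {
        return name.isEmpty() ? 0L : handle;
    }
}

// Subclass that reaches the private members of its superclass via reflection.
class ReflectingArchive extends SealedArchive {

    // Calls the private method SealedArchive.find(String).
    long lookup(String name) throws ReflectiveOperationException {
        Method find = SealedArchive.class.getDeclaredMethod("find", String.class);
        find.setAccessible(true);   // may be refused under security restrictions
        return (Long) find.invoke(this, name);
    }

    // Reads the private field SealedArchive.handle.
    long handle() throws ReflectiveOperationException {
        Field field = SealedArchive.class.getDeclaredField("handle");
        field.setAccessible(true);
        return field.getLong(this);
    }
}
```

Looking up the Method and Field objects once and caching them would avoid repeating the reflective lookup on every call.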

© July 2003 Peter Büttner