3

I have XML files (encoded in UTF-8) that have two issues:

  • Some of them (not all) contain a Byte order mark EF BB BF

  • Some of them (not all) contain Null characters 00, distributed over the whole file.

Both issues prevent me from parsing the XML with a SAX parser. My current approach is to read the file into a String, use a regex to remove these characters, and write the String back to a file, which works fine. However, my files are quite large (hundreds of megabytes), and reading the file into a String and creating a result String of the same size every time I call replaceAll() quickly leads to a Java heap space error.

Increasing the heap size is definitely not a long-term solution. I will need to stream the file and remove all these characters on the fly.

Any suggestions on what an efficient solution should look like?

3 Answers

7

I would subclass FilterInputStream to filter out the undesired bytes at runtime.

The task should be rather easy, as a byte order mark should only appear at the start of the file (so you only need to check there), and NUL bytes can easily be filtered with a simple == comparison (no need for regex-like features).

This will most likely also increase performance as you don't need to write out the full corrected file to disk before re-reading it.
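
A minimal sketch of such a filter, assuming the input really is UTF-8 (where a 0x00 byte can only encode U+0000, so dropping it never splits a multi-byte character). The class name BomAndNulFilterInputStream is made up for illustration, and wrapping the source in a PushbackInputStream is just one way to peek at the first three bytes:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical filter: skips a leading UTF-8 BOM and drops every 0x00 byte. */
class BomAndNulFilterInputStream extends FilterInputStream {
    private boolean bomChecked = false;

    BomAndNulFilterInputStream(InputStream source) {
        // The pushback buffer lets us put back up to 3 bytes that were not a BOM.
        super(new PushbackInputStream(source, 3));
    }

    /** On first use, consume EF BB BF if present; otherwise push the bytes back. */
    private void skipBomIfPresent() throws IOException {
        if (bomChecked) return;
        bomChecked = true;
        PushbackInputStream pin = (PushbackInputStream) in;
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break; // stream shorter than 3 bytes
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) pin.unread(head, 0, n);
    }

    @Override
    public int read() throws IOException {
        skipBomIfPresent();
        int b;
        do {
            b = in.read();
        } while (b == 0); // skip NULs; -1 (end of stream) falls through
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        skipBomIfPresent();
        while (true) {
            int n = in.read(buf, off, len);
            if (n <= 0) return n; // -1 signals end of stream
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != 0) buf[off + kept++] = b; // compact in place
            }
            if (kept > 0) return kept;
            // The whole chunk was NULs: read again rather than return 0.
        }
    }
}
```

Note that skip() and available() are not overridden here, so they still operate on the raw, unfiltered bytes; that is a simplification of this sketch.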


3 Comments

+1 I would do the same. And if there was a need to also correct the files on disk, this same filter could be used to copy from a FileInputStream to a FileOutputStream in reasonable chunks (or use IOUtils from Apache Commons IO).
Would you rather subclass FilterInputStream or BufferedInputStream, assuming I'm reading from disk?
@Will: I'd extend FilterInputStream for greater versatility. Whether or not to buffer is a separate decision, and if you extended BufferedInputStream you'd force it on the user of the class. This way, they can choose whether or not to use a BufferedInputStream.
1

Why don't you filter the data as you read it into the SAX parser? This way you won't need to rewrite the file. You can override the read() methods of FilterInputStream to drop the bytes you don't want.

I think that is what @Joachim is suggesting. ;)
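
To illustrate feeding the parser directly, here is a rough sketch that hands a filtered stream to the JDK's SAX parser. FilteredSaxDemo, NulDroppingStream, and countElements are hypothetical names, and the filter is a deliberately minimal stand-in that only drops NUL bytes:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class FilteredSaxDemo {
    /** Minimal stand-in filter (assumption for this sketch): drops 0x00 bytes only. */
    static class NulDroppingStream extends FilterInputStream {
        NulDroppingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b;
            do {
                b = in.read();
            } while (b == 0);
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            while (true) {
                int n = in.read(buf, off, len);
                if (n <= 0) return n; // -1 signals end of stream
                int kept = 0;
                for (int i = 0; i < n; i++) {
                    if (buf[off + i] != 0) buf[off + kept++] = buf[off + i];
                }
                if (kept > 0) return kept; // otherwise the chunk was all NULs
            }
        }
    }

    /** Parses the raw stream through the filter and counts start tags. */
    static int countElements(InputStream raw) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                count[0]++;
            }
        };
        // The parser pulls bytes through the filter; no cleaned copy is written to disk.
        try (InputStream in = new NulDroppingStream(raw)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
        return count[0];
    }
}
```

For a real file you would pass something like a BufferedInputStream around a FileInputStream as the raw argument, so memory use stays bounded by the buffer sizes rather than the file size.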


1

I only concentrated on the BOM and noticed the issue with the null bytes too late. I'm still posting this as an addition in case someone has a problem with BOMs only. Please be kind with respect to downvotes. :)


You could use an InputStream that supports mark() and reset(): read the first three bytes and reset if they are not a BOM:

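import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

InputStream in = new BufferedInputStream(
        new FileInputStream(new File("xmlfile.xml")));
in.mark(3); // remember the start so we can rewind
byte[] maybeBom = new byte[] {
        (byte) in.read(), (byte) in.read(), (byte) in.read() };

// Rewind unless the first three bytes are the UTF-8 BOM.
if (!Arrays.equals(maybeBom, new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF })) {
    in.reset();
}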

I use BufferedInputStream because FileInputStream does not support mark().

