Java regex alternative for bytes on stream

Question

I have XML files (encoded in UTF-8) that have two issues:

Some of them (not all) contain a Byte order mark EF BB BF
Some of them (not all) contain Null characters 00, distributed over the whole file.

Both issues prevent me from parsing the XML with a SAX parser. My current approach is to read the file into a String, use a regex to strip these characters, and write the String back to a file, which works fine. However, my files are quite large (hundreds of megabytes), and reading a file into a String and creating a result String of the same size every time I call replaceAll() quickly leads to a Java heap space error.

Increasing the heap size is definitely not a long-term solution. I will need to stream the file and strip all these characters on the fly.

Any suggestions on what an efficient solution should look like?

Joachim Sauer · Accepted Answer · 2011-05-04 09:52:44Z

7

I would subclass FilterInputStream to filter out the undesired bytes at runtime.

The task should be rather easy, as byte order marks are probably only at the start of the file (so you only need to check there) and NUL bytes can easily be filtered out with a simple == comparison (no need for regex-like features).

This will most likely also increase performance as you don't need to write out the full corrected file to disk before re-reading it.
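
As a rough illustration of the approach described above, here is a minimal sketch of such a subclass. The class name BomAndNulFilterInputStream and the use of a PushbackInputStream to "unread" leading bytes that turn out not to be a BOM are assumptions made for this example, not something the answer prescribes. The sketch also leaves skip() and available() unadjusted, so they may count bytes that would later be dropped.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical filter: drops one leading UTF-8 BOM (EF BB BF) and every 0x00 byte. */
class BomAndNulFilterInputStream extends FilterInputStream {
    private boolean atStart = true;

    BomAndNulFilterInputStream(InputStream in) {
        // A 3-byte pushback buffer lets us put back leading bytes that are not a BOM.
        super(new PushbackInputStream(in, 3));
    }

    /** Reads up to three bytes once; discards them only if they are exactly EF BB BF. */
    private void skipBomIfPresent() throws IOException {
        atStart = false;
        PushbackInputStream pb = (PushbackInputStream) in;
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) break; // stream shorter than three bytes
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pb.unread(head, 0, n);
        }
    }

    @Override
    public int read() throws IOException {
        if (atStart) skipBomIfPresent();
        int b;
        do {
            b = in.read();
        } while (b == 0); // skip NUL bytes; -1 (end of stream) ends the loop too
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (atStart) skipBomIfPresent();
        if (len == 0) return 0;
        while (true) {
            int n = in.read(buf, off, len);
            if (n <= 0) return n; // end of stream
            int kept = 0;
            for (int i = off; i < off + n; i++) {
                if (buf[i] != 0) buf[off + kept++] = buf[i]; // compact in place
            }
            if (kept > 0) return kept;
            // The whole chunk was NUL bytes: read again rather than return 0.
        }
    }
}
```

Because only a comparison per byte is involved, the filter works in a single pass and never holds more than the caller's buffer in memory.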


3 Comments

WhiteFang34 Over a year ago

+1 I would do the same. And if there was a need to also correct the files on disk, this same filter could be used to copy from a FileInputStream to a FileOutputStream in reasonable chunks (or use IOUtils from Apache Commons IO).
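
The chunked copy mentioned in this comment might look roughly like the sketch below. The names FilteredCopy and copyFiltered, the 8192-byte chunk size, and passing the filter as a UnaryOperator are all assumptions made for illustration; any FilterInputStream subclass that drops the unwanted bytes could be plugged in.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.UnaryOperator;

class FilteredCopy {
    /**
     * Copies source to target in fixed-size chunks, passing the input through
     * the supplied filter. Returns the number of bytes written.
     */
    static long copyFiltered(File source, File target, UnaryOperator<InputStream> filter)
            throws IOException {
        try (InputStream in = filter.apply(
                     new BufferedInputStream(new FileInputStream(source)));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
            byte[] chunk = new byte[8192]; // assumed chunk size; tune as needed
            long total = 0;
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
                total += n;
            }
            return total;
        }
    }
}
```

A constructor reference to the filtering class can be passed as the third argument, as long as that constructor takes a single InputStream. Memory use stays bounded by the chunk size regardless of file size.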

Will Over a year ago

Would you rather subclass FilterInputStream or BufferedInputStream, assuming I'm reading from disk?

Joachim Sauer Over a year ago

@Will: I'd extend FilterInputStream for more versatility: whether or not to buffer is a separate decision, and if you extended BufferedInputStream you'd force it on the user of the class. The other way round, the user can choose whether or not to wrap a BufferedInputStream.

Peter Lawrey · Accepted Answer · 2011-05-04 09:58:42Z

1

Why don't you filter the data as you read it into the SAX parser? That way you won't need to rewrite the file. You can override the read() methods of FilterInputStream to drop the bytes you don't want.

I think that is what @Joachim is suggesting. ;)
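
Handing a filtered stream straight to the parser could be wired up roughly as follows. This is only a sketch: FilteredSaxDemo, ElementCounter and countElements are hypothetical names, the element-counting handler stands in for whatever handler the real application uses, and the caller is assumed to pass in a stream that is already wrapped in the filtering FilterInputStream subclass.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class FilteredSaxDemo {
    /** Placeholder handler: counts start tags instead of doing real work. */
    static class ElementCounter extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            elements++;
        }
    }

    /**
     * Parses XML directly from the given stream. The stream is expected to be
     * wrapped in the byte-dropping filter already, so no cleaned copy of the
     * file is ever written to disk.
     */
    static int countElements(InputStream filtered) throws Exception {
        ElementCounter handler = new ElementCounter();
        SAXParserFactory.newInstance().newSAXParser().parse(filtered, handler);
        return handler.elements;
    }
}
```

The parser pulls bytes through the filter on demand, so the file is read exactly once.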


musiKk · Accepted Answer · 2011-05-04 10:08:09Z

I concentrated only on the BOM and noticed the issue with the null bytes too late. I'm still posting this as an addition in case someone has a problem with BOMs only. Please be kind with respect to downvotes. :)

You could wrap the file in an InputStream that supports mark() and reset(), read the first three bytes, and reset if they are not a BOM:

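import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

InputStream in = new BufferedInputStream(
        new FileInputStream(new File("xmlfile.xml")));
in.mark(3); // remember the start; the mark stays valid for the next 3 bytes
byte[] maybeBom = new byte[] {
        (byte) in.read(), (byte) in.read(), (byte) in.read() };

// Not a UTF-8 BOM: rewind so the parser still sees these bytes.
if (!Arrays.equals(maybeBom, new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF })) {
    in.reset();
}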

I use BufferedInputStream because FileInputStream does not support mark().
