I have XML files (encoded in UTF-8) that have two issues:
Some of them (not all) start with a byte order mark (EF BB BF).
Some of them (not all) contain null characters (00), distributed over the whole file.
Both issues prevent me from parsing the XML with a SAX parser. My current approach is to read the file into a String, use a regex to remove these characters, and write the String back to a file, which works fine. However, my files are quite large (hundreds of megabytes), and reading the file into a String and then creating a result String of the same size every time I call replaceAll() quickly leads to a Java heap space error.
Increasing the heap size is definitely not a long-term solution. I will need to stream the file and remove all these characters on the fly.
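To make the streaming idea concrete, here is a rough sketch of the kind of thing I have in mind, not a tested solution. The class name CleaningInputStream is made up for illustration. It assumes the BOM, when present, is exactly the first three bytes of the file, and it relies on the fact that in UTF-8 a 00 byte can only ever be a null character (bytes of multi-byte sequences are all 0x80 or higher), so dropping 00 bytes at the byte level should be safe.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical filter: drops a leading UTF-8 BOM and every 0x00 byte while streaming. */
final class CleaningInputStream extends FilterInputStream {

    CleaningInputStream(InputStream in) throws IOException {
        super(skipBom(in));
    }

    /** Peeks at the first three bytes and pushes them back unless they are EF BB BF. */
    private static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break; // file shorter than three bytes
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: hand the bytes back
        }
        return pb;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = in.read();
        } while (b == 0); // skip null bytes; -1 (EOF) ends the loop too
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = in.read(buf, off, len);
            if (n <= 0) {
                return n; // -1 at end of stream
            }
            // Compact the chunk in place, keeping only non-null bytes.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != 0) {
                    buf[off + kept++] = b;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was null bytes: read the next one.
        }
    }
}
```

The idea would be to wrap a BufferedInputStream around the FileInputStream, wrap that in this filter, and pass the result straight to SAXParser.parse(InputStream, DefaultHandler), so that only one buffer's worth of data is in memory at a time and no cleaned copy of the file has to be written first.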
Any suggestions on what an efficient solution should look like?