Java's regex engine supports supplementary characters, so you can specify those high ranges either as UTF-16 surrogate pairs or, more simply, with the \x{h...h} construct (Java 7 and later), which accepts any valid code point.
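To illustrate the two notations, here is a minimal sketch that writes the same supplementary range both ways. The class and method names (`SupplementaryDemo`, `matchesWithSurrogates`, `matchesWithCodePoints`) are made up for this example; only `java.util.regex.Pattern` comes from the JDK.

```java
import java.util.regex.Pattern;

public class SupplementaryDemo {
    // Range U+10000 to U+10FFFF written as UTF-16 surrogate pairs.
    // The Java compiler turns the \\uXXXX escapes into chars; the regex
    // engine then reads each pair as a single code point.
    public static boolean matchesWithSurrogates(String s) {
        return Pattern.matches("[\uD800\uDC00-\uDBFF\uDFFF]", s);
    }

    // Same range written with \x{...}. The backslash is doubled so that
    // the escape is passed through to the regex engine (Java 7+).
    public static boolean matchesWithCodePoints(String s) {
        return Pattern.matches("[\\x{10000}-\\x{10FFFF}]", s);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, one supplementary code point
        System.out.println(matchesWithSurrogates(emoji));
        System.out.println(matchesWithCodePoints(emoji));
        System.out.println(matchesWithCodePoints("A"));
    }
}
```

The second form is usually easier to read, since the code points appear exactly as they do in the XML specification.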
Here is the pattern for removing characters that are illegal in XML 1.0:
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
    + "\u0009\r\n"
    + "\u0020-\uD7FF"
    + "\uE000-\uFFFD"
    // Doubled backslash so that \x{...} reaches the regex engine (Java 7+)
    + "\\x{10000}-\\x{10FFFF}"
    + "]";
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
    + "\u0001-\uD7FF"
    + "\uE000-\uFFFD"
    // Doubled backslash so that \x{...} reaches the regex engine (Java 7+)
    + "\\x{10000}-\\x{10FFFF}"
    + "]+";
You will need to use String.replaceAll(...), which treats its first argument as a regular expression, and not String.replace(...), which matches its argument literally.
String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
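Putting the pieces together, here is a self-contained sketch of the XML 1.0 case. The class name `XmlSanitizer` and the method `stripIllegalXml10` are hypothetical names chosen for this example, and precompiling the pattern is an optional design choice rather than something the approach requires.

```java
import java.util.regex.Pattern;

public class XmlSanitizer {
    // Negated class of everything XML 1.0 allows, so it matches the
    // illegal characters. Compiled once so repeated calls do not
    // re-parse the regex.
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\\x{10000}-\\x{10FFFF}" // doubled backslash, Java 7+
            + "]+");

    // Returns the input with every character that is illegal in XML 1.0 removed.
    public static String stripIllegalXml10(String input) {
        return ILLEGAL_XML10.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and vertical tab are illegal; the emoji (U+1F600) is legal.
        String illegal = "Hello,\0 World!\u000B \uD83D\uDE00";
        System.out.println(stripIllegalXml10(illegal));
    }
}
```

Calling `String.replaceAll(xml10pattern, "")` directly gives the same result; it simply recompiles the pattern on each call.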
Use [^\u0001-\uD7FF\uE000-\uFFFD] to match 16-bit Unicode characters outside the allowed range (this needs to be checked; I'm not sure about the syntax). I don't know how to handle characters beyond 16 bits, sorry.