31

I would like to remove all invalid XML characters from a string, using a regular expression with the String.replace method, like this:

line.replace(regExp,"");

What is the right regExp to use?

An invalid XML character is anything that does not match this:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Thanks.

5
  • 1
    It depends on what you want to replace. What is "invalid XML character"? Commented Nov 21, 2010 at 11:39
  • You are right, I have added the information. Commented Nov 21, 2010 at 11:48
  • Why do you think that characters in that range are invalid for XML? You can use [^\u0001-\uD7FF\uE000-\uFFFD] to match 2-byte unicode chars out of the range (needs to be checked, I'm not sure about the syntax). Don't know anything about 24 bit chars, sorry. Commented Nov 21, 2010 at 12:03
  • 1
    found the valid XML characters here: w3.org/TR/2006/REC-xml11-20060816/#NT-RestrictedChar Commented Nov 21, 2010 at 12:19
  • Neat solution stackoverflow.com/a/9635310/489364 Commented May 20, 2013 at 8:42

9 Answers

94

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars, or, even easier, use the regex escape \x{h...h} (Java 7 and later) to specify any valid code point.

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    // \x{...} is regex syntax, not a Java string escape, so the backslash is doubled
                    + "\\x{10000}-\\x{10FFFF}"
                    + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    // \x{...} is regex syntax, not a Java string escape, so the backslash is doubled
                    + "\\x{10000}-\\x{10FFFF}"
                    + "]+";

You will need to use String.replaceAll(...) and not String.replace(...).

String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
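
As a self-contained sketch of the XML 1.0 pattern above, the expression can be compiled once and reused. The class and method names here are illustrative and not part of the answer, and the sketch assumes Java 7 or later, where the regex engine accepts \x{h...h} for code points.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are placeholders for illustration.
final class Xml10RegexFilter {
    // Matches runs of characters outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}"
            + "\\x{20}-\\x{D7FF}"
            + "\\x{E000}-\\x{FFFD}"
            + "\\x{10000}-\\x{10FFFF}]+");

    // Returns text with every illegal character removed.
    static String strip(String text) {
        return ILLEGAL_XML10.matcher(text).replaceAll("");
    }
}
```

Calling Xml10RegexFilter.strip("Hello, World!\0") returns the string without the trailing NUL. Precompiling the Pattern avoids recompiling the expression on every call, which String.replaceAll would otherwise do.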

13 Comments

The link is broken, the right one seems to be: oracle.com/technetwork/articles/javase/…
Maybe I am wrong, but these ranges will NOT remove characters like \b (\u0008) and so on, yet those chars will also break the XML marshalling. Could you also explain your comment on the answer that quotes Mark McLaren's Weblog? Thank you!
The \ud800\udc00-\udbff\udfff syntax was very misleading for me at first; it's just that the Java regex engine interprets that pair as a single character, am I right?
@ŁukaszL. Correct. The UTF-16 sequence D800 DC00 is code point U+10000, DBFF DFFF is U+10FFFF, and Java's regex engine respects surrogate pairs.
Doh! I thought they were the illegal chars. I guess it's equivalent, but do you think there's any advantage to matching only the illegal chars? ([#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]) - w3.org/TR/2006/REC-xml11-20060816/#NT-RestrictedChar
13

All the answers so far only replace the characters themselves. But sometimes an XML document contains invalid XML entity sequences, which also result in errors. For example, if you have &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ....

Here is a simple Java program that can replace those invalid entity sequences.

  public final Pattern XML_ENTITY_PATTERN = Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");

  /**
   * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
   */
  String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
      String group = m.group(1);
      int val;
      if (group != null) {
        val = Integer.parseInt(group, 16);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#x" + group + ";");
        }
      } else if ((group = m.group(2)) != null) {
        val = Integer.parseInt(group);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#" + group + ";");
        }
      }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
      cleanedXmlString = cleanedXmlString.replaceAll(replacer, "");
    }
    return cleanedXmlString;
  }

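  private boolean isInvalidXmlChar(int val) {
    // Valid XML 1.0 characters:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if (val == 0x9 || val == 0xA || val == 0xD ||
            val >= 0x20 && val <= 0xD7FF ||
            val >= 0xE000 && val <= 0xFFFD ||
            val >= 0x10000 && val <= 0x10FFFF) {
      return false;
    }
    return true;
  }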
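
The same idea can be sketched as a single pass with Matcher.appendReplacement, which avoids collecting a set and rescanning the string for each entry. This is only an illustrative variant: the class and method names are made up, the validity check uses the XML 1.0 Char ranges, and treating an overflowing number as invalid is an assumption about the desired behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-pass variant; the names are placeholders.
final class XmlEntityCleaner {
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    static String removeInvalidEntities(String xml) {
        Matcher m = NUMERIC_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                valid = isValidXml10(codePoint);
            } catch (NumberFormatException e) {
                valid = false; // assumption: a value too large for int is never a valid code point
            }
            // Keep valid references verbatim, drop the rest.
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }
}
```

Named entities such as &amp;amp; do not match the pattern and pass through unchanged.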

3 Comments

This was indeed the right answer for me. I was converting a JSONObject to XML which escaped control chars from "\u0001" to "&#x1;". This code perfectly removed it.
Only this solution removes special characters before deserialization (i.e. sanitizes a malformed XML). In my case, the generated XML that I was parsing contained illegal XML 1.0 characters, which I needed to strip before parsing the XML. Thank you sir.
@nicholas I've adapted your code and created this Java util method to strip/replace invalid characters from malformed XMLs.
11

Should we consider surrogate characters? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true, because a char holds a single UTF-16 code unit and so never exceeds 0xFFFF.

In my tests, the regex approach also seemed slower than the following loop.

if (null == text || text.isEmpty()) {
    return text;
}
final int len = text.length();
char current = 0;
int codePoint = 0;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < len; i++) {
    current = text.charAt(i);
    boolean surrogate = false;
    if (Character.isHighSurrogate(current)
            && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
        surrogate = true;
        codePoint = text.codePointAt(i++);
    } else {
        codePoint = current;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.append(current);
        if (surrogate) {
            sb.append(text.charAt(i));
        }
    }
}

1 Comment

So what is this code doing - removing the illegal characters? How about a function that is replacing them with a different char? :)
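
On replacing instead of removing: a minimal sketch, assuming the caller supplies the replacement code point (for example U+FFFD, the Unicode replacement character). The class and method names are hypothetical and not from any answer here.

```java
// Hypothetical helper for the "replace instead of remove" case.
final class XmlCharReplacer {
    // Substitutes every code point that is illegal in XML 1.0 with the given replacement.
    static String replaceInvalid(String text, int replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            sb.appendCodePoint(isValidXml10(codePoint) ? codePoint : replacement);
            i += Character.charCount(codePoint); // 2 for a surrogate pair, otherwise 1
        }
        return sb.toString();
    }

    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }
}
```

The replacement itself is not validated, so the caller is responsible for passing a code point that is legal in XML.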
3

Jun's solution, simplified. Using StringBuilder#appendCodePoint(int), I need neither char current nor String#charAt(int). I can tell a surrogate pair by checking if codePoint is greater than 0xFFFF.

(It is not necessary to do the i++, since a low surrogate wouldn't pass the filter. But then one would re-use the code for different code points and it would fail. I prefer programming to hacking.)

StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    int codePoint = text.codePointAt(i);
    if (codePoint > 0xFFFF) {
        i++;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.appendCodePoint(codePoint);
    }
}

2 Comments

I got downvoted apparently. I would like to know why. It could just be someone trolling me, but if there is something wrong about the algorithm, I'd like to know.
Do you know how to construct a string that contains an invalid Unicode char above the maximum code point 0x10FFFF? 0x10FFFF should correspond to the Java string "\udbff\udfff". I tried to construct the invalid char 0x110000, which should be the Java string "\udbff\ue000", but Java parses this as 2 code points. Therefore the last check (codePoint <= 0x10FFFF) apparently can't be tested and is useless in real life, as Java never seems to return such a value from codePointAt().
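
The observation in the comment above can be checked directly. A UTF-16 surrogate pair can encode at most U+10FFFF, so the upper bound in the filter documents the range rather than guarding against a reachable value. The helper below is only an illustration (its name is made up) and assumes Java 8 or later for String.codePoints().

```java
// Illustrative check; class and method names are placeholders.
final class CodePointRangeDemo {
    // Returns the code points of s; unpaired surrogates come back as themselves.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }
}
```

For "\udbff\udfff" this yields the single code point 0x10FFFF. For "\udbff\ue000" it yields two values, 0xDBFF and 0xE000, because 0xE000 is not a low surrogate and so no pair is formed.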
2
// xmlData is the input string; the result goes into a new variable so the statement compiles
String cleanedXml = xmlData.codePoints().filter(c -> isValidXMLChar(c)).collect(StringBuilder::new,
                StringBuilder::appendCodePoint, StringBuilder::append).toString();

private boolean isValidXMLChar(int c) {
    if((c == 0x9) ||
       (c == 0xA) ||
       (c == 0xD) ||
       ((c >= 0x20) && (c <= 0xD7FF)) ||
       ((c >= 0xE000) && (c <= 0xFFFD)) ||
       ((c >= 0x10000) && (c <= 0x10FFFF)))
    {
        return true;
    }
    return false;
}

Comments

1

From Mark McLaren's Weblog

  /**
   * This method ensures that the output String has only
   * valid XML unicode characters as specified by the
   * XML 1.0 standard. For reference, please see
   * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
   * standard</a>. This method will return an empty
   * String if the input is null or empty.
   *
   * @param in The String whose non-valid characters we want to remove.
   * @return The in String, stripped of non-valid characters.
   */
  public static String stripNonValidXMLCharacters(String in) {
      StringBuffer out = new StringBuffer(); // Used to hold the output.
      char current; // Used to reference the current character.

      if (in == null || ("".equals(in))) return ""; // vacancy test.
      for (int i = 0; i < in.length(); i++) {
          current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
          if ((current == 0x9) ||
              (current == 0xA) ||
              (current == 0xD) ||
              ((current >= 0x20) && (current <= 0xD7FF)) ||
              ((current >= 0xE000) && (current <= 0xFFFD)) ||
              ((current >= 0x10000) && (current <= 0x10FFFF)))
              out.append(current);
      }
      return out.toString();
  }   

5 Comments

@McDowell could you elaborate what is not covered and why? It's basically the same range as in Jun's answer, which was not downvoted by you.
@ŁukaszL. This code tests UTF-16 code units. Jun's code converts to and tests 32-bit code points. For example, the code point U+1D50A is in the supported range 0x10000-0x10FFFF. It must be represented as a surrogate pair in UTF-16 - e.g. the literal "\uD835\uDD0A". The above algorithm will incorrectly drop anything represented by surrogate pairs. See the code point methods on the Character type.
@McDowell I was using the code above, so please tell me if I've understood correctly: I should drop the range 0x10000-0x10FFFF from that code. Instead I should check Character.isHighSurrogate(current). If so, I should check whether the next character satisfies Character.isLowSurrogate() and only then add both. "\uD801\uDC00" is a correct Unicode character, while "\uDC00\uD801" is not?
@ŁukaszL. That will work. See also here. Also, correct, \uDC00\uD801 is not meaningful data since the pair is backwards - corrupt data.
@McDowell thanks. I've updated my code and made a JUnit test. However, since the question is actually about regex, it's not proper to post here, and it's already similar to Jun's answer.
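
To make the difference discussed in these comments concrete, here is a small sketch that contrasts a per-char test with a per-code-point test. Both methods and the class name are hypothetical; stripByChar mirrors the logic of the method above, and stripByCodePoint assumes Java 8 or later.

```java
// Hypothetical comparison of code-unit and code-point filtering.
final class SurrogateAwareStrip {
    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Tests each UTF-16 code unit: surrogates fall in 0xD800-0xDFFF and fail every range.
    static String stripByChar(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char current = in.charAt(i);
            if (isValidXml10(current)) {
                out.append(current);
            }
        }
        return out.toString();
    }

    // Tests whole code points, so well-formed surrogate pairs survive.
    static String stripByCodePoint(String in) {
        StringBuilder out = new StringBuilder();
        in.codePoints()
          .filter(SurrogateAwareStrip::isValidXml10)
          .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

With the input "\uD835\uDD0A" (U+1D50A), stripByChar returns an empty string while stripByCodePoint returns the input unchanged. The reversed pair "\uDC00\uD801" is removed by both, since each half is an unpaired surrogate.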
0

From Best way to encode text data for XML in Java?

String xmlEscapeText(String t) {
   StringBuilder sb = new StringBuilder();
   for(int i = 0; i < t.length(); i++){
      char c = t.charAt(i);
      switch(c){
      case '<': sb.append("&lt;"); break;
      case '>': sb.append("&gt;"); break;
      case '\"': sb.append("&quot;"); break;
      case '&': sb.append("&amp;"); break;
      case '\'': sb.append("&apos;"); break;
      default:
         if(c>0x7e) {
            sb.append("&#"+((int)c)+";");
         }else
            sb.append(c);
      }
   }
   return sb.toString();
}

2 Comments

No. I don't get how anyone can call enumerating chars one by one the best way.
There is no alternative to checking them one by one. If you use other methods, then those methods must do it; somebody has to. You risk additional overhead if the other method is less efficient. Writing fewer lines in your application isn't the same thing as having the most efficiently running code.
0

If you want to store text elements with the forbidden characters in XML-like form, you can use XPL instead. The dev-kit provides concurrent XPL to XML and XML processing - which means no time cost to the translation from XPL to XML. Or, if you don't need the full power of XML (namespaces), you can just use XPL.

Web Page: HLL XPL

Comments

-1

I believe that the following articles may help you.

http://commons.apache.org/lang/api-2.1/org/apache/commons/lang/StringEscapeUtils.html
http://www.javapractices.com/topic/TopicAction.do?Id=96

In short, try using StringEscapeUtils from the Jakarta project.

1 Comment

I do not see how this helps the original poster - the problem is that there is a range of characters that just cannot be encoded in XML. These must be handled before you attempt to encode your character data.
