removing invalid XML characters from a string in java

Question

Hi, I would like to remove all invalid XML characters from a string. I would like to use a regular expression with the string.replace method, like:

line.replace(regExp,"");

What is the right regExp to use?

An invalid XML character is anything that is not one of these:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Thanks.

It depends on what you want to replace. What is an "invalid XML character"?
– khachik, Commented Nov 21, 2010 at 11:39
Why do you think that characters in that range are invalid for XML? You can use [^\u0001-\uD7FF\uE000-\uFFFD] to match 2-byte unicode chars out of the range (needs to be checked, I'm not sure about the syntax). Don't know anything about 24 bit chars, sorry.
– khachik, Commented Nov 21, 2010 at 12:03
Found the valid XML characters here: w3.org/TR/2006/REC-xml11-20060816/#NT-RestrictedChar
– yossi, Commented Nov 21, 2010 at 12:19

k314159 · Accepted Answer · 2024-08-01 21:21:28Z

94

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars, or, even easier, use \x to specify any valid code point.

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\\x{10000}-\\x{10FFFF}" // doubled backslash: \x{...} is a regex escape, not a Java one
                    + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\\x{10000}-\\x{10FFFF}" // doubled backslash: \x{...} is a regex escape, not a Java one
                    + "]+";

You will need to use String.replaceAll(...) and not String.replace(...).

String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
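
If the same cleanup runs many times, it may be worth compiling the pattern once instead of calling String.replaceAll on every string. Here is a minimal sketch of that idea for the XML 1.0 range; the class and method names are made up for illustration, and it assumes Java 7 or later for the \x{...} syntax.

```java
import java.util.regex.Pattern;

class Xml10RegexSketch {
    // Matches any code point that is NOT allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // All backslashes are doubled so the regex engine, not the Java compiler,
    // interprets the escapes.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Hypothetical helper: returns the input with every invalid character removed.
    static String stripInvalidXml10(String input) {
        return INVALID_XML10.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and backspace are illegal in XML 1.0; U+1D50A (a surrogate pair) is legal.
        String dirty = "Hello,\u0000 World!\u0008 \uD835\uDD0A";
        System.out.println(stripInvalidXml10(dirty));
    }
}
```

Passing a different replacement string to replaceAll would substitute a placeholder instead of deleting the characters.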

edited Aug 1, 2024 at 21:21

k314159

answered Nov 21, 2010 at 12:58

McDowell

13 Comments

evgenyl Over a year ago

The link is broken, the right one seems to be: oracle.com/technetwork/articles/javase/…

evgenyl Over a year ago

Maybe I am wrong, but these ranges will NOT remove characters like \b (\u0008), and so on. But these chars will also break the XML marshalling. Can you also please explain your comment on the answer that quotes Mark McLaren's Weblog? Thank you!

Cjxcz Odjcayrwl Over a year ago

The \ud800\udc00-\udbff\udfff syntax was at first very misleading for me, it's just that Java Regex engine interprets that pair as single character, am I right?

McDowell Over a year ago

@ŁukaszL. Correct. The UTF-16 sequence D800 DC00 is code point U+10000, DBFF DFFF is U+10FFFF, and Java's regex engine respects surrogate pairs.

Redtopia Over a year ago

Doh! I thought they were the illegal chars. I guess it's equivalent, but do you think there's any advantage to matching only the illegal chars? ([#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]) - w3.org/TR/2006/REC-xml11-20060816/#NT-RestrictedChar

Nicholas DiPiazza · Accepted Answer · 2017-07-20 19:14:36Z

13

All the answers so far only replace the characters themselves. But sometimes an XML document contains invalid XML entity sequences, which also result in errors. For example, if you have &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ....

Here is a simple java program that can replace those invalid entity sequences.

  public final Pattern XML_ENTITY_PATTERN = Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");

  /**
   * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
   */
  String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
      String group = m.group(1);
      int val;
      if (group != null) {
        val = Integer.parseInt(group, 16);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#x" + group + ";");
        }
      } else if ((group = m.group(2)) != null) {
        val = Integer.parseInt(group);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#" + group + ";");
        }
      }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
      cleanedXmlString = cleanedXmlString.replaceAll(replacer, "");
    }
    return cleanedXmlString;
  }

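  // Returns true when val falls outside the XML 1.0 Char production.
  private boolean isInvalidXmlChar(int val) {
    if (val == 0x9 || val == 0xA || val == 0xD ||
            val >= 0x20 && val <= 0xD7FF ||
            val >= 0xE000 && val <= 0xFFFD ||
            val >= 0x10000 && val <= 0x10FFFF) {
      return false;
    }
    return true;
  }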
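
For what it's worth, the same idea can probably be done in a single pass with Matcher.appendReplacement, which avoids collecting the offending entities in a set and re-scanning the string once per entity. This is only a sketch under my own assumptions: the class and method names are hypothetical, it checks against the XML 1.0 Char ranges, and it treats a numeric reference with too many digits to fit in an int as invalid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEntitySketch {
    // Numeric character references: &#xHEX; or &#DECIMAL;
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // XML 1.0 Char production, applied to a code point.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: drops numeric references that expand to an illegal character.
    static String removeInvalidNumericEntities(String xml) {
        Matcher m = NUMERIC_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                valid = isValidXml10Char(cp);
            } catch (NumberFormatException tooLarge) {
                // A value that overflows int cannot be a valid code point.
                valid = false;
            }
            // Keep valid references verbatim; replace invalid ones with nothing.
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeInvalidNumericEntities("<a>&#2;x&#x41;&#xFFFD;&#x1F;</a>"));
    }
}
```

Named entities such as &amp; are left untouched, since the pattern only matches numeric references.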

edited Jul 20, 2017 at 19:14

answered Jul 20, 2017 at 18:55

Nicholas DiPiazza

3 Comments

Matze.N Over a year ago

This was indeed the right answer for me. I was converting a JSONObject to XML which escaped control chars from "\u0001" to "". This code perfectly removed it.

Vanja D. Over a year ago

Only this solution removes special characters before deserialization (i.e. sanitizes a malformed XML). In my case, the generated XML that I was parsing contained illegal XML 1.0 characters, which I needed to strip before parsing the XML. Thank you sir.

Vanja D. Over a year ago

@nicholas I've adapted your code and created this Java util method to strip/replace invalid characters from malformed XMLs.

Jun · Accepted Answer · 2012-07-27 05:27:38Z

11

Should we consider surrogate characters? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true, because a single char can never hold a value above 0xFFFF.

In my tests, the regex approach also seemed slower than the following loop.

// Wrapped in a method (the name is illustrative) so the snippet compiles on its own.
static String stripInvalidXmlChars(String text) {
    if (null == text || text.isEmpty()) {
        return text;
    }
    final int len = text.length();
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
        char current = text.charAt(i);
        boolean surrogate = false;
        int codePoint;
        if (Character.isHighSurrogate(current)
                && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
            // A well-formed surrogate pair: read it as one code point and skip the low half.
            surrogate = true;
            codePoint = text.codePointAt(i++);
        } else {
            codePoint = current;
        }
        if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            sb.append(current);
            if (surrogate) {
                // i now points at the low surrogate of the pair.
                sb.append(text.charAt(i));
            }
        }
    }
    return sb.toString();
}

edited Jul 27, 2012 at 5:27

answered Jul 26, 2012 at 15:31

Jun

1 Comment

Martynas Jusevičius Over a year ago

So what is this code doing - removing the illegal characters? How about a function that is replacing them with a different char? :)

Vlasec · Accepted Answer · 2015-02-02 17:33:18Z

3

Jun's solution, simplified. Using StringBuilder#appendCodePoint(int), I need no char current or String#charAt(int). I can tell a surrogate pair by checking if codePoint is greater than 0xFFFF.

(It is not necessary to do the i++, since a low surrogate wouldn't pass the filter. But then one would re-use the code for different code points and it would fail. I prefer programming to hacking.)

StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    int codePoint = text.codePointAt(i);
    if (codePoint > 0xFFFF) {
        i++;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.appendCodePoint(codePoint);
    }
}
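
Regarding the earlier question about replacing instead of removing: the loop above could be wrapped roughly like this. It is a sketch, not a tested library method; the class name, method names, and the replacement parameter are my own, and it uses Character.charCount to step over surrogate pairs rather than the explicit i++.

```java
class XmlCodePointSketch {
    // XML 1.0 Char production, applied to a code point.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: every invalid code point is replaced by `replacement`.
    // Pass an empty string to simply drop invalid characters.
    static String sanitize(String text, String replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(replacement);
            }
            // Advance by 2 chars for a surrogate pair, otherwise by 1.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // NUL and the trailing lone surrogate are invalid; the pair U+1D50A is valid.
        System.out.println(sanitize("a\u0000b\uD835\uDD0A\uD800", "?"));
    }
}
```

A lone surrogate comes back from codePointAt as a value in 0xD800-0xDFFF, so it fails the range check and is replaced like any other invalid character.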

answered Feb 2, 2015 at 17:33

Vlasec

2 Comments

Vlasec Over a year ago

I got downvoted apparently. I would like to know why. It could just be someone trolling me, but if there is something wrong about the algorithm, I'd like to know.

petrsyn Over a year ago

Do you know how to construct a string that contains an invalid Unicode char above the max. 0x10FFFF code point? 0x10FFFF should correspond to the Java string "\udbff\udfff". I tried to construct the invalid char 0x110000, which should be the Java string "\udbff\ue000", but Java parses this as 2 code points. Therefore the last check (codePoint <= 0x10FFFF) seems impossible to test and useless in real life, as Java never seems to return such a value from codePointAt().

Hans Schreuder · Accepted Answer · 2018-01-29 08:18:56Z

2

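// xmlData is the input string; a variable cannot be initialized from itself in its own declaration.
String cleanedXml = xmlData.codePoints().filter(c -> isValidXMLChar(c)).collect(StringBuilder::new,
                StringBuilder::appendCodePoint, StringBuilder::append).toString();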

private boolean isValidXMLChar(int c) {
    if((c == 0x9) ||
       (c == 0xA) ||
       (c == 0xD) ||
       ((c >= 0x20) && (c <= 0xD7FF)) ||
       ((c >= 0xE000) && (c <= 0xFFFD)) ||
       ((c >= 0x10000) && (c <= 0x10FFFF)))
    {
        return true;
    }
    return false;
}

edited Jan 29, 2018 at 8:18

answered Jan 23, 2018 at 9:03

Hans Schreuder

Renaud · Accepted Answer · 2012-06-05 09:20:15Z

1

From Mark McLaren's Weblog

  /**
   * This method ensures that the output String has only
   * valid XML unicode characters as specified by the
   * XML 1.0 standard. For reference, please see
   * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
   * standard</a>. This method will return an empty
   * String if the input is null or empty.
   *
   * @param in The String whose non-valid characters we want to remove.
   * @return The in String, stripped of non-valid characters.
   */
  public static String stripNonValidXMLCharacters(String in) {
      StringBuffer out = new StringBuffer(); // Used to hold the output.
      char current; // Used to reference the current character.

      if (in == null || ("".equals(in))) return ""; // vacancy test.
      for (int i = 0; i < in.length(); i++) {
          current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
          if ((current == 0x9) ||
              (current == 0xA) ||
              (current == 0xD) ||
              ((current >= 0x20) && (current <= 0xD7FF)) ||
              ((current >= 0xE000) && (current <= 0xFFFD)) ||
              ((current >= 0x10000) && (current <= 0x10FFFF)))
              out.append(current);
      }
      return out.toString();
  }

answered Jun 5, 2012 at 9:20

Renaud

5 Comments

Cjxcz Odjcayrwl Over a year ago

@McDowell could you elaborate what is not covered and why? It's basically the same range as in Jun's answer, which was not downvoted by you.

McDowell Over a year ago

@ŁukaszL. This code tests UTF-16 code units. Jun's code converts to and tests 32-bit code points. For example, the code point U+1D50A is in the supported range 0x10000-0x10FFFF. It must be represented as a surrogate pair in UTF-16 - e.g. the literal "\uD835\uDD0A". The above algorithm will incorrectly drop anything represented by surrogate pairs. See the code point methods on the Character type.

Cjxcz Odjcayrwl Over a year ago

@McDowell I was using the code above, so please tell me if I've understood this correctly: I should drop the range 0x10000-0x10FFFF from that code. Instead I should check Character.isHighSurrogate(current). If that is true, I should check whether the next character passes Character.isLowSurrogate(), and only then add both. "\uD801\uDC00" is a correct Unicode character, while "\uDC00\uD801" is not?

McDowell Over a year ago

@ŁukaszL. That will work. See also here. Also, correct, \uDC00\uD801 is not meaningful data since the pair is backwards - corrupt data.

Cjxcz Odjcayrwl Over a year ago

@McDowell thanks. I've updated my code and made a JUnit test. However, since the question is actually about regex, it's not proper to post here, and it's already similar to Jun's answer.
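
To make the code unit vs. code point distinction from these comments concrete, here is a small sketch that can be run as-is. The class and helper names are made up; the string literals are the ones mentioned in the comments above.

```java
class SurrogatePairSketch {
    // True only for a two-char string that is a high surrogate followed by a low surrogate.
    static boolean isWellFormedPair(String s) {
        return s.length() == 2 && Character.isSurrogatePair(s.charAt(0), s.charAt(1));
    }

    public static void main(String[] args) {
        String pair = "\uD835\uDD0A"; // U+1D50A encoded as two UTF-16 code units

        System.out.println(pair.length());                            // 2
        System.out.println(Integer.toHexString(pair.charAt(0)));      // d835
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // 1d50a

        // A char-based range check only ever sees d835 and dd0a, both of which fall
        // in the excluded D800-DFFF gap, so the character would be dropped.
        System.out.println(isWellFormedPair("\uD801\uDC00")); // true
        System.out.println(isWellFormedPair("\uDC00\uD801")); // false
    }
}
```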

Community · Accepted Answer · 2017-05-23 12:10:33Z

0

From Best way to encode text data for XML in Java?

String xmlEscapeText(String t) {
   StringBuilder sb = new StringBuilder();
   for(int i = 0; i < t.length(); i++){
      char c = t.charAt(i);
      switch(c){
      case '<': sb.append("&lt;"); break;
      case '>': sb.append("&gt;"); break;
      case '\"': sb.append("&quot;"); break;
      case '&': sb.append("&amp;"); break;
      case '\'': sb.append("&apos;"); break;
      default:
         if(c>0x7e) {
            sb.append("&#"+((int)c)+";");
         }else
            sb.append(c);
      }
   }
   return sb.toString();
}

edited May 23, 2017 at 12:10

CommunityBot

answered Nov 10, 2015 at 16:43

Roger F. Gay

2 Comments

jediz Over a year ago

No. How can one state enumerating chars one by one as the best way I don't get it.

Roger F. Gay Over a year ago

There is no alternative to checking them one-by-one. If you use other methods, then those methods must do it - somebody has to. You risk additional overhead if the other method is less efficient. Writing fewer lines in your application isn't the same thing as having the most efficiently running code.

Roger F. Gay · Accepted Answer · 2017-04-07 13:09:41Z

0

If you want to store text elements with the forbidden characters in XML-like form, you can use XPL instead. The dev-kit provides concurrent XPL to XML and XML processing - which means no time cost to the translation from XPL to XML. Or, if you don't need the full power of XML (namespaces), you can just use XPL.

Web Page: HLL XPL

answered Apr 7, 2017 at 13:09

Roger F. Gay

AlexR · Accepted Answer · 2010-11-21 12:26:00Z

-1

I believe that the following articles may help you.

http://commons.apache.org/lang/api-2.1/org/apache/commons/lang/StringEscapeUtils.html http://www.javapractices.com/topic/TopicAction.do?Id=96

In short, try using StringEscapeUtils from the Jakarta project.

answered Nov 21, 2010 at 12:26

AlexR

1 Comment

McDowell Over a year ago

I do not see how this helps the original poster - the problem is that there is a range of characters that just cannot be encoded in XML. These must be handled before you attempt to encode your character data.
