Replace xml special characters in Java String

Question

We fetch XML from one source and pass it on to another component for further processing. However, the fetched XML contains special characters in an attribute value that the next process does not accept. For example:

Sample Input:

<Message text="<html>Welcome User, <br> Happy to have you. <br>.</html>">

Expected Output:

<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;">

Sample Input : <Message text="<html>Welcome User, Happy to have you. </html>" Multi="false"> <Meta source="system" dest="any"></Meta></Message>

Output: <Message text="&lt;html&gt;Welcome User, Happy to have you. &lt;/html&gt;" Multi="false"> <Meta source="system" dest="any"></Meta></Message>

However, the <br> tags are not replaced when the input contains multiple <br> tags.

We are using the following code:

String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"></Meta></Message>";
System.out.println("ORG:" + xml);
xml = replaceChars(xml);
System.out.println("NEW:" + xml);

private static String replaceChars(String xml)
{
    xml = xml.replace("&", "&amp;");
    // Opening tag directly after a quote
    xml = xml.replaceAll("\"<([^<]*)>", "\"&lt;$1&gt;");
    // Closing tag directly before a quote
    xml = xml.replaceAll("</([^<]*)>\"", "&lt;/$1&gt;\"");
    // A single tag somewhere inside a quoted value
    xml = xml.replaceAll("\"([^<]*)<([^<]*)>([^<]*)\"", "\"$1&lt;$2&gt;$3\"");

    return xml;
}

We are not parsing the XML. We just want to remove the characters that prevent the SAX parser in the next stage from parsing it.
– Chota Bheem, Commented Jul 5, 2018 at 12:20
Does this answer your question? removing invalid XML characters from a string in java
– Martin Schröder, Commented Sep 1, 2020 at 11:58

Paolo · Accepted Answer · 2018-07-05 14:49:18Z

2

To match the tags, you can use the following regular expression:

(?:<)(?<=<)(\/?\w*)(?=.*(?<=<\/html))(?:>)

(?:<) Match but don't capture <.
(?<=<) Positive lookbehind for <.
(\/?\w*) Capture tag name. Optional / and word characters.
(?=.*(?<=<\/html)) Positive lookahead requiring that a closing </html appears later in the input (checked with a nested positive lookbehind).
(?:>) Match but don't capture >.

To replace you can use:

&lt;$1&gt;

Where $1 is the result of the capture group in the regular expression. You can test the regular expression interactively here.

Using the following Java code:

public static void main(String[] args) {
    String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"></Meta></Message>";
    String newxml = replaceChars(xml);
    System.out.println(newxml);
}

private static String replaceChars(String xml) {
    // Backslashes are doubled because the regex is written as a Java string literal.
    xml = xml.replaceAll("(?:<)(?<=<)(\\/?\\w*)(?=.*(?<=<\\/html))(?:>)", "&lt;$1&gt;");
    return xml;
}

The output is:

<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;" Multi="false"><Meta source="system" dest="any"></Meta></Message>

edited Jul 5, 2018 at 14:49

answered Jul 5, 2018 at 12:57


4 Comments

Chota Bheem Over a year ago

It is partially correct. The whole output is :

<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;" Multi="false"> <Meta source="system" dest="any">&lt;/Meta&gt;&lt;/Message&gt;

Observe the closing tags for Meta and Message. Basically, we want to consider only the content that sits between double quotes.

Paolo Over a year ago

@Chota Right, I get you. Please try (?:<)(?<=<)(\/?\w*)(?=.*(?<=<\/html))(?:>) here. Let me know and I will update my answer.

Chota Bheem Over a year ago

Yes, this is better, but it seems to expect that the value always ends with </html>, which is not actually the case. We might have a string that just has some <br> in it.

Paolo Over a year ago

It is trivial to add additional cases to the second lookbehind for tags you know you will want to match, i.e. (?<=<\/html|\/br)
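
The requirement raised in these comments (touch only what sits between double quotes, whatever tags it contains) can also be sketched without lookarounds. The following is only an illustration, not the approach from the answer: the class and method names are made up, it needs Java 9 or later for the functional form of Matcher.replaceAll, and it assumes that quotes are balanced, that attribute values never contain a double quote, and that only < and > need escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes < and > only inside double-quoted runs.
final class QuotedValueEscaper {

    // One quoted run: opening quote, any non-quote characters, closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static String escapeInsideQuotes(String xml) {
        // quoteReplacement keeps any '$' or '\' in the value from being
        // interpreted as a group reference in the replacement string.
        return QUOTED.matcher(xml).replaceAll(match ->
                Matcher.quoteReplacement(
                        match.group().replace("<", "&lt;").replace(">", "&gt;")));
    }

    public static void main(String[] args) {
        String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\">"
                + "<Meta source=\"system\" dest=\"any\"></Meta></Message>";
        System.out.println(escapeInsideQuotes(xml));
    }
}
```

Because the tags outside the quotes are never matched, the closing Meta and Message tags stay untouched. The sketch breaks as soon as a value contains a quote character or the document uses single-quoted attributes.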

diginoise · Accepted Answer · 2018-07-10 15:15:40Z

2

Please do not use regular expressions to escape special characters in XML.

Can you guarantee that this will work for all possible HTML input, with all the quirks of HTML and XML (the specs are very extensive)?

Just use one of many utilities out there to escape XML strings.

Apache Commons is quite popular - please see this example
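
As a rough illustration of what such a utility does, the sketch below escapes the attribute value before it is embedded in the XML. To stay dependency-free it uses a hand-written stand-in, escapeXml, which is a hypothetical helper covering only the five predefined XML entities; with Apache Commons Text on the classpath, that call would be replaced by StringEscapeUtils.escapeXml10. The class name and the buildMessage method are likewise made up for the example.

```java
// Hypothetical example: escape the value first, then embed it in the XML.
final class EscapeBeforeEmbedding {

    // Stand-in for a library escaper; handles only the five predefined entities.
    static String escapeXml(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds the element from an already separated attribute value.
    static String buildMessage(String text) {
        return "<Message text=\"" + escapeXml(text) + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(buildMessage("<html>Welcome User, <br> Happy to have you.</html>"));
    }
}
```

This only helps where the raw value is still available separately, that is, at the point where the XML is produced. It does not repair a document that has already been assembled with unescaped characters.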

edited Jul 10, 2018 at 15:15

answered Jul 5, 2018 at 14:10


Peter Walser · Accepted Answer · 2018-07-05 14:29:27Z

XML is not plain text. For processing purposes, XML documents are better treated as a binary format.

Processing XML as text is the wrong approach, and only works in simple cases. Things to consider:

An XML document has no file encoding; instead, the content encoding is specified in the document itself (so it must be read by an XML parser, which handles this correctly).
XML documents use XML entities (built-ins like &amp;, &lt;, &gt; and &quot;; others can be defined arbitrarily in a DTD, see https://www.w3resource.com/xml/entities.php).
XML documents can contain CDATA sections.

Therefore:

use a proper XML parser to read documents
perform manipulations (text replacement, add/remove nodes) on the DOM (document object model) or streaming model.
use a proper XML processor to write documents
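
The three steps above can be sketched with the JDK's built-in DOM and Transformer APIs. This is a minimal sketch under assumptions, not a drop-in fix: the class and method names are invented, the attribute name text is taken from the question, and the input passed to the parser must already be well-formed (which the question's sample is not). It mainly shows that the serializer takes care of escaping when the value is set through the DOM.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: read with a parser, change the DOM, write with a serializer.
final class DomRoundTrip {

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    // Sets the root element's "text" attribute and serializes the document.
    static String withTextAttribute(String xml, String rawValue) {
        Document doc = parse(xml);
        doc.getDocumentElement().setAttribute("text", rawValue); // raw, unescaped value
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString(); // the serializer writes the entities
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Reads the attribute back; the parser resolves the entities again.
    static String readTextAttribute(String xml) {
        return parse(xml).getDocumentElement().getAttribute("text");
    }

    public static void main(String[] args) {
        String xml = "<Message text=\"\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"/></Message>";
        String html = "<html>Welcome User, <br> Happy to have you. <br>.</html>";
        String written = withTextAttribute(xml, html);
        System.out.println(written);
        System.out.println(readTextAttribute(written));
    }
}
```

The exact serialized form (attribute order, empty-element syntax) depends on the Transformer implementation, but the value read back by the parser is the original unescaped string.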

By the way, the XML in your example is not actually XML: it is not well-formed, because no entities are used for characters such as < and > inside the attribute value.
