
We fetch XML from one source and pass it on to another component for further processing. However, the fetched XML contains special characters in an attribute value, and the next process does not accept them. For example:

Sample Input:

<Message text="<html>Welcome User, <br> Happy to have you. <br>.</html>">

Expected Output:

<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;">

Sample Input : <Message text="<html>Welcome User, <br> Happy to have you. </html>" Multi="false"> <Meta source="system" dest="any"></Meta></Message>

Output: <Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;/html&gt;" Multi="false"> <Meta source="system" dest="any"></Meta></Message>

This works when the value contains a single <br>, but the <br> tags are not replaced when the input contains more than one of them.

We are using the following code:

public static void main(String[] args) {
    String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"></Meta></Message>";
    System.out.println("ORG:" + xml);
    xml = replaceChars(xml);
    System.out.println("NEW:" + xml);
}

private static String replaceChars(String xml) {
    // Escape ampersands first.
    xml = xml.replace("&", "&amp;");
    // Tag immediately after an opening double quote.
    xml = xml.replaceAll("\"<([^<]*)>", "\"&lt;$1&gt;");
    // Closing tag immediately before a closing double quote.
    xml = xml.replaceAll("</([^<]*)>\"", "&lt;/$1&gt;\"");
    // A single remaining tag somewhere inside a quoted value.
    xml = xml.replaceAll("\"([^<]*)<([^<]*)>([^<]*)\"", "\"$1&lt;$2&gt;$3\"");
    return xml;
}
  • We are not parsing the XML. We just want to remove the characters that prevent the SAX parser in the next stage from parsing it. Commented Jul 5, 2018 at 12:20
  • Does this answer your question? removing invalid XML characters from a string in java Commented Sep 1, 2020 at 11:58

3 Answers

To match the tags, you can use this regular expression:

(?:<)(?<=<)(\/?\w*)(?=.*(?<=<\/html))(?:>)

  • (?:<) Match but don't capture <.
  • (?<=<) Positive lookbehind for <.
  • (\/?\w*) Capture tag name. Optional / and word characters.
  • (?=.*(?<=<\/html)) Positive lookahead that contains a positive lookbehind: it succeeds only if the text </html ends at or after the current position on the same line.
  • (?:>) Match but don't capture >.

To replace you can use:

  • &lt;$1&gt;

Where $1 is the result of the capture group in the regular expression. You can test the regular expression interactively here.

Using the following Java code:

public static void main(String[] args) {
    String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"></Meta></Message>";
    String newxml = replaceChars(xml);
    System.out.println(newxml);
}

private static String replaceChars(String xml) {
    // In a Java string literal the backslash in \w must be doubled, and / needs no escape.
    xml = xml.replaceAll("(?:<)(?<=<)(/?\\w*)(?=.*(?<=</html))(?:>)", "&lt;$1&gt;");
    return xml;
}

The output is:

<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;" Multi="false"><Meta source="system" dest="any"></Meta></Message>


4 Comments

It is partially correct. The whole output is : <Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;" Multi="false"> <Meta source="system" dest="any">&lt;/Meta&gt;&lt;/Message&gt; Observe the closing tags for Meta and Message. Basically, we would want to consider only those content which is between ""(double quotes).
@Chota Right, I get you. Please try (?:<)(?<=<)(\/?\w*)(?=.*(?<=<\/html))(?:>) here. Let me know and I will update my answer.
Yes, this is better, but it seems to expect that the value always ends with </html>, which is not the case. We might have a string that only contains some <br> tags.
It is trivial to add additional cases to the second lookbehind for tags you know you will want to match, i.e. (?<=<\/html|\/br)
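The constraint raised in these comments (touch only the text between double quotes, and do not depend on a closing </html>) could also be approached by matching each quoted run and escaping inside it. The following is only a rough sketch under strong assumptions: attribute values never contain a raw double quote, element text contains no double quotes, and the values are not already escaped. The class and method names (QuotedValueEscaper, escapeQuotedValues) are hypothetical and not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class QuotedValueEscaper {
    // One double-quoted run; assumes values never contain a raw double quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static String escapeQuotedValues(String xml) {
        Matcher m = QUOTED.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(1)
                    .replace("&", "&amp;")   // first, so the entities added below are not re-escaped
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            // quoteReplacement stops '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement("\"" + value + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\">"
                + "<Meta source=\"system\" dest=\"any\"></Meta></Message>";
        System.out.println(escapeQuotedValues(xml));
    }
}
```

Because the markup outside the quotes is copied through unchanged, the Meta and Message closing tags are left alone. Input that violates the assumptions above (for example a value that itself contains a double quote) would still be mangled.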

Please do not use regular expressions to escape special characters in XML.

Can you guarantee that this will work for all possible HTML input, with all the quirks of HTML and XML (both have very extensive specs)?

Just use one of the many utilities out there to escape XML strings.

Apache Commons is quite popular; please see this example.
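To illustrate what such a utility does, here is a minimal dependency-free sketch that escapes a raw value before it is placed into an attribute. It is a hand-written stand-in, not the library code: with Apache Commons Text on the classpath (an assumption), StringEscapeUtils.escapeXml10 covers the same five characters and also drops characters that are not legal in XML 1.0. The class and method names here (XmlEscapeSketch, escapeXml) are made up for the example.

```java
// Hypothetical stand-in for a library escaper; handles only the five predefined XML entities.
class XmlEscapeSketch {
    static String escapeXml(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "<html>Welcome User, <br> Happy to have you. <br>.</html>";
        // The value is escaped on its own, before being embedded in the surrounding markup.
        System.out.println("<Message text=\"" + escapeXml(raw) + "\"/>");
    }
}
```

The key point is that escaping is applied to the raw value by whoever builds the XML, not to the finished document after the fact.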


XML is not text. In fact, XML documents are a binary format.

Processing XML as text is the wrong approach, and only works in simple cases. Things to consider:

  • An XML document does not rely on a file encoding; its content encoding is specified IN the document (thus it must be read by an XML parser, which handles this correctly).
  • XML documents use XML entities (built-ins like &amp;, &lt;, &gt; and &quot;; others can be arbitrarily defined in a DTD, see https://www.w3resource.com/xml/entities.php).
  • XML documents can contain CDATA sections.

Therefore:

  • use a proper XML parser to read documents
  • perform manipulations (text replacement, add/remove nodes) on the DOM (document object model) or streaming model.
  • use a proper XML processor to write documents
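As a rough illustration of these three points, the sketch below builds the Message element from the question with the JDK's DOM API and serializes it with a Transformer, which takes care of escaping the attribute value. It assumes the producer can be changed to build the XML this way; the class and method names (MessageXmlWriter, buildMessage, readText) are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class MessageXmlWriter {
    // Builds the document in memory and lets the serializer escape the attribute value.
    static String buildMessage(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element message = doc.createElement("Message");
            message.setAttribute("text", text);   // raw value, no manual escaping
            message.setAttribute("Multi", "false");
            Element meta = doc.createElement("Meta");
            meta.setAttribute("source", "system");
            meta.setAttribute("dest", "any");
            message.appendChild(meta);
            doc.appendChild(message);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not write XML", e);
        }
    }

    // Parses the XML again and returns the unescaped attribute value.
    static String readText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("text");
        } catch (Exception e) {
            throw new IllegalStateException("could not read XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = buildMessage("<html>Welcome User, <br> Happy to have you. <br>.</html>");
        System.out.println(xml);
        System.out.println(readText(xml));
    }
}
```

Reading the result back with a parser returns the original HTML string, so the next stage sees the same value without any regex-based repair step.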

By the way, the XML in your example is NOT XML: it is malformed because no entities are used for <, > and " inside the attribute value.
