I am reading/parsing an XML file with javax.xml.stream.XMLStreamReader.
The file contains this piece of XML data as shown below.

<Row>
  <AccountName value="Paving 101" />
  <AccountNumber value="20205" />
  <AccountId value="15012" />
  <TimePeriod value="2019-08-20" />
  <CampaignName value="CMP Paving 101" />
  <CampaignId value="34283" />
  <AdGroupName value="residential paving" />
  <AdGroupId value="1001035" />
  <AdId value="790008" />
  <AdType value="Expanded text ad" />
  <DestinationUrl value="" />
  <BidMatchType value="Broad" />
  <Impressions value="1" />
  <Clicks value="1" />
  <Ctr value="100.00%" />
  <AverageCpc value="1.05" />
  <Spend value="1.05" />
  <AveragePosition value="2.00" />
  <SearchQuery value="concrete&#x19;driveway&#x19;repair&#x19;methods" />
</Row>

Unfortunately I am getting this error and I am not sure how to resolve it.

    Error in downloadXML: 
    com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x19
     at [row,col {unknown-source}]: [674,40]
        at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:606)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:479)
        at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2448)
        at com.ctc.wstx.sr.StreamScanner.validateChar(StreamScanner.java:2395)
        at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1218)
        at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:1929)
        at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3063)
        at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2961)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2837)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1072)

The problem seems to be the character reference &#x19;.
Of course, I could first read the file as plain text, strip out this bad character, and only then parse it with XMLStreamReader, but:
1) that approach seems really clumsy to me;
2) it would be somewhat difficult to do, because the code there is quite involved,
so I am not sure I want to change it just for this one character.

Why is XMLStreamReader unable to handle this character?
Is the XML invalid, or does the parser have a bug that keeps it from handling this correctly?

  • The XML is invalid. That character entity is in a forbidden range. (The correct term is, the XML is "not well-formed". It's worse than invalid.) Commented Aug 22, 2019 at 9:19
  • The &#x19; character is not allowed in XML 1.0 (w3.org/TR/xml/#charsets). I'm not sure if it helps you, but the character is allowed in XML 1.1 (w3.org/TR/xml11/#charsets). Commented Aug 22, 2019 at 9:24
  • @kumesana Which range is the forbidden range? Can you point me to some official/authoritative reference? Commented Aug 22, 2019 at 9:24
  • My XML file has this <?xml version="1.0" encoding="utf-8"?> so it declares that it's 1.0. Right? Commented Aug 22, 2019 at 9:25
  • XML doesn't actually exist in any other version but 1.0 anyway. Nothing supports anything but that. The valid character ranges are defined in w3.org/TR/xml/#charsets, and the following section says that a character reference must also be in that range: w3.org/TR/xml/#NT-CharRef Commented Aug 22, 2019 at 9:27

2 Answers

The characters &, < and > (as well as " or ' in attribute values) have special meaning in XML and generally cannot be written literally.

They are escaped using XML entities; in this case you would want &amp; for &.

Any conforming parser will reject your XML. (You may need to get the producer of this XML content to correct it.)

**Edit** from https://www.w3.org/TR/xml/#NT-Char

Allowed range for a character reference:

Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';'
CharRef   ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
Char      ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

A character referred to by a CharRef must match the Char production.
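
As a rough illustration of the range rule in XML 1.0 (production [2], Char), the check can be written as a small Java predicate. The class and method names here are hypothetical and only for illustration; the numeric ranges are the ones quoted from the specification.

```java
/** Hypothetical helper: tests a code point against the XML 1.0 Char production. */
final class Xml10Chars {
    private Xml10Chars() {}

    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isLegal(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

With this predicate, 0x19 (the value in &#x19;) is rejected because it is below 0x20 and is not tab, line feed, or carriage return, while 0x40 (the value in &#x40;) is accepted.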

4 Comments

Yes, but something like &#x40; would be perfectly fine and would be read as another way to write the character @. The problem is that the hex value 19 is in a forbidden range.
What does it mean that the producer of this XML content may need to be corrected? I get this XML file from an API. I guess it means I should notify the vendor of this API and tell them their XML file is invalid?
In my view the producer API has a bug: try validating the XML with another XML validator and you will get an error about &#x19;. As @kumesana says, it's in a forbidden range.
"I guess it means I should notify the vendor of this API telling them their XML file is invalid?" - yup
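
One of the comments above suggests checking the document with a second XML parser. A minimal sketch of such a check, assuming the JDK's built-in DOM parser is the one that DocumentBuilderFactory returns (this depends on the classpath), could look like this. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: asks a second parser whether a document is well-formed. */
final class WellFormednessCheck {
    private WellFormednessCheck() {}

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps them off stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // the parser rejected the document
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the JDK parser, a document such as `<a v="&#x19;"/>` should be reported as not well-formed, while `<a v="&#x40;"/>` should pass.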

The problem is that the XML being parsed is malformed: it contains the character reference &#x19;, which is outside the legal character range in XML 1.0.

This code snippet removes such characters from malformed XML strings.

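    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // place this method inside any utility class
    public static String removeInvalidXmlCharacterReferences(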
            String xmlString
    ) {
        // regex to match character references:
        // "&#(?:x([0-9a-fA-F]+)|([0-9]+));"
        Pattern pattern = Pattern.compile(
                "&#" + // all character references start with &#
                "(?:" + // non-capture group, containing either...
                "x([0-9a-fA-F]+)|" + // (1) hex character reference OR
                "([0-9]+)" + // (2) decimal character reference
                ");" // end group, followed by ";"
        );
        // contains invalid references found in the content
        Set<String> invalidReferences = new HashSet<>();
        Matcher matcher = pattern.matcher(xmlString);
        while (matcher.find()) {
            String reference = matcher.group(0); // "&#2;" or "&#xB;"
            String hexMatch = matcher.group(1);  // "B"
            String intMatch = matcher.group(2);  // "2"
            int character = hexMatch != null ?
                    Integer.parseInt(hexMatch, 16) :
                    Integer.parseInt(intMatch);
            if (
                    character != 0x9 &&
                    character != 0xA &&
                    character != 0xD &&
                    (character < 0x20 || character > 0xD7FF) &&
                    (character < 0xE000 || character > 0xFFFD) &&
                    (character < 0x10000 || character > 0x10FFFF)
            ) {
                // character is outside every valid range
                // add e.g. "&#xB;" to invalid references
                invalidReferences.add(reference);
            }
        }
        if (invalidReferences.isEmpty()) {
            // no invalid references found, do not sanitize
            return xmlString;
        }
        // create a regex like: "&#2;|&#xB;"
        String invalidRefsRegex = String.join("|", invalidReferences);
        // remove "&#2;" or "&#xB;" from the XML
        return xmlString.replaceAll(invalidRefsRegex, "");
    }

Ideally the producer of the XML should remove illegal characters, but sometimes you don't have that option.
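
For the case in the question, the sanitizing step can sit directly in front of the XMLStreamReader. The following is only a sketch under a few assumptions: the whole document fits in memory as a String, dropping the illegal reference (rather than substituting something for it) is acceptable, and the class and method names are hypothetical. It uses only the JDK's javax.xml.stream API.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example: strip illegal character references, then parse with StAX. */
final class SanitizedStaxExample {
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    private SanitizedStaxExample() {}

    /** XML 1.0 Char production. */
    static boolean isLegalXml10Char(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    private static boolean isLegalRef(Matcher m) {
        try {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            return isLegalXml10Char(codePoint);
        } catch (NumberFormatException e) {
            // does not fit in an int, so it is far beyond 0x10FFFF
            return false;
        }
    }

    /** Copies the input, skipping every character reference that is not legal. */
    static String stripIllegalCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int last = 0;
        while (m.find()) {
            if (!isLegalRef(m)) {
                out.append(xml, last, m.start());
                last = m.end();
            }
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }

    /** Returns the value attribute of the first SearchQuery element, or null. */
    static String readSearchQuery(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(stripIllegalCharRefs(xml)));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "SearchQuery".equals(reader.getLocalName())) {
                    return reader.getAttributeValue(null, "value");
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

Note that removing the reference joins the neighbouring words, so a value like "concrete&#x19;driveway" is read back as "concretedriveway". If that matters, the skipped reference could be replaced with a space instead.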

A version of the function is available as a more verbose XmlDeserUtils gist, which can be easily re-used.

This function was originally authored by Nicholas DiPiazza in this SO answer.

References:

W3C XML 1.0 Character sets (see character range):

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

W3C XML 1.0 Character and Entity References (see character references):

[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
