Java StAX - error when parsing - Illegal character entity: expansion character code 0x19

Question

I am reading/parsing an XML file with javax.xml.stream.XMLStreamReader.
The file contains this piece of XML data as shown below.

<Row>
  <AccountName value="Paving 101" />
  <AccountNumber value="20205" />
  <AccountId value="15012" />
  <TimePeriod value="2019-08-20" />
  <CampaignName value="CMP Paving 101" />
  <CampaignId value="34283" />
  <AdGroupName value="residential paving" />
  <AdGroupId value="1001035" />
  <AdId value="790008" />
  <AdType value="Expanded text ad" />
  <DestinationUrl value="" />
  <BidMatchType value="Broad" />
  <Impressions value="1" />
  <Clicks value="1" />
  <Ctr value="100.00%" />
  <AverageCpc value="1.05" />
  <Spend value="1.05" />
  <AveragePosition value="2.00" />
  <SearchQuery value="concrete&#x19;driveway&#x19;repair&#x19;methods" />
</Row>

Unfortunately I am getting this error and I am not sure how to resolve it.

    Error in downloadXML: 
    com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x19
     at [row,col {unknown-source}]: [674,40]
        at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:606)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:479)
        at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2448)
        at com.ctc.wstx.sr.StreamScanner.validateChar(StreamScanner.java:2395)
        at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1218)
        at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:1929)
        at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3063)
        at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2961)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2837)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1072)

The problem seems to be with this character reference: &#x19;.
Of course I can first read the file simply as a text file, and replace this bad character, and only then parse it with XMLStreamReader but:
1) that approach seems really clumsy to me;
2) it will be a bit difficult to do as the code is quite involved there,
so I am not sure if I want to change it just for this character.

Why is the XMLStreamReader unable to handle this character?
Is the XML invalid, or does the parser have a bug and fail to handle it properly?

The XML is invalid. That character entity is in a forbidden range. (The correct term is that the XML is "not well-formed", which is worse than invalid.)
– kumesana, Commented Aug 22, 2019 at 9:19
The &#x19; character is not allowed in XML 1.0 (w3.org/TR/xml/#charsets). I'm not sure if it helps you, but the character is allowed in XML 1.1 (w3.org/TR/xml11/#charsets).
– mzjn, Commented Aug 22, 2019 at 9:24
@kumesana Which range is the forbidden range? Can you point me to some official/authoritative reference?
– peter.petrov, Commented Aug 22, 2019 at 9:24
My XML file has this <?xml version="1.0" encoding="utf-8"?> so it declares that it's 1.0. Right?
– peter.petrov, Commented Aug 22, 2019 at 9:25
XML doesn't actually exist in any other version but 1.0 anyway. Nothing supports anything but that. The definition of the valid character ranges is at w3.org/TR/xml/#charsets, and the following section says that character references must also be in that range: w3.org/TR/xml/#NT-CharRef
– kumesana, Commented Aug 22, 2019 at 9:27
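
The well-formedness failure described in these comments can be reproduced in isolation. The following is a minimal sketch that feeds a one-element document to whichever StAX implementation is on the classpath (the JDK's built-in one or Woodstox); the class and method names are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library.
class CharRefRepro {

    // Returns true if the document can be read to the end without a parse error.
    static boolean parsesCleanly(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            // Input that is not well-formed ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // &#x40; is a legal reference (the character '@').
        System.out.println(parsesCleanly("<SearchQuery value=\"a&#x40;b\" />"));
        // &#x19; refers to a control character outside the XML 1.0 range.
        System.out.println(parsesCleanly("<SearchQuery value=\"a&#x19;b\" />"));
    }
}
```

A conforming XML 1.0 parser should accept the first document and reject the second, which is the behaviour seen in the stack trace above.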

Indent · Accepted Answer · 2019-08-22 09:47:44Z

The characters &, < and > (as well as " or ' in attribute values) have special meaning in XML and generally need to be escaped.

They're escaped using XML entities; for example, & is written as &amp;.

Your XML is invalid for every conforming library. (You may need to get the producer of this XML content to correct it.)

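Edit: from https://www.w3.org/TR/xml/#NT-Char

Allowed range for a character reference:

Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';'
CharRef   ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
Char      ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

A CharRef must refer to a character that matches Char (well-formedness constraint: Legal Character). 0x19 does not.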
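
The legal character range from the XML 1.0 specification can be written as a small predicate. This is only a sketch; `Xml10Chars` and `isLegal` are hypothetical names, not part of StAX or Woodstox.

```java
// Hypothetical helper mirroring the Char production of XML 1.0.
final class Xml10Chars {
    private Xml10Chars() {}

    // True if the code point may appear in an XML 1.0 document,
    // either literally or via a character reference.
    static boolean isLegal(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegal(0x19)); // false: below 0x20 and not tab, LF or CR
        System.out.println(isLegal(0x40)); // true: the character '@'
    }
}
```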

edited Aug 22, 2019 at 9:47

answered Aug 22, 2019 at 9:20

Indent


4 Comments

kumesana Over a year ago

Yes, but something like &#x40; would be perfectly fine, and would be taken as another way to write the character @. The problem is that the hex value 19 is in a forbidden range.

peter.petrov Over a year ago

What does that mean "You need may be correct the producer of this XML content"? I get this XML file from an API? I guess it means I should notify the vendor of this API telling them their XML file is invalid?

Indent Over a year ago

To me, the producer API contains a bug. Try validating the XML with another XML validator: you will get an error concerning &#x19;. As @kumesana says, it's in a forbidden range.

Brian Agnew Over a year ago

"I guess it means I should notify the vendor of this API telling them their XML file is invalid?" - yup

Vanja D. · Accepted Answer · 2024-07-19 11:02:41Z

The problem is that the XML being parsed is malformed: it contains the &#x19; character reference, which is not within the legal character range in XML 1.0.

This code snippet removes such characters from malformed XML strings.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static String removeInvalidXmlCharacterReferences(
            String xmlString
    ) {
        // regex to match character references:
        // "&#(?:x([0-9a-fA-F]+)|([0-9]+));"
        Pattern pattern = Pattern.compile(
                "&#" + // all character references start with &#
                "(?:" + // non-capture group, containing either...
                "x([0-9a-fA-F]+)|" + // (1) hex character reference OR
                "([0-9]+)" + // (2) decimal character reference
                ");" // end group, followed by ";"
        );
        // contains invalid references found in the content
        Set<String> invalidReferences = new HashSet<>();
        Matcher matcher = pattern.matcher(xmlString);
        while (matcher.find()) {
            String reference = matcher.group(0); // "&#2;" or "&#xB;"
            String hexMatch = matcher.group(1);  // "B"
            String intMatch = matcher.group(2);  // "2"
            int character;
            try {
                character = hexMatch != null ?
                        Integer.parseInt(hexMatch, 16) :
                        Integer.parseInt(intMatch);
            } catch (NumberFormatException e) {
                // too large for an int, so certainly out of range
                character = -1;
            }
            if (
                    character != 0x9 &&
                    character != 0xA &&
                    character != 0xD &&
                    (character < 0x20 || character > 0xD7FF) &&
                    (character < 0xE000 || character > 0xFFFD) &&
                    (character < 0x10000 || character > 0x10FFFF)
            ) {
                // character is out of valid range
                // add "&#xB;" to invalid references
                invalidReferences.add(reference);
            }
        }
        if (invalidReferences.isEmpty()) {
            // no invalid references found, do not sanitize
            return xmlString;
        }
        // create a regex like: "&#2;|&#xB;"
        // (references contain no regex metacharacters, so no quoting is needed)
        String invalidRefsRegex = String.join("|", invalidReferences);
        // remove "&#2;" or "&#xB;" from the XML
        return xmlString.replaceAll(invalidRefsRegex, "");
    }

It should be noted that illegal characters should be removed by the producer of the XML, but sometimes you don't have that option.
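
When the feed cannot be fixed at the source, another option is to substitute the offending references rather than delete them, since in the question's data &#x19; appears to sit where a word separator would be. The following is a minimal single-pass sketch, assuming the whole document fits in memory as a String. `XmlCharRefSanitizer`, `sanitize`, `firstSearchQuery` and the `replacement` parameter are hypothetical names; only standard `java.util.regex` and `javax.xml.stream` calls are used.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library.
class XmlCharRefSanitizer {
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // Replaces every character reference outside the XML 1.0 range with
    // `replacement`, which must itself be safe to place in XML text.
    static String sanitize(String xml, String replacement) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean legal;
            try {
                int cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                legal = cp == 0x9 || cp == 0xA || cp == 0xD
                        || (cp >= 0x20 && cp <= 0xD7FF)
                        || (cp >= 0xE000 && cp <= 0xFFFD)
                        || (cp >= 0x10000 && cp <= 0x10FFFF);
            } catch (NumberFormatException tooLarge) {
                legal = false; // does not fit in an int, so out of range
            }
            String text = legal ? m.group() : replacement;
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the value attribute of the first SearchQuery element, or null.
    static String firstSearchQuery(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "SearchQuery".equals(reader.getLocalName())) {
                        return reader.getAttributeValue(null, "value");
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("XML is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<Row><SearchQuery value=\"concrete&#x19;driveway&#x19;repair\" /></Row>";
        System.out.println(firstSearchQuery(sanitize(raw, " ")));
    }
}
```

Whether to drop the reference or replace it with a space is a judgment call about the data. Note also that a plain regex pass cannot tell a real reference from the same text inside a CDATA section or a comment.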

A version of the function is available as a more verbose XmlDeserUtils gist, which can be easily re-used.

This function was originally authored by Nicholas DiPiazza in this SO answer.

References:

W3C XML 1.0 Character sets (see character range):

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

W3C XML 1.0 Character and Entity References (see character references):

[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
