Simple way to do XML with HTML codes?

Question

I have an XML file, sample.xml, that contains the following:

<Tokens>
   <Token>Hello&nbsp;World</Token>
</Tokens>

I want to parse it, but I get errors when the parser reaches the &nbsp;.

I do not have access to the schema for the XML I am using (the one that defines Token or Tokens).

DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse("sample.xml");

Since I do not have the schema for my XML document, is there a way to have the parser completely ignore the HTML special characters while parsing?

Community · Accepted Answer · 2017-05-23 11:57:32Z

3

In XML, &nbsp; is an entity reference, but an undefined one, unless you provide a definition. You cannot make an XML parser ignore them, but you can define them, e.g. starting your document with

<!DOCTYPE Tokens [<!ENTITY nbsp "&#xa0;">]>

However, this is probably not useful if you are generating the XML file. You might just as well generate a document containing the real character U+00A0 NO-BREAK SPACE itself, or the character reference &#xa0; or its decimal equivalent &#160;.

Cf. the question "How do I define HTML entity references inside a valid XML document?"
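
As a rough illustration of the DOCTYPE approach above, the sketch below parses the sample content from a string after an internal DTD subset defining nbsp has been placed in front of it. The class name NbspDoctypeDemo and the helper firstTokenText are made up for this example, and it assumes the JDK's default DocumentBuilderFactory settings, which expand entity references.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original question or answer.
final class NbspDoctypeDemo {

    // Same content as sample.xml, preceded by a DOCTYPE that defines nbsp.
    static final String XML =
        "<!DOCTYPE Tokens [<!ENTITY nbsp \"&#xa0;\">]>\n"
        + "<Tokens>\n"
        + "   <Token>Hello&nbsp;World</Token>\n"
        + "</Tokens>";

    /** Parses the XML and returns the text of the first Token element. */
    static String firstTokenText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("Token").item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String text = firstTokenText(XML);
        // The entity has been replaced by the U+00A0 character it was defined as.
        System.out.println(text.equals("Hello\u00A0World"));
    }
}
```

This only helps when you can put the DOCTYPE into the text that the parser sees, either in the file itself or by prepending it before parsing.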

edited May 23, 2017 at 11:57

CommunityBot


answered Oct 4, 2013 at 18:36

Jukka K. Korpela


Raedwald · Accepted Answer · 2013-10-04 17:56:57Z

0

What you ask for is impossible, because to parse it as XML the entity must have a definition somewhere. To parse it as something other than XML, you need to write your own parser or use a tolerant parser. XML is not tag soup.

answered Oct 4, 2013 at 17:56

Raedwald


Sage · Accepted Answer · 2013-10-04 18:05:19Z

0

XML doesn't define &nbsp;, although XHTML does. See the list of predefined entities in XML (only &lt;, &gt;, &amp;, &apos; and &quot; are predefined).

The solution is to use the character reference for the Unicode non-breaking space, &#160;, when building the XML instead. In some cases a plain space (&#32;) works too. Alternatively, before parsing the XML you can try replacing &nbsp; with a plain space ' '.
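
To illustrate why a numeric character reference is a safe substitute, here is a small sketch showing that the decimal reference, the hexadecimal reference and the literal U+00A0 character all parse to the same text without any DTD. The class name CharRefDemo and the helper rootText are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original answer.
final class CharRefDemo {

    /** Parses a small XML string and returns the text of its root element. */
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Numeric character references need no entity declaration.
        String decimal = rootText("<Token>Hello&#160;World</Token>");
        String hex = rootText("<Token>Hello&#xa0;World</Token>");
        String literal = rootText("<Token>Hello\u00A0World</Token>");
        // All three variants yield the same text content.
        System.out.println(decimal.equals(hex) && hex.equals(literal));
    }
}
```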

answered Oct 4, 2013 at 18:05

Sage


pravat · Accepted Answer · 2013-10-04 18:07:41Z

0

I agree with Raedwald. But as a workaround, you can read the file as a string and replace the &nbsp; with spaces before parsing the document.
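
A minimal sketch of this workaround might look like the following. The class NbspReplaceDemo and its methods are hypothetical names, and the code assumes the file is UTF-8 without a byte order mark. Note that the replacement is purely textual, so it would also alter any &nbsp; that appears inside a comment or CDATA section.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original answer.
final class NbspReplaceDemo {

    /** Replaces every literal "&nbsp;" in the raw text with the given replacement. */
    static String replaceNbsp(String rawXml, String replacement) {
        return rawXml.replace("&nbsp;", replacement);
    }

    /** Parses XML held in a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Reads the file as text (UTF-8 assumed), patches it, then parses it. */
    static Document parseFile(Path file) throws IOException {
        String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return parse(replaceNbsp(raw, " "));
    }

    public static void main(String[] args) {
        // Inline copy of sample.xml so the demo runs without a file on disk.
        String raw = "<Tokens>\n   <Token>Hello&nbsp;World</Token>\n</Tokens>";
        Document doc = parse(replaceNbsp(raw, " "));
        System.out.println(doc.getElementsByTagName("Token").item(0).getTextContent());
    }
}
```

For the file from the question, the call would be NbspReplaceDemo.parseFile(Paths.get("sample.xml")). Passing "&#160;" instead of " " as the replacement keeps the space non-breaking.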

answered Oct 4, 2013 at 18:07

pravat
