Java XPath API Stripping HTML Tags from Text

Question

I am currently using the Java XPath API to extract some text from a String.

This String, however, often contains HTML formatting tags (such as <b>). When I run my code, the HTML tags are stripped. Is there any way to avoid this?

Here is a sample input:

<document>
    <summary>
    The <b>dog</b> jumped over the fence.
    </summary>
</document>

Here is a snippet of my code:

XPathFactory factory = XPathFactory.newInstance();  
XPath xPath = factory.newXPath();
InputSource source = new InputSource(new StringReader(xml));
String output = xPath.evaluate("/document/summary", source);

Here is the current output:

The dog jumped over the fence.

Here is the output I want:

The <b>dog</b> jumped over the fence.

Thanks in advance for all your help.

Do you have the ability to change what the method xPath.evaluate(string, var) returns? For example, by looking at the XPath dot operator and seeing whether you can avoid the bold text?
– ElementCR, Commented May 10, 2017 at 18:07

vanje · Accepted Answer · 2017-05-10 19:19:21Z

A simple, straightforward (but maybe not very efficient) solution:

/**
 * Serializes an XML node to a string representation without the XML declaration.
 * 
 * @param node The XML node
 * @return The string representation
 * @throws TransformerFactoryConfigurationError
 * @throws TransformerException
 */
private static String node2String(Node node) throws TransformerFactoryConfigurationError, TransformerException {
  final Transformer transformer = TransformerFactory.newInstance().newTransformer();
  transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
  final StringWriter writer = new StringWriter();
  transformer.transform(new DOMSource(node), new StreamResult(writer));
  return writer.toString();
}

/**
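 * Serializes the inner (child) nodes of an XML element.
 * @param el The element whose child nodes are serialized
 * @return The concatenated string representation of the child nodes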
 * @throws TransformerFactoryConfigurationError
 * @throws TransformerException
 */
private static String elementInner2String(Element el) throws TransformerFactoryConfigurationError, TransformerException {
  final NodeList children = el.getChildNodes();
  final StringBuilder sb = new StringBuilder();
  for(int i = 0; i < children.getLength(); i++) {
    final Node child = children.item(i);
    sb.append(node2String(child));
  }
  return sb.toString();
}

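Then the XPath evaluation should return the node instead of the string (doc is assumed here to be the parsed org.w3c.dom.Document, and xPath is the XPath instance from the question):

Element summaryElement = (Element) xPath.evaluate("/document/summary", doc, XPathConstants.NODE);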
String output = elementInner2String(summaryElement);
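
For reference, here is a minimal end-to-end sketch that puts the pieces above together. The class name InnerXmlDemo and the helper innerXml are hypothetical names chosen for illustration, and the sketch assumes the input is well-formed XML. It reuses a single Transformer for all child nodes rather than creating one per node.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnerXmlDemo {

    // Hypothetical helper: returns the serialized child nodes of the element
    // selected by the XPath expression, or null if nothing matches.
    static String innerXml(String xml, String expression) throws Exception {
        // Parse the string into a DOM tree so XPath can return a node.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        Element el = (Element) xPath.evaluate(expression, doc, XPathConstants.NODE);
        if (el == null) {
            return null;
        }

        // Identity transformer that omits the <?xml ...?> declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Serialize each child node in document order into one buffer.
        StringWriter writer = new StringWriter();
        NodeList children = el.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            transformer.transform(new DOMSource(children.item(i)), new StreamResult(writer));
        }
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document><summary>The <b>dog</b> jumped over the fence.</summary></document>";
        System.out.println(innerXml(xml, "/document/summary"));
    }
}
```

With the sample input from the question, the printed text should keep the <b> element. Leading and trailing whitespace inside summary is preserved, so call trim() on the result if that matters.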

Vitaliy · Accepted Answer · 2017-05-10 19:24:03Z

The <b>dog</b> jumped over the fence

Get the child nodes of the summary element. You will have two text nodes and one element node. Treat each according to its type.
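
A rough sketch of that idea follows. The class ChildNodeWalk and the methods rebuild and summaryMarkup are hypothetical names for illustration. As a simplification, this version ignores attributes and does not re-escape special characters such as & in text, so it only suits simple markup like the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildNodeWalk {

    // Rebuilds the inner markup of an element by visiting each child node
    // and handling text nodes and element nodes differently.
    static String rebuild(Element parent) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    // Plain text: append as-is (not re-escaped in this sketch).
                    sb.append(child.getNodeValue());
                    break;
                case Node.ELEMENT_NODE: {
                    // Element: re-emit its tag around its own rebuilt content.
                    Element el = (Element) child;
                    sb.append('<').append(el.getTagName()).append('>');
                    sb.append(rebuild(el));
                    sb.append("</").append(el.getTagName()).append('>');
                    break;
                }
                default:
                    // Comments, processing instructions, etc. are skipped.
                    break;
            }
        }
        return sb.toString();
    }

    // Parses the XML and rebuilds the markup inside the first <summary> element.
    static String summaryMarkup(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element summary = (Element) doc.getElementsByTagName("summary").item(0);
        return rebuild(summary);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document><summary>The <b>dog</b> jumped over the fence.</summary></document>";
        System.out.println(summaryMarkup(xml));
    }
}
```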

eDog · Accepted Answer · 2017-05-11 20:02:36Z

The parser reads the input as XML and classifies the contents of the summary node as text, element, text. When you evaluate /document/summary as a string, XPath returns the string value of the selected node, which is the concatenation of all of its descendant text nodes. This gives you text + element text + text, which is why you lose the bold tag. To keep the markup, the content of summary should be either:

HTML-encoded, or
wrapped in a CDATA section.

Wrapping the content in a CDATA section makes the parser treat it as text:

<document>
    <summary>
    <![CDATA[The <b>dog</b> jumped over the fence.]]>
    </summary>
</document>
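
To illustrate the difference, here is a small sketch that runs the same XPath call from the question against three variants of the input: plain markup, CDATA-wrapped, and HTML-encoded. The class name CdataDemo and the helper summaryText are hypothetical names for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class CdataDemo {

    // Evaluates /document/summary as a string, exactly as in the question.
    static String summaryText(String xml) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate("/document/summary", new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // <b> is parsed as an element, so only its text survives.
        String plain =
            "<document><summary>The <b>dog</b> jumped over the fence.</summary></document>";
        // Inside CDATA, <b> is just characters.
        String cdata =
            "<document><summary><![CDATA[The <b>dog</b> jumped over the fence.]]></summary></document>";
        // Entity-encoded markup is also just characters after parsing.
        String encoded =
            "<document><summary>The &lt;b&gt;dog&lt;/b&gt; jumped over the fence.</summary></document>";

        System.out.println(summaryText(plain));   // The dog jumped over the fence.
        System.out.println(summaryText(cdata));   // The <b>dog</b> jumped over the fence.
        System.out.println(summaryText(encoded)); // The <b>dog</b> jumped over the fence.
    }
}
```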

The problem with your approach is that the parser expects the content of summary to be well-formed XML. If there were an unbalanced tag inside summary, parsing would throw an exception.

The solution to your question would be to loop over the child nodes to get the text data while preserving the node names. This may work for your example. However, if you have an unbalanced tag, it will break:

The <b>dog</b> jumped over <br> the fence

Don't use this approach to parse the data between the summary tags. Instead, either use CDATA or use some sort of regex to get the content between the start and end points.
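
As a rough sketch of the regex route: the code below first shows that the unbalanced input is rejected by the XML parser, then pulls the raw content out with a pattern. The class RawSummaryDemo, the methods isWellFormed and rawSummary, and the pattern itself are assumptions for illustration. The pattern assumes a single <summary> element with no attributes and no nested <summary>.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RawSummaryDemo {

    // Non-greedy match of everything between <summary> and </summary>;
    // DOTALL lets the content span multiple lines.
    private static final Pattern SUMMARY =
            Pattern.compile("<summary>(.*?)</summary>", Pattern.DOTALL);

    // Returns the raw text between the summary tags, or null if not found.
    static String rawSummary(String text) {
        Matcher m = SUMMARY.matcher(text);
        return m.find() ? m.group(1).trim() : null;
    }

    // Returns true if the XML parser accepts the input, false if it rejects it.
    // Note: the default parser may also log the error to stderr.
    static boolean isWellFormed(String xml) throws ParserConfigurationException, IOException {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document><summary>The <b>dog</b> jumped over <br> the fence</summary></document>";
        System.out.println("well-formed: " + isWellFormed(xml));
        System.out.println("raw summary: " + rawSummary(xml));
    }
}
```

The regex never builds a DOM tree, so the unclosed <br> does not matter to it. The trade-off is that it knows nothing about XML structure, so entities and CDATA markers in the content are returned verbatim.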

edited May 11, 2017 at 20:02

answered May 10, 2017 at 18:11

3 Comments

user1472409 Over a year ago

Thanks for your help. The input is coming from a static database so I am not sure if I can edit the data.

VGR Over a year ago

The solution is correct, but … are not “invalid.” They simply represent an XML element—part of the document structure, rather than text. Placing everything in a CDATA causes the entire contents to be treated like text instead.

eDog Over a year ago

@VGR - you are right - not invalid for the parser, just different element types. Updated to show more information.
