I have XML content like this:

<p/>
<p>Highlighted Applications</p>
<p/>
<table>
<tbody>
<tr>    <td> 
<p>Projects </p>
</td>   <td>
<p>Description</p>
</td>
</tr>
<tr>    <td>
<p>VNC login for Windows Mobile devices</p>
</td>   <td>

It may contain custom tags that I don't know in advance. Is it possible to get the text from the XML above in Java/Scala without walking the XML tree and removing each tag one by one? I came across this, but it removes only unnecessary tags rather than all tags. I am looking for a generic solution that can remove all tags, or get all the text, from XML.

Required Output:

Highlighted Applications
Projects
Description
VNC login for Windows Mobile devices

I'm open to any other approach or library suggestion.

3 Answers

If you can get the whole content of your XML file as a String, I would suggest using replaceAll with the regex \<.*?\> like this:

str.replaceAll("\\<.*?\\>", "")

To remove all the empty lines, you can use:

str.replaceAll("(?m)^[ \t]*\r?\n", "")

For more on this, see the related question "remove all empty lines".

The output should then look like this:

Highlighted Applications
Projects 
Description
VNC login for Windows Mobile devices
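
A minimal sketch of the two steps combined, assuming the whole document fits in one String. The names TagStripper and stripTags are made up for illustration; the regexes are the ones above, written without the optional backslashes around < and >:

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Anything that looks like a tag: '<', the shortest possible run of characters, then '>'.
    // '.' does not match line terminators here, so a tag split across lines is NOT removed.
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // A line that is empty or holds only spaces/tabs, together with its line terminator.
    private static final Pattern BLANK_LINE = Pattern.compile("(?m)^[ \t]*\r?\n");

    static String stripTags(String xml) {
        String withoutTags = TAG.matcher(xml).replaceAll("");
        return BLANK_LINE.matcher(withoutTags).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input modelled on the question; the custom <note> tag is hypothetical.
        String xml = "<p/>\n"
                + "<p>Highlighted Applications</p>\n"
                + "<table>\n"
                + "<tr>    <td>\n"
                + "<p>Projects </p>\n"
                + "</td>   <td>\n"
                + "<note>Description</note>\n"
                + "</td>\n"
                + "</tr>\n";
        System.out.print(stripTags(xml));
    }
}
```

Text lines keep their inner and trailing whitespace (for example "Projects " above), so call trim() on each line if that matters. Entities such as &amp;lt; are also left undecoded by this approach.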

5 Comments

OMG! At first look, I thought it won't work for custom tag, but after testing it worked like a charm. Unbelievable! Regex, you live long. @YCF_L, thanks a ton buddy. You saved my day.
What about <foo><![CDATA[this text will be gone!]]></foo>?
Yes @Dima, I answered according to the example that the OP shared. Do you have an idea? :)
I do ... Give me a moment.
@YCF_L just added my idea as another answer

The correct way to do it is something like this:

def extractText(nodes: Seq[xml.Node]): Seq[String] =  nodes.flatMap {
 case xml.Text(t) => Seq(t)
 case n => extractText(n.child)
}

Then you can do

extractText(xml.XML.loadString(xmlToParse))
  .filter(_.matches(".*\\S.*"))
  .mkString("\n")

Regex, as the other answer suggests, is a simpler solution that will work most of the time but break down on some corner cases. (BTW, you don't need to escape < and > with backslashes, and \s is a metacharacter you can use instead of enumerating all possible whitespace symbols.)

For the purists, here is also a tail-recursive version (helps particularly if your document structure is really-really-really deep :))

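import scala.annotation.tailrec

@tailrec
def extractText(nodes: Seq[xml.Node], result: List[String] = Nil): Seq[String] = nodes match {
  // Nothing left to visit: the accumulator was built in reverse order
  case Seq() => result.reverse
  // Text node: keep its content and continue with the remaining nodes
  case Seq(xml.Text(t), tail@_*) => extractText(tail, t :: result)
  // Any other node: visit its children before the remaining nodes
  case Seq(head, tail@_*) => extractText(head.child ++ tail, result)
}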
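
For Java, the same parse-then-collect idea might look like the sketch below, which uses the JDK's built-in StAX reader instead of scala.xml. XmlTextExtractor and extractText are hypothetical names, and the input is assumed to be a well-formed document with a single root element (a fragment like the one in the question would need to be wrapped first). Unlike the regex, it keeps CDATA content, which is the case raised in the comments on the other answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmlTextExtractor {
    // Returns every non-blank piece of character data, in document order.
    static List<String> extractText(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities in untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        // Ask the parser to merge adjacent character data, including CDATA sections.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);

        List<String> texts = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                    String text = reader.getText().trim();
                    if (!text.isEmpty()) {
                        texts.add(text);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        // The <root> wrapper is an assumption; the question's snippet has no single root.
        String xml = "<root><p/><p>Highlighted Applications</p>"
                + "<foo><![CDATA[this text will be kept]]></foo></root>";
        System.out.println(String.join("\n", extractText(xml)));
    }
}
```

Joining with "\n" gives one line per text node, whatever the line breaks in the input were.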

You can use Jsoup's Element#text(). From its documentation:

Gets the combined text of this element and all its children. Whitespace is normalized and trimmed. For example, given HTML <p>Hello <b>there</b> now! </p>, p.text() returns "Hello there now!"

import org.jsoup.Jsoup;

String s = "<p>Hello <b>there</b> now! </p>"; // replace with your HTML/XML content
System.out.println(Jsoup.parse(s).text());

2 Comments

It gave me all the text on one line, without newlines. Is it possible to get the text with newline info?
@OmPrakash Yes, but what you're trying to achieve cannot be done this simply. A regex would work here, but if all your HTML were on one line, the regex would fail too, and you would have to go element by element. However, if you know that the HTML is correctly formatted, you should select the answer from YCF_L.
