How to remove all tags from an XML file using Java/Scala?

Question

I have XML content like this:

<p/>
<p>Highlighted Applications</p>
<p/>
<table>
<tbody>
<tr>    <td> 
<p>Projects </p>
</td>   <td>
<p>Description</p>
</td>
</tr>
<tr>    <td>
<p>VNC login for Windows Mobile devices</p>
</td>   <td>

It may contain custom tags that I don't know in advance. Is it possible to get the text from the above XML in Java/Scala without walking the XML tree and removing each tag one by one? I came across this, but it removes only unnecessary tags, not all tags. I am looking for a generic solution that can remove all tags or extract all the text from the XML.

Required Output:

Highlighted Applications
Projects
Description
VNC login for Windows Mobile devices

I'm open to any other approach or library suggestions.

Community · Accepted Answer · 2017-05-23 12:26:27Z

3

If you can get the entire content of your XML file as a String, I would suggest using replaceAll with the regex \<.*?\> like this:

str.replaceAll("\\<.*?\\>", "")

To remove all the empty lines, you can use:

str.replaceAll("(?m)^[ \t]*\r?\n", "")

For more on this, see: remove all empty lines

The final output should look like:

Highlighted Applications
Projects 
Description
VNC login for Windows Mobile devices
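
For readers who want to try the two replaceAll calls together, here is a minimal Java sketch. The class name TagStripper and the method name stripTags are hypothetical, introduced only for illustration. It assumes the whole document is already in a String and that no tag spans more than one line (in Java regex, `.` does not match line terminators by default).

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names TagStripper and stripTags are not from the answer.
final class TagStripper {
    // Non-greedy match of everything between '<' and the next '>' on the same line.
    private static final Pattern TAG = Pattern.compile("<.*?>");
    // A line that contains only spaces or tabs, together with its line terminator.
    private static final Pattern BLANK_LINE = Pattern.compile("(?m)^[ \t]*\r?\n");

    static String stripTags(String xml) {
        String withoutTags = TAG.matcher(xml).replaceAll("");
        return BLANK_LINE.matcher(withoutTags).replaceAll("");
    }
}
```

Calling TagStripper.stripTags(str) applies both steps in order. Trailing spaces inside an element, such as the one after "Projects", are kept as-is.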

edited May 23, 2017 at 12:26

CommunityBot


answered May 20, 2017 at 12:53

Youcef LAIDANI


5 Comments

Om Prakash Over a year ago

OMG! At first glance, I thought it wouldn't work for custom tags, but after testing, it worked like a charm. Unbelievable! Regex, long may you live. @YCF_L, thanks a ton buddy. You saved my day.

Dima Over a year ago

What about <foo><![CDATA[this text will be gone!]]></foo>?

Youcef LAIDANI Over a year ago

Yes @Dima, I answered according to the example that the OP shared. Do you have an idea? :)

Dima Over a year ago

I do ... Give me a moment.

Dima Over a year ago

@YCF_L just added my idea as another answer

Dima · Accepted Answer · 2017-05-21 13:31:23Z

The correct way to do it is something like this:

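// Collect the content of every text node, in document order.
def extractText(nodes: Seq[xml.Node]): Seq[String] = nodes.flatMap {
  case xml.Text(t) => Seq(t)               // text node: keep its content
  case n           => extractText(n.child) // any other node: recurse into its children
}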

Then you can do

extractText(xml.XML.loadString(xmlToParse))
  .filter(_.trim.nonEmpty) // drop whitespace-only nodes (".*\\S.*" would also drop text that contains a newline)
  .mkString("\n")

Regex, as the other answer suggests, is a simpler solution that will work most of the time but breaks down in some corner cases. (By the way, you don't need to escape < and > with backslashes, and \s is a metacharacter you can use instead of enumerating all possible whitespace symbols.)
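
One such corner case is the CDATA example from the comments on the other answer. The Java sketch below compares the regex with the JDK's built-in DOM parser on that input. The class and method names are hypothetical, and it assumes the input is a well-formed document with a single root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class, used only to illustrate the CDATA corner case.
final class CdataCornerCase {
    // Regex approach: removes everything between '<' and the next '>',
    // which includes the whole CDATA section.
    static String viaRegex(String xml) {
        return xml.replaceAll("<.*?>", "");
    }

    // Parser approach: the DOM parser decides what is markup and what is text.
    static String viaParser(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent concatenates all descendant text, CDATA included.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

For <foo><![CDATA[this text will be gone!]]></foo>, viaRegex returns an empty string, while viaParser returns the text inside the CDATA section.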

For the purists, here is also a tail-recursive version (helps particularly if your document structure is really-really-really deep :))

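import scala.annotation.tailrec

@tailrec
def extractText(nodes: Seq[xml.Node], result: List[String] = Nil): Seq[String] = nodes match {
  case Seq() => result.reverse // nothing left: restore document order
  case Seq(xml.Text(t), tail@_*) => extractText(tail, t :: result) // text node: collect it
  case Seq(head, tail@_*) => extractText(head.child ++ tail, result) // other node: visit its children first
}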
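
Since the question also mentions Java, here is a rough Java counterpart of the same idea, using only the JDK's DOM API. This is a sketch, not a port of the Scala code: the names TextExtractor, collect and extractText are illustrative, and it assumes the input has a single root element (a fragment like the one in the question would first need to be wrapped in one).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical Java counterpart of extractText; names are illustrative only.
final class TextExtractor {
    // Depth-first walk: collect text nodes, recurse into everything else.
    private static void collect(Node node, List<String> out) {
        if (node instanceof Text) {            // CDATASection is a subtype of Text
            String value = node.getNodeValue();
            if (!value.trim().isEmpty()) {     // skip whitespace-only nodes
                out.add(value);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Returns the non-blank text nodes of the document, one per line.
    static String extractText(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> lines = new ArrayList<>();
            collect(root, lines);
            return String.join("\n", lines);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

Unlike the tail-recursive Scala version, this walk uses ordinary recursion, so its stack depth grows with the nesting depth of the document.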

dumbPotato21 · Accepted Answer · 2017-05-20 13:06:44Z

0

Using Jsoup#text

Gets the combined text of this element and all its children. Whitespace is normalized and trimmed. For example, given HTML <p>Hello <b>there</b> now! </p>, p.text() returns "Hello there now!"

import org.jsoup.Jsoup;

String s = "..."; // your HTML/XML content goes here
System.out.println(Jsoup.parse(s).text());

edited May 20, 2017 at 13:06

answered May 20, 2017 at 12:59

dumbPotato21


2 Comments

Om Prakash Over a year ago

It gave me all the text on one line, without newlines. Is it possible to get the text with newline info?

dumbPotato21 Over a year ago

@OmPrakash Yes, but what you're trying to achieve cannot be done this simply. A regex would work here, but if you put all your HTML on one line, the regex would fail too. You would have to go element by element in that case. However, if you know that the HTML is correctly formatted, you should select the answer from YCF_L.
