Java: How do I extract separated text from nested <div> in HTML?

Question

For example:

<div>
    this is first
    <div>
        second
   </div>
</div>

I am working on natural language processing and have to translate a website (without using Google Translate). To do that, I need to extract the sentences "this is first" and "second" separately, so that I can replace them with text in another language in their respective divs. If I extract the text of the first div, I get "this is first second", and if I use recursion to dig deeper, I only get "second".

Help me out please!

EDIT

Using the ownText() method causes a problem with the following HTML:

<div style="top:+0.2em; font-size:95%;">
    the
    <a href="/wiki/Free_content" title="Free content">
        free
    </a>
    <a href="/wiki/Encyclopedia" title="Encyclopedia">
        encyclopedia
    </a>
    that
    <a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">              
        anyone can edit
    </a>
    .
</div>

It will print:

the that.

free

encyclopedia

anyone can edit

But it must be:

the

that

.

encyclopedia

anyone can edit

Consider providing an actual runnable example that demonstrates your problem. This will result in less guesswork and better responses.
– MadProgrammer, Commented Jun 3, 2014 at 7:10

ollo · Accepted Answer · 2014-06-03 11:39:18Z

2

If i extract text for first it will show "this is first second"

Use ownText() instead of text() and you'll get only the text that the element contains directly.

Here's an example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextExample {
    public static void main(String[] args) {
        final String html = "<div>\n"
                + "    this is first\n"
                + "    <div>\n"
                + "        second\n"
                + "   </div>\n"
                + "</div>";

        Document doc = Jsoup.parse(html); // Get your Document from somewhere

        // First div in document order, i.e. the outer one
        Element first = doc.select("div").first();
        String firstText = first.ownText(); // Only the text directly inside this element

        // First div that is a direct child of another div, i.e. the inner one
        Element second = doc.select("div > div").first();
        String secondText = second.ownText();

        System.out.println("1st: " + firstText);
        System.out.println("2nd: " + secondText);
    }
}

Daniel · Answer · 2014-06-03 07:56:15Z

1

You can use an XML parser in whatever language you are using. Here is an example for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/

Peter Bagnall · Answer · 2014-06-03 08:41:37Z

It seems like you're using textContent on the divs to extract the content, which gives you the text of that element and of all its descendant elements. (Java: this would be the getTextContent method on the Element.)

Instead, examine the childNodes (Java: the getChildNodes method on the Element). Each node has a property "nodeType" (Java: getNodeType) that tells you whether the node is a text node (Java: Node.TEXT_NODE) or an element (Java: Node.ELEMENT_NODE). Taking your example, you have a tree of nodes that looks like this:

div (Element)
    this is first (TextNode)
    div (Element)
        second (TextNode)

The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".

So loop over the nodes in the outer div: if a node is a text node, translate it; otherwise recurse into the element. Note that there are other kinds of nodes, such as comments, but for your purposes you can probably ignore those.

This assumes you're using the W3C DOM API: http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
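
The loop described above could be sketched roughly as follows. This is only an illustration and rests on one assumption: that the markup is well-formed enough to be parsed as XML by the JDK's DocumentBuilder, which real-world HTML often is not. The class name TextNodeWalker and the methods collectTexts and extractTexts are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class TextNodeWalker {

    // Walks the direct children of 'node'. Text nodes are collected
    // (trimmed, blanks skipped); elements are visited recursively.
    static void collectTexts(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collectTexts(child, out);
            }
            // Comments and other node types are ignored.
        }
    }

    // Assumption: 'markup' is well-formed XML with a single root element.
    static List<String> extractTexts(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            List<String> out = new ArrayList<>();
            collectTexts(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup as XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<div>\n"
                + "    this is first\n"
                + "    <div>\n"
                + "        second\n"
                + "   </div>\n"
                + "</div>";
        // Each piece of text is reported separately, in document order.
        for (String text : extractTexts(html)) {
            System.out.println(text);
        }
    }
}
```

Because every text node is handled on its own, pieces of text that sit between child elements (as in the question's EDIT) stay separate rather than being joined. To write a translation back, calling setNodeValue on the same text node should work in place of collecting the text.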

Aniket Kulkarni · Answer · 2014-06-03 08:46:32Z

0

Elements divs = doc.getElementsByTag("div");

for (Element element : divs) {
    System.out.println(element.text());
}

This should work if you are using the jsoup HTML parser.
