1

for Example:

<div>
    this is first
    <div>
        second
   </div>
</div>

I am working on Natural Language Processing and I have to translate a website(not by using Google Translate) for which i have to extract both sentences "this is first" and "second" separately so that i can replace them with other language text in respective divs. If i extract text for first it will show "this is first second" and if I using recursion to dig deeper, it will only extract "second"

Help me out please!

EDIT

Using ownText() method will create problem in the following html code:

<div style="top:+0.2em; font-size:95%;">
    the
    <a href="/wiki/Free_content" title="Free content">
        free
    </a>
    <a href="/wiki/Encyclopedia" title="Encyclopedia">
        encyclopedia
    </a>
    that
    <a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">              
        anyone can edit
    </a>
    .
</div>

It will print:

the that.

free

encyclopedia

anyone can edit

But it must be:

the

that

.

encyclopedia

anyone can edit

1

4 Answers 4

2

If i extract text for first it will show "this is first second"

Use ownText() instead of text() and you'll get only the element contains directly.

Here's an example:

final String html = "<div>\n"
        + "    this is first\n"
        + "    <div>\n"
        + "        second\n"
        + "   </div>\n"
        + "</div>";

Document doc = Jsoup.parse(html); // Get your Document from somewhere


Element first = doc.select("div").first(); // Select 1st element - take the first found
String firstText = first.ownText(); // Get own text

Element second = doc.select("div > div").first(); // Same as above, but with 2nd div
String secondText = second.ownText();

System.out.println("1st: " + firstText);
System.out.println("2nd: " + secondText);
Sign up to request clarification or add additional context in comments.

Comments

1

You can use XML parser, in whatever language you are using. Here is for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/

Comments

1

It seems like you're using textContent in the div's to extract the content, which will get you the content of that element, and all descendent elements. (Java: this would be the getTextContent method on the Element)

Instead examine the childNodes (Java: getChildNodes method on the Element). The nodes have a property "nodeType" (Java: getNodeType) which you can look at to work out whether the node is a Text Node (Java: Node.TEXT_NODE), or an Element (Java: Node.ELEMENT_NODE). So to take you example you have a tree of Nodes which look like this...

div (Element)
    this is first (TextNode)
    div (Element)
        second (TextNode)

The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".

So loop over the nodes in the outer div, if the node is a text node, translate, otherwise recurse into the Element. Note that there are other kinds of nodes, Comments and the like, but for your purposes you can probably ignore those.

Assuming you're using the w3c DOM API http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html

Comments

0
 Elements divs=doc.getElementsByTag("div");

     for (Element element : divs) {
            System.out.println(element.text());

        }

This should work if you are using jsoup HTML parser.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.