3

I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments are stored in JavaScript variables, and Jsoup ignores them because they are script contents. What would be the best way to get those fragments out?

Example:

<!DOCTYPE html>
<html>
<head>
    <script>
        var html = "<span>some text</span>";
    </script>
</head>
<body>
    <p>text</p>
</body>
</html>

In this example Jsoup only picks up the text from the p tag, which is what it's supposed to do. How do I also pick up the text from the span inside var html? The solution must work across thousands of different pages, so I can't rely on something like the JavaScript variable always having the same name.

1
  • Are you at least sure that the HTML content is within double quotes and that no other content exists in double quotes within the <script> tag? Commented Jul 29, 2013 at 21:42

2 Answers

6

You can use Jsoup to read the contents of every <script> tag as DataNode objects.

DataNode

A data node, for contents of style, script tags etc, where contents should not show in text().

Elements scriptTags = doc.getElementsByTag("script");

This gives you every <script> element in the document.

You can then call getWholeData() on each element's data nodes to get the raw script contents.

// Get the data contents of this node.
String getWholeData()

for (Element tag : scriptTags) {
    // dataNodes() returns the child DataNodes that hold the raw script text
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}

Jsoup API - DataNode
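
Since the variable name cannot be relied on, one possible follow-up is to scan the raw script text for quoted string literals and keep only those that look like markup. Below is a minimal sketch that uses only java.util.regex. The class ScriptMarkupLiterals and the method markupLiterals are hypothetical names, not part of Jsoup. The heuristic assumes the markup sits in ordinary single- or double-quoted literals; template literals, string concatenation, and comments containing quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jsoup.
class ScriptMarkupLiterals {

    // A double- or single-quoted JavaScript string literal, allowing backslash escapes.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"|'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'");

    // Something that looks like an opening or closing HTML tag.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    // Returns the body of every string literal in scriptData that contains a tag.
    static List<String> markupLiterals(String scriptData) {
        List<String> fragments = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptData);
        while (m.find()) {
            String body = m.group(1) != null ? m.group(1) : m.group(2);
            // Undo only the most common escapes: \" \' \/ and \\
            String unescaped = body.replaceAll("\\\\([\"'/\\\\])", "$1");
            if (TAG.matcher(unescaped).find()) {
                fragments.add(unescaped);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by node.getWholeData().
        String scriptData = "var html = \"<span>some text</span>\"; var name = 'plain';";
        for (String fragment : markupLiterals(scriptData)) {
            System.out.println(fragment);
        }
    }
}
```

Each returned fragment could then be passed to Jsoup.parseBodyFragment(fragment).text() to get its visible text, the same way the rest of the page is handled.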

1

I am not entirely sure about this answer, but I have seen a similar situation before here.

Following that answer, you can probably combine Jsoup with some manual parsing to get the text.

I modified that code for your specific case:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = ...
Element script = doc.select("script").first(); // first <script> element (null if the page has none)

Pattern p = Pattern.compile("(?is)html = \"(.+?)\""); // regex for the value assigned to html
Matcher m = p.matcher(script.html()); // use html() here, NOT text(): text() omits script contents

while (m.find()) {
    System.out.println(m.group());  // the whole match, including the html = prefix
    System.out.println(m.group(1)); // the quoted value only
}
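
The pattern above is tied to a variable named html, which the question says cannot be assumed. As a rough sketch of one way to loosen it, the variable name can be captured instead of hard-coded. The class AnyVariablePattern and the method stringAssignments are hypothetical names. The pattern assumes double-quoted values declared with var, let, or const, and it does not cope with escaped quotes inside the value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class AnyVariablePattern {

    // var/let/const <name> = "<value>": group 1 is the name, group 2 the value.
    private static final Pattern ASSIGNMENT = Pattern.compile(
            "(?is)\\b(?:var|let|const)\\s+([\\w$]+)\\s*=\\s*\"(.*?)\"");

    // Maps each variable name to its double-quoted string value, in source order.
    // A name that is declared twice keeps the last value seen.
    static Map<String, String> stringAssignments(String scriptSource) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ASSIGNMENT.matcher(scriptSource);
        while (m.find()) {
            found.put(m.group(1), m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by script.html().
        String src = "var html = \"<span>some text</span>\";\nvar other = \"<b>more</b>\";";
        stringAssignments(src).forEach(
                (name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Values that do not contain markup would still need to be filtered out afterwards, since this pattern matches any string assignment.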

Hope it will be helpful.
