How to parse HTML from JavaScript variables with Jsoup in Java?

Question

I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments sit inside JavaScript variables, and Jsoup ignores them because script contents are not treated as visible text. What would be the best way to get those fragments out?

Example:

<!DOCTYPE html>
<html>
<head>
    <script>
        var html = "<span>some text</span>";
    </script>
</head>
<body>
    <p>text</p>
</body>
</html>

In this example Jsoup only picks up the text from the p tag, which is what it's supposed to do. How do I pick up the text from the span inside var html? The solution must work across thousands of different pages, so I can't rely on something like the JavaScript variable always having the same name.

At least, are you sure that the HTML content is within double quotes and that no other content exists in double quotes within the <script> tag?
– Niranjan, Commented Jul 29, 2013 at 21:42

Daniel B · Accepted Answer · 2015-06-12 07:54:43Z

6

You can use Jsoup to read the contents of every <script> tag as DataNode objects.

DataNode

A data node, for contents of style, script tags etc, where contents should not show in text().

Elements scriptTags = doc.getElementsByTag("script");

This gives you every <script> element in the document.

You can then call getWholeData() on each DataNode to get the script's source text.

// Get the data contents of this node.
String getWholeData()

for (Element tag : scriptTags) {
    // Each <script> element holds its source as DataNode children.
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}

Jsoup API - DataNode

edited Jun 12, 2015 at 7:54

answered Jul 29, 2013 at 11:42

Daniel B


Community · Accepted Answer · 2017-05-23 12:06:49Z

1

I'm not sure this is the best approach, but I have seen a similar situation in another answer.

Following that answer, you can probably combine Jsoup with manual parsing to get the text.

I have adapted its code to your specific case:

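import java.util.regex.Matcher;
import java.util.regex.Pattern;

Document doc = ...
Element script = doc.select("script").first(); // First <script> element only

// Captures the quoted value assigned to a variable named "html".
// (?is): case-insensitive, and '.' also matches line breaks.
Pattern p = Pattern.compile("(?is)html = \"(.+?)\"");
// Use html() here, NOT text(): text() does not include script contents.
Matcher m = p.matcher(script.html());

while (m.find()) {
    System.out.println(m.group());  // the whole match, e.g. html = "<span>some text</span>"
    System.out.println(m.group(1)); // the value only, e.g. <span>some text</span>
}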
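
Since the question rules out relying on a fixed variable name, the regex above can be generalized to match any quoted string literal and keep only the ones that contain something tag-like. This is a minimal sketch under stated assumptions, not a JavaScript parser: ScriptHtmlLiterals and extract are hypothetical names, only single- and double-quoted literals that fit on one line are handled (no template literals, no string concatenation), escapes are undone by simply dropping the backslash, and quotes inside comments or regex literals can confuse it. It uses only java.util.regex, so the input is whatever script source you have already pulled out with Jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Jsoup): finds HTML-looking string literals in script source. */
final class ScriptHtmlLiterals {

    // Builds a regex for one quoted literal whose body is captured in a group.
    // Inside the body, a backslash escapes the character that follows it.
    private static String quoted(char q) {
        // Character class: anything except the quote, a backslash, CR or LF.
        String plain = "[^" + q + "\\\\\\r\\n]*";
        return q + "(" + plain + "(?:\\\\." + plain + ")*)" + q;
    }

    // Group 1 is the body of a double-quoted literal, group 2 of a single-quoted one.
    private static final Pattern STRING_LITERAL =
            Pattern.compile(quoted('"') + "|" + quoted('\''));

    // Crude heuristic for "contains markup": an opening or closing tag such as <span> or </p>.
    private static final Pattern LOOKS_LIKE_TAG = Pattern.compile("</?[a-zA-Z][^<>]*>");

    /** Returns the bodies of all string literals in scriptSource that contain a tag. */
    static List<String> extract(String scriptSource) {
        List<String> fragments = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            String raw = m.group(1) != null ? m.group(1) : m.group(2);
            // Simplification: drop the backslash of every escape (\" becomes ", \/ becomes /).
            String body = raw.replaceAll("\\\\(.)", "$1");
            if (LOOKS_LIKE_TAG.matcher(body).find()) {
                fragments.add(body);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String script = "var html = \"<span>some text</span>\";\nvar other = 'plain';";
        System.out.println(extract(script)); // prints [<span>some text</span>]
    }
}
```

Each returned fragment can then be handed back to Jsoup, for example with Jsoup.parseBodyFragment(fragment).text(), to get its visible text.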

Hope it will be helpful.

edited May 23, 2017 at 12:06

Community Bot

answered Nov 2, 2013 at 4:16

KK4SBB
