I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments are stored in JavaScript variables, and Jsoup ignores them because they are script data rather than part of the document tree. What would be the best way to get the text out of those fragments?
Example:
<!DOCTYPE html>
<html>
<head>
<script>
var html = "<span>some text</span>";
</script>
</head>
<body>
<p>text</p>
</body>
</html>
In this example, Jsoup only picks up the text from the p tag, which is what it's supposed to do. How do I also pick up the text from the span inside var html? The solution must work across thousands of different pages, so I can't rely on something like the JavaScript variable always having the same name.
Is the HTML content always within double quotes, and is there no other double-quoted content within the <script> tag?
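One possible approach, offered as a rough sketch rather than a definitive answer: since the variable name can't be relied on, scan the raw source of every script block for quoted string literals and keep the ones that look like markup. The class name `ScriptLiteralExtractor`, the method `extractHtmlLiterals`, and the "contains `<` followed by a letter" heuristic are all assumptions made for illustration. The helper below uses only the Java standard library. With Jsoup, the script source would come from `element.data()` on each element returned by `doc.select("script")`, and each returned literal could then be passed to `Jsoup.parse(literal).text()` to get its visible text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds string literals in JavaScript source that look like HTML.
class ScriptLiteralExtractor {
    // Matches a double- or single-quoted string literal, allowing backslash escapes inside it.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"((?:\\\\.|[^\"\\\\])*)\"|'((?:\\\\.|[^'\\\\])*)'");

    // Heuristic (an assumption): a literal looks like HTML if it contains '<' followed by a letter.
    private static final Pattern LOOKS_LIKE_HTML = Pattern.compile("<[a-zA-Z]");

    static List<String> extractHtmlLiterals(String scriptSource) {
        List<String> result = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            // Group 1 is the body of a double-quoted literal, group 2 of a single-quoted one.
            String literal = m.group(1) != null ? m.group(1) : m.group(2);
            // Undo the most common escapes in embedded markup: \" \' and \/
            literal = literal.replaceAll("\\\\([\"'/])", "$1");
            if (LOOKS_LIKE_HTML.matcher(literal).find()) {
                result.add(literal);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In real use this string would be the result of Element.data() on a <script> element.
        String script = "var html = \"<span>some text</span>\";";
        for (String literal : extractHtmlLiterals(script)) {
            // With Jsoup on the classpath: Jsoup.parse(literal).text()
            System.out.println(literal);
        }
    }
}
```

This handles both quote styles, so it doesn't matter whether the markup is in double or single quotes. It does not handle markup built by string concatenation, template literals, or quotes that appear inside JavaScript comments or regex literals. For pages that build their HTML that way, a real JavaScript parser or a headless browser would probably be more reliable than a regex.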