So I'm trying to fetch specific data from a site that has a deeply nested <script> tag.

Using import json, in the hope of making things a bit easier, results in the well-known Expecting value: line 1 column 1 (char 0) error. So I tried the following approach [1], with zero success.

In essence, the relatively simple steps of connecting to the site and locating the specific <script> tag are no problem. Getting the data I need out of it is where it gets problematic.
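For context, here is a minimal sketch of why json fails on this kind of content. The script body is JavaScript, not JSON: it starts with a jQuery call, and the object literal inside it contains function expressions. The helper parse_error and the shortened strings are made up for illustration and are not taken from the real page.

```python
import json


def parse_error(text):
    """Return the JSONDecodeError message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# Shortened stand-in for the script body (the real one is far longer).
script_body = '$(document).ready(function () {createJsonChart({"series":[]});});'
print(parse_error(script_body))

# Even the object literal on its own is not JSON, because of the function value.
object_literal = '{"events":{"click":function(){return false;}}}'
print(parse_error(object_literal) is not None)
```

The first call reproduces the error from the question, because the very first character is already not a valid JSON value. The second shows that cutting out the {...} part would not be enough either.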

Assume the following element:

script_tag = '''
<script id="startup" type="text/javascript">
$(document).ready(function () {createJsonChart({
"series":[{"name":"BNames","color":"#0043de","legendIndex":0,
"stack":null,
"data":[{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""},
{"name":"BNames","color":"#0043de","y":114.6,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},
{"name":"BNames","color":"#0043de","y":108.5,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},
{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,
"fillColor":null,"symbol":null,"radius":4},
"dashStyle":"Solid","lineWidth":2,
"step":"center","zIndex":"2","name":"Mandatory","color":"#f20808",
"legendIndex":0,"stack":1,
"data":[{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,"fillColor":null,
"symbol":null,"radius":4},"dashStyle":"Solid","lineWidth":2,
"step":"center", "zIndex":"2","name":"Preferred","color":"#38d615",
"legendIndex":0,"stack":2,
"data":[{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"}]}],
"resizeElement":null,"credits":{"enabled":false}});$('#__Page').lumnaInit('');});
</script>
'''

In reality this <script> tag is even bigger. It contains 3 parts of data, named here BNames, Mandatory and Preferred. I need the data from BNames, specifically the last entry. So the expected result would come from the part "tooltip":"BNames: 108,50 % <br/> Month: september 2019"} with BNames: 108,50 % in one variable and Month: september 2019 in another.

Answer using regex

import re

# `soup` is the BeautifulSoup object built from the fetched page
url_part = soup.find("script", attrs={'id': 'startup'}).text
info = re.findall(r'\s\w*\s\d*', url_part)[-1]
result = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', url_part)[-1][1]

First, select the HTML tag to search. Second, find every occurrence of whitespace (\s), any number of word characters (\w*), whitespace again (\s), and any number of digits (\d*). This matches text such as september 2019 or august 2019. Last, look for occurrences of BNames: followed by digits, a comma, digits, whitespace, and a percent sign, hence (\d+[,]\d+\s[%]). This matches everything from 80,6 % to 120,05 %.
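To make the two patterns easier to check, here is a small sketch that runs them against a single tooltip line copied from the sample above instead of the live page. The tighter Month pattern at the end is only a suggestion of mine, assuming the month is always one word followed by a four-digit year; it is not part of the original code.

```python
import re

# One tooltip entry copied from the sample script in the question.
sample = '"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'

# Pattern 1: whitespace, word characters, whitespace, digits.
# The match keeps its leading space, so strip it.
info = re.findall(r'\s\w*\s\d*', sample)[-1].strip()

# Pattern 2: findall returns (outer group, inner group) tuples.
full, value = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', sample)[-1]

# Assumed alternative: anchor on the "Month: " label to avoid loose matches.
month = re.findall(r'Month: (\w+ \d{4})', sample)[-1]

print(info)
print(full)
print(value)
print(month)
```

Note that index [1] of each tuple holds only the number and percent sign, while index [0] holds the whole BNames: ... text.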

  • You don't have to go deep into it; just use a regex to search the text inside the script tag. I don't like using a scraping function just to handle a JavaScript tag, and regex is way faster. I already answered a similar question here: javascript-scrape. Commented Oct 30, 2019 at 8:50

1 Answer

Use the following regex to match the Beleidsdekkingsgraad strings. The same idea applies to BNames.

import re, requests

r = requests.get('https://www.pensioenfondstno.nl/overons/dekkingsgraad')
p = re.compile(r'"(Beleidsdekkingsgraad:[\s\S]*?)"', re.MULTILINE)
data = p.findall(r.text)[-1].split(' <br/> ')
print(data[0])
print(data[1])
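The same pattern can be tried offline against the BNames sample from the question. This is only a sketch: script_text is a shortened excerpt I assembled from the sample script, not the real page content.

```python
import re

# Shortened excerpt assembled from the question's sample script.
script_text = (
    '"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},'
    '{"name":"BNames","tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'
    '{"name":"BNames","tooltip":""}'
)

# A quoted string that starts with "BNames:", captured lazily up to the next quote.
pattern = re.compile(r'"(BNames:[\s\S]*?)"')
matches = pattern.findall(script_text)

# Take the last tooltip and split it on the <br/> separator.
value, month = matches[-1].split(' <br/> ')
print(value)
print(month)
```

Because the pattern requires the colon right after BNames, the plain "name":"BNames" entries are skipped, and the empty tooltips never match.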


6 Comments

Well QHarr, it comes very close and works excellently for the sample data. However, in the real scenario it catches more information than necessary. It is really a nagging thing. So, as Linh Nguyen suggested, I looked at regex. The final code is added to my question. Although it is not as neat as yours, it does the trick. If you have any suggestions to improve the code, I'm all ears.
Can you provide the source url?
source url. In the original setup you should see BNames as Beleidsdekkingsgraad and Month as Periode.
And you want just those two values?
There is some beautifulness behind the power of regex. Tested it and works perfectly. As such accepted your answer as the solution.