I am trying to scrape a web page whose data is embedded in JavaScript. After reading some related posts, I managed to write the following:

from bs4 import BeautifulSoup
import requests
website_url = requests.get('https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm').text
soup = BeautifulSoup(website_url, 'lxml')
print(soup.prettify())

and then retrieve the relevant script as follows:

soup.find_all('script')[3]

which gives:

<script type="text/javascript">
            // Initialize script parameters.
            var exportTitle ="Centralised medicinal products for human use by ATC code";

            // Initialise the dataset.
            var dataSet = [
{"id":"A","parent":"#","text":"A - Alimentary tract and metabolism"},
{"id":"A02","parent":"A","text":"A02 - Drugs for acid related disorders"},
{"id":"A02B","parent":"A02","text":"A02B - Drugs for treatment of peptic ulcer"},
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
...
{"id":"h154","parent":"V09IA05","text":"NeoSpect (withdrawn)","type":"pl"},
{"id":"V09IA09","parent":"V09IA","text":"V09IA09 - technetium (<sup>99m</sup>Tc) tilmanocept"},
{"id":"h955","parent":"V09IA09","text":"Lymphoseek (active)","type":"pl"},
{"id":"V09IB","parent":"V09I","text":"V09IB - Indium (<sup>111</sup>In) compounds"},
{"id":"V09IB03","parent":"V09IB","text":"V09IB03 - indium (<sup>111</sup>In) antiovariumcarcinoma antibody"},
{"id":"h025","parent":"V09IB03","text":"Indimacis 125 (withdrawn)","type":"pl"},
...
]; </script>

The problem I am facing is getting the text of soup.find_all('script')[3] so that I can recover the JSON data from it. When I use .text on it, the result is an empty string: ''.

So my question is: why is that? Ideally I would like to end up with:

A02BC01 Losec and associated names (referral)
...
V09IA05 NeoSpect (withdrawn)
V09IA09 Lymphoseek
V09IB03 Indimacis 125 (withdrawn)
...

1 Answer

First, get the script's contents with .string. Then do some string processing: take everything after 'dataSet = ' and cut it at the next ';' so that only the JSON array remains. Finally, parse the array and loop over its objects to print the data.

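import json

# .string returns the raw contents of the <script> tag.
data = soup.find_all("script")[3].string
# Keep what follows 'dataSet = ' up to the next ';' (assumes no ';' inside the data).
dataJson = data.split('dataSet = ')[1].split(';')[0]
jsonArray = json.loads(dataJson)
for jsonElement in jsonArray:
    print(jsonElement['parent'], end=' ')
    print(jsonElement['text'])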
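
A slightly more defensive variant, as a sketch only: splitting on ';' would truncate the data if a text field ever contained a semicolon, so this version lets json.JSONDecoder.raw_decode find the end of the array instead. The script_text sample stands in for soup.find_all("script")[3].string, extract_dataset is a hypothetical helper name, and the filter on "type": "pl" is an assumption, based on the sample data in the question, about which entries are the products.

```python
import json

# Hypothetical stand-in for soup.find_all("script")[3].string,
# shortened from the sample shown in the question.
script_text = '''
            // Initialize script parameters.
            var exportTitle ="Centralised medicinal products for human use by ATC code";

            // Initialise the dataset.
            var dataSet = [
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
{"id":"h955","parent":"V09IA09","text":"Lymphoseek (active)","type":"pl"}
]; '''


def extract_dataset(js_source, marker="dataSet = "):
    """Parse the JSON array that follows `marker` in a JavaScript source string."""
    # Locate the opening bracket of the array after the marker.
    start = js_source.index("[", js_source.index(marker))
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever comes after it (here the trailing ';').
    records, _end = json.JSONDecoder().raw_decode(js_source, start)
    return records


records = extract_dataset(script_text)

# Assumption: entries with "type": "pl" are the products; the rest are ATC levels.
products = [(r["parent"], r["text"]) for r in records if r.get("type") == "pl"]

for parent, text in products:
    print(parent, text)
```

With the sample above, this prints only the two product lines, each prefixed by its parent ATC code, which is close to the output asked for in the question.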

1 Comment

Thank you, this works perfectly! I just missed the fact that I should use .string rather than .text. The latter does not work, but I cannot figure out why.
