
I am very new to Python and am trying to scrape a table from a website. The table data appears to come from a JavaScript variable that is populated with JSON.parse. However, the argument passed to JSON.parse is in a form I am not used to, and I am unsure how to use it in Python.

The code is from this website. Specifically, it is var playersData = JSON.parse('\x5B\x7B\x22id\x3A,... (roughly 250,000 characters), nestled in a script tag.

So far I have managed to fetch the page, find the specific script with bs4, and use re.search to try to isolate the JSON.parse call. The search returns <re.Match object; span=(2, 259126), match="var playersData\t= JSON.parse('\\x5B\\x7B\\x22id\>.

Once the JSON is loaded, I would like to export the data somewhere else.

Here is my code so far:

import requests
from bs4 import BeautifulSoup
import json
import re

response = requests.get('https://understat.com/league/EPL/2018')
soup = BeautifulSoup(response.text, 'lxml')

playerscript = soup.find_all('script')[3].string
# \s* tolerates the tab or spaces around "=" in the page source
m = re.search(r"var playersData\s*=\s*(.*)", playerscript)

Thanks for any help.

  • Did you have a question? Commented Nov 7, 2018 at 17:06
  • Yes, mainly: how do I use the JavaScript variable with the JSON.parse in Python to get the table data from the website? Commented Nov 7, 2018 at 17:29

1 Answer


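You don't need BeautifulSoup for this. Python's json.loads does the same job as JavaScript's JSON.parse, but you first need to convert the \xNN escape sequences in the string into real characters: use .decode('string_escape') in Python 2, or bytes('....', 'utf-8').decode('unicode_escape') in Python 3.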
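To see what the conversion does without fetching the page, here is a minimal sketch on a short made-up string written in the same \xNN style. The sample value and the name escaped are assumptions for illustration only, not real data from the site.

```python
import json

# Hypothetical sample in the same style as the page's playersData literal.
# The raw string keeps the backslashes literal, as they appear in response.text.
escaped = r'\x5B\x7B\x22id\x22\x3A\x221\x22,\x22player_name\x22\x3A\x22Example Player\x22\x7D\x5D'

# Turn each \xNN sequence into the character it stands for (Python 3).
decoded = escaped.encode('utf-8').decode('unicode_escape')
print(decoded)  # [{"id":"1","player_name":"Example Player"}]

# The decoded text is ordinary JSON, so json.loads can parse it.
players = json.loads(decoded)
print(players[0]['player_name'])  # Example Player
```

The decode step only rewrites the escape sequences; the actual parsing into Python lists and dicts is done by json.loads.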

import requests
import json
import re

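# Fetch the page; the data sits in an inline script, so no HTML parsing is needed
response = requests.get('https://understat.com/league/EPL/2018')
# Capture the escaped string literal passed to JSON.parse
playersData = re.search(r"playersData\s+=\s+JSON\.parse\('([^']+)", response.text)
# Python 2.7:
# decoded_string = playersData.group(1).decode('string_escape')
# Python 3: turn the \xNN escapes into the characters they represent
decoded_string = bytes(playersData.group(1), 'utf-8').decode('unicode_escape')
playerObj = json.loads(decoded_string)

print(playerObj[0]['player_name'])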
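For the export step mentioned in the question, one possible approach is to write the parsed list of dicts to CSV with the standard csv module. This is only a sketch: the helper name write_players_csv is hypothetical, the sample records stand in for playerObj, and the goals field is an assumed column, since the real field names are not shown here apart from id and player_name.

```python
import csv
import io

def write_players_csv(players, out_file):
    """Write a list of flat dicts to out_file as CSV, one row per record."""
    # Use the keys of the first record as the column order.
    fieldnames = list(players[0].keys())
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(players)

# Hypothetical records standing in for playerObj.
sample = [
    {'id': '1', 'player_name': 'Example Player', 'goals': '3'},
    {'id': '2', 'player_name': 'Another Player', 'goals': '1'},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_players_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass a handle opened with open('players.csv', 'w', newline='', encoding='utf-8'); the newline='' argument is what the csv module documentation recommends for files it writes.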

1 Comment

I have tried using this code, but I receive the error: AttributeError: 'str' object has no attribute 'decode'. If I remove the .decode part, I get this error instead: raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
