Scraping Javascript variables into Python

Question

I want to scrape the following data from http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/:

  var hoodFeatures = {
            type: "FeatureCollection",
            features: [{
                type: "Feature",
                properties: {
                    name: "Koreatown",
                    slug: "koreatown",
                    url: "/neighborhoods/neighborhood/koreatown/",
                    has_statistics: true,
                    label: 'Rank: 1<br>Population per Sqmi: 42,611',
                    population: "115,070",
                    stratum: "high"
                },
                geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }
            },

From the above html, I want to take each of:

name
population per sqmi
population
geometry

and turn it into a data frame by name

So far I've tried

import requests
import json
from bs4 import BeautifulSoup

response_obj = requests.get('http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/').text
soup = BeautifulSoup(response_obj,'lxml')

The object has the script info, but I don't understand how to use the json module as advised in this thread: Parsing variable data out of a javascript tag using python

json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
value = json.loads(json_text)
value

I get this error

TypeError                                 Traceback (most recent call last)
<ipython-input-12-37c4c0188ed0> in <module>
      1 #Splits the text on the first bracket and last bracket of the javascript into JSON format
----> 2 json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
      3 value = json.loads(json_text)
      4 value
      5 #import pprint

TypeError: 'NoneType' object is not callable

Any suggestions? Thanks

soup is not a string, so it may treat partition as a tag name (<partition>), which does not exist, and you get None. You would have to work with soup.text, which is a string. You could also find the <script> tag and work only with the text that may contain the JavaScript code: code = soup.find('script').text
– furas, Commented Jul 1, 2019 at 12:29
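
The comment's point is that partition only behaves as expected on a plain string. A minimal sketch of that idea, using only the standard library: the html value is a hypothetical stand-in for the page source, and find_script_with is a helper name invented for this illustration, not part of any library.

```python
import re

# Hypothetical stand-in for the page source (requests.get(...).text in the question).
html = """<html><head><script>var other = 1;</script></head>
<body><script>
var hoodFeatures = {
    type: "FeatureCollection",
    features: []
};
</script></body></html>"""


def find_script_with(html_text, marker):
    """Return the body of the first <script> element containing marker, or None."""
    pattern = r"<script\b[^>]*>(.*?)</script>"
    for body in re.findall(pattern, html_text, flags=re.DOTALL | re.IGNORECASE):
        if marker in body:
            return body
    return None


script_text = find_script_with(html, "var hoodFeatures")

# str.partition works here because script_text is a plain string, not a soup object.
object_literal = "{" + script_text.partition("{")[2].rpartition("}")[0] + "}"
print(object_literal)
```

The result is still a JavaScript object literal rather than JSON, so it cannot be passed to json.loads as-is.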

Emma Marcier · Accepted Answer · 2019-07-01 05:32:10Z

I'm not sure how to do that with Beautiful Soup, but another option is to design a regular expression that extracts the desired values:

(?:name|population per sqmi|population)\s*:\s*"?(.*?)\s*["']|(?:geometry)\s*:\s*({.*})


Test

import re

regex = r"(?:name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']|(?:geometry)\s*:\s*({.*})"

test_str = ("var hoodFeatures = {\n"
    "            type: \"FeatureCollection\",\n"
    "            features: [{\n"
    "                type: \"Feature\",\n"
    "                properties: {\n"
    "                    name: \"Koreatown\",\n"
    "                    slug: \"koreatown\",\n"
    "                    url: \"/neighborhoods/neighborhood/koreatown/\",\n"
    "                    has_statistics: true,\n"
    "                    label: 'Rank: 1<br>Population per Sqmi: 42,611',\n"
    "                    population: \"115,070\",\n"
    "                    stratum: \"high\"\n"
    "                },\n"
    "                geometry: { \"type\": \"MultiPolygon\", \"coordinates\": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }\n"
    "            },\n")

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Group numbers start at 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))
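
The matches above come back as a flat sequence, so it is not obvious which value belongs to which field. One possible way to group them into one record per neighborhood is sketched below. It assumes a variation of the expression in which the field name is captured as well (this is not the original pattern), that each geometry value sits on a single line, and that geometry is the last field of every feature. The sample text and the extract_records helper are hypothetical.

```python
import json
import re

# Variation on the expression above: the field name is captured too.
# This is an assumption of the sketch, not the original pattern.
FIELD_RE = re.compile(
    r"(name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']"
    r"|(geometry)\s*:\s*({.*})",
    re.IGNORECASE,
)

# Hypothetical, shortened sample in the same shape as the page's script.
sample = '''
                properties: {
                    name: "Koreatown",
                    label: 'Rank: 1<br>Population per Sqmi: 42,611',
                    population: "115,070",
                },
                geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ] ] ] ] }
'''


def extract_records(text):
    """Group successive matches into one dict per feature."""
    records, current = [], {}
    for m in FIELD_RE.finditer(text):
        if m.group(3):
            # The geometry value already has quoted keys, so it parses as JSON.
            current["geometry"] = json.loads(m.group(4))
            records.append(current)
            current = {}
        else:
            current[m.group(1).lower()] = m.group(2)
    return records


records = extract_records(sample)
print(records)
```

Like any regex-based extraction, this depends heavily on the exact layout of the page and would need checking against the real source.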

abdusco · Accepted Answer · 2019-07-01 05:49:58Z

You can't use json.loads directly because the hoodFeatures object is not valid JSON. In valid JSON, every key is surrounded by double quotes (").

You could try adding quotes around keys manually (using regular expressions).
Another option is using Selenium to execute that JS and get the JSON.stringify output of it.
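
The difference between the two notations is easy to confirm with a small check. The two strings below are made-up examples, not taken from the page.

```python
import json

js_style = '{name: "Koreatown"}'      # unquoted key: valid JavaScript, invalid JSON
json_style = '{"name": "Koreatown"}'  # quoted key: valid JSON

try:
    json.loads(js_style)
    js_style_ok = True
except json.JSONDecodeError as exc:
    js_style_ok = False
    print("rejected:", exc)

parsed = json.loads(json_style)
print(parsed["name"])
```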

Answer using manual cleanup:

This one cleans up the JS code and turns it into JSON that can be parsed properly. Keep in mind that it is by no means robust and may break on unexpected input.

import json
import re

js = '''
 var hoodFeatures = {
            type: "FeatureCollection",
            features: [
            {
                type: "Feature",
                properties: {
                    name: "Beverlywood",
                    slug: "beverlywood",
                    url: "/neighborhoods/neighborhood/beverlywood/",
                    has_statistics: true,
                    label: 'Rank: 131<br>Population per Sqmi: 7,654',
                    population: "6,080",
                    stratum: "middle"
                },
                geometry: {  }
            }]
        }
'''

if __name__ == '__main__':
    # Strip everything outside the outermost braces, then put the braces back.
    unprefixed = js.split('{', maxsplit=1)[1]
    unsuffixed = unprefixed.rsplit('}', maxsplit=1)[0]
    # JSON only allows double-quoted strings.
    quotes_replaced = unsuffixed.replace('\'', '"')
    rebraced = f'{{{quotes_replaced}}}'
    keys_quoted = []
    for line in rebraced.splitlines():
        # Quote the bare key at the start of each line (raw strings avoid invalid-escape warnings).
        line = re.sub(r'^\s+([^:]+):', r'"\1":', line)
        keys_quoted.append(line)
    json_raw = '\n'.join(keys_quoted)
    # print(json_raw)
    parsed = json.loads(json_raw)
    for feat in parsed['features']:
        props = feat['properties']
        name, pop = props['name'], int(props['population'].replace(',', ''))
        geo = feat['geometry']
        # The density only appears inside the label text.
        pop_per_sqm = re.findall(r'per Sqmi: ([\d,]+)', props['label'])[0].replace(',', '')
        pop_per_sqm = int(pop_per_sqm)

        print(name, pop, pop_per_sqm, geo)
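
The question asks for a data frame keyed by name, while the loop above only prints the values. A minimal sketch of collecting them into rows instead: the parsed dict here is a hypothetical sample in the same shape as the one produced by json.loads, and to_int and feature_rows are helper names invented for the illustration.

```python
import re

# Hypothetical sample shaped like the parsed dict produced by json.loads above.
parsed = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {
                "name": "Koreatown",
                "label": "Rank: 1<br>Population per Sqmi: 42,611",
                "population": "115,070",
            },
            "geometry": {"type": "MultiPolygon", "coordinates": []},
        }
    ],
}


def to_int(text):
    """Convert a number written with thousands separators, e.g. '115,070'."""
    return int(text.replace(",", ""))


def feature_rows(collection):
    """Build one flat dict per feature, ready to be turned into a table."""
    rows = []
    for feat in collection["features"]:
        props = feat["properties"]
        density = re.search(r"per Sqmi: ([\d,]+)", props["label"])
        rows.append({
            "name": props["name"],
            "population": to_int(props["population"]),
            "population_per_sqmi": to_int(density.group(1)) if density else None,
            "geometry": feat["geometry"],
        })
    return rows


rows = feature_rows(parsed)
print(rows)
```

If pandas is installed, pandas.DataFrame(rows).set_index("name") should then give a frame indexed by neighborhood name.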
