0

I want to scrape the following data from http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/:

  var hoodFeatures = {
            type: "FeatureCollection",
            features: [{
                type: "Feature",
                properties: {
                    name: "Koreatown",
                    slug: "koreatown",
                    url: "/neighborhoods/neighborhood/koreatown/",
                    has_statistics: true,
                    label: 'Rank: 1<br>Population per Sqmi: 42,611',
                    population: "115,070",
                    stratum: "high"
                },
                geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }
            },

From the above JavaScript (embedded in the page's HTML), I want to take each of:

name
population per sqmi
population
geometry

and turn them into a data frame indexed by name.

So far I've tried

import requests
import json
from bs4 import BeautifulSoup

response_obj = requests.get('http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/').text
soup = BeautifulSoup(response_obj,'lxml')

The object has the script info, but I don't understand how to use the json module as advised in this thread: Parsing variable data out of a javascript tag using python

json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
value = json.loads(json_text)
value

I get this error

TypeError                                 Traceback (most recent call last)
<ipython-input-12-37c4c0188ed0> in <module>
      1 #Splits the text on the first bracket and last bracket of the javascript into JSON format
----> 2 json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
      3 value = json.loads(json_text)
      4 value
      5 #import pprint

TypeError: 'NoneType' object is not callable

Any suggestions? Thanks

1
  • soup is not a string, so it may treat partition as a tag name (<partition>), which does not exist, and you get None. You would have to work with soup.text, which is a string. You could also find the <script> tag to work only with the text that may contain JavaScript code: code = soup.find('script').text Commented Jul 1, 2019 at 12:29
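To illustrate the point of the comment, here is a minimal sketch that uses only the standard library instead of BeautifulSoup (an assumption made so the example is self-contained). It collects the text of every <script> tag and then applies partition/rpartition to a real string. The ScriptCollector class and the sample HTML are hypothetical stand-ins, not part of the original page.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: gathers the text inside every <script> tag."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


# Made-up stand-in for the downloaded page.
sample_html = """<html><body>
<script>var hoodFeatures = { type: "FeatureCollection", features: [] };</script>
</body></html>"""

collector = ScriptCollector()
collector.feed(sample_html)
collector.close()

# Pick the script that defines hoodFeatures. It is a plain str, so
# partition/rpartition split on the first "{" and the last "}".
code = next(s for s in collector.scripts if "hoodFeatures" in s)
object_text = "{%s}" % code.partition("{")[2].rpartition("}")[0]
print(object_text)
```

The extracted text is still a JavaScript object literal, so passing it to json.loads would fail until the keys are quoted.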

2 Answers

0

I'm not sure how to do that with Beautiful Soup, but another option is to design a regular expression that extracts the desired values:

(?:name|population per sqmi|population)\s*:\s*"?(.*?)\s*["']|(?:geometry)\s*:\s*({.*})

Test

import re

regex = r"(?:name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']|(?:geometry)\s*:\s*({.*})"

test_str = ("var hoodFeatures = {\n"
    "            type: \"FeatureCollection\",\n"
    "            features: [{\n"
    "                type: \"Feature\",\n"
    "                properties: {\n"
    "                    name: \"Koreatown\",\n"
    "                    slug: \"koreatown\",\n"
    "                    url: \"/neighborhoods/neighborhood/koreatown/\",\n"
    "                    has_statistics: true,\n"
    "                    label: 'Rank: 1<br>Population per Sqmi: 42,611',\n"
    "                    population: \"115,070\",\n"
    "                    stratum: \"high\"\n"
    "                },\n"
    "                geometry: { \"type\": \"MultiPolygon\", \"coordinates\": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }\n"
    "            },\n")

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for match_num, match in enumerate(matches, start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # Capture groups are numbered from 1; a group that did not take part in the match prints as None.
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")
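The loop above prints the matches as a flat stream. One way to group them into a single record per neighborhood is a pattern with named groups, sketched here under the assumption that every feature lists name, label, population, and geometry in that order and that the geometry object ends its line (as in the sample). The pattern feature_re and the shortened sample string are illustrative and are not taken from the answer.

```python
import json
import re

# Shortened copy of one feature from the question (fewer coordinates).
sample = '''
properties: {
    name: "Koreatown",
    slug: "koreatown",
    label: 'Rank: 1<br>Population per Sqmi: 42,611',
    population: "115,070",
    stratum: "high"
},
geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ] ] ] ] }
'''

# Case-sensitive on purpose: "population:" (the key) must not match
# "Population per Sqmi:" (the text inside the label).
feature_re = re.compile(
    r'name:\s*"(?P<name>[^"]*)"'
    r'.*?Population per Sqmi:\s*(?P<density>[\d,]+)'
    r'.*?population:\s*"(?P<population>[\d,]+)"'
    r'.*?geometry:\s*(?P<geometry>\{.*?\})\s*$',
    re.DOTALL | re.MULTILINE,
)

records = []
for m in feature_re.finditer(sample):
    records.append({
        "name": m.group("name"),
        "density": int(m.group("density").replace(",", "")),
        "population": int(m.group("population").replace(",", "")),
        # The geometry value already uses quoted keys, so it parses as JSON.
        "geometry": json.loads(m.group("geometry")),
    })

for record in records:
    print(record["name"], record["density"], record["population"])
```

Like any regex over source code, this is fragile: a reordered key or a geometry object spread over several lines would need a different pattern.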

0

You can't use json.loads directly because the hoodFeatures object is a JavaScript object literal, not valid JSON. In valid JSON, every key must be enclosed in double quotes ("), and strings cannot be single-quoted.

You could try adding quotes around keys manually (using regular expressions).
Another option is using Selenium to execute that JS and get the JSON.stringify output of it.

Answer using manual cleanup:

This one cleans up the JS code and turns it into JSON that can be parsed properly. Keep in mind, however, that it is by no means robust and may break on any unexpected input.

import json
import re

js = '''
 var hoodFeatures = {
            type: "FeatureCollection",
            features: [
            {
                type: "Feature",
                properties: {
                    name: "Beverlywood",
                    slug: "beverlywood",
                    url: "/neighborhoods/neighborhood/beverlywood/",
                    has_statistics: true,
                    label: 'Rank: 131<br>Population per Sqmi: 7,654',
                    population: "6,080",
                    stratum: "middle"
                },
                geometry: {  }
            }]
        }
'''

if __name__ == '__main__':
    # Drop everything outside the outermost braces ("var hoodFeatures =" etc.).
    unprefixed = js.split('{', maxsplit=1)[1]
    unsuffixed = unprefixed.rsplit('}', maxsplit=1)[0]
    # JSON strings must use double quotes, so convert the single-quoted label.
    quotes_replaced = unsuffixed.replace('\'', '"')
    rebraced = f'{{{quotes_replaced}}}'
    keys_quoted = []
    for line in rebraced.splitlines():
        # Wrap the bare key at the start of each line in double quotes.
        line = re.sub(r'^\s+([^:]+):', r'"\1":', line)
        keys_quoted.append(line)
    json_raw = '\n'.join(keys_quoted)
    # print(json_raw)
    parsed = json.loads(json_raw)
    for feat in parsed['features']:
        props = feat['properties']
        name, pop = props['name'], int(props['population'].replace(',', ''))
        geo = feat['geometry']
        # The density only appears inside the label text, e.g. "Population per Sqmi: 7,654".
        pop_per_sqmi = re.findall(r'per Sqmi: ([\d,]+)', props['label'])[0].replace(',', '')
        pop_per_sqmi = int(pop_per_sqmi)

        print(name, pop, pop_per_sqmi, geo)
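The question asks for a data frame keyed by name, so the loop above could collect rows instead of printing them. This is a minimal sketch using only the standard library. The features list is a hand-written stand-in for parsed['features'], with values copied from the samples in this thread and empty geometry for brevity. The names rows and to_int are hypothetical.

```python
import re

# Stand-in for parsed['features'] from the cleanup step above.
features = [
    {"properties": {"name": "Koreatown", "population": "115,070",
                    "label": "Rank: 1<br>Population per Sqmi: 42,611"},
     "geometry": {}},
    {"properties": {"name": "Beverlywood", "population": "6,080",
                    "label": "Rank: 131<br>Population per Sqmi: 7,654"},
     "geometry": {}},
]


def to_int(text):
    """Convert a string such as '115,070' to the integer 115070."""
    return int(text.replace(",", ""))


# One row per neighborhood, keyed by its name.
rows = {}
for feat in features:
    props = feat["properties"]
    density = re.search(r"per Sqmi: ([\d,]+)", props["label"]).group(1)
    rows[props["name"]] = {
        "population_per_sqmi": to_int(density),
        "population": to_int(props["population"]),
        "geometry": feat["geometry"],
    }

for name, row in rows.items():
    print(name, row["population_per_sqmi"], row["population"])
```

If pandas is installed, pandas.DataFrame.from_dict(rows, orient='index') should turn this mapping into a data frame indexed by neighborhood name.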
