Parsing a JSON element

Question

I am using Scrapy and a regex to parse some non-standard web source code. I then wish to parse the first element of the dictionary returned:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        sel = Selector(response)
        titles = sel.xpath("normalize-space(//title)")
        print '-' * 170
        myheader = titles.extract()[0]
        print '********** Page Title:', myheader.encode('utf-8'), '**********'
        print '-' * 170

        match1 = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                     + '(\[.*\])' + re.escape(");"), response.body)


        if match1 is not None:
            playerdata1 = match1.group(1)

            teamid = json.loads(playerdata1[0])
            print teamid

The key for the first element of 'playerdata1' is called 'TeamId'. I assumed the above method would work; however, I am getting the following error:

    teamid = json.loads(playerdata1[0])
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
exceptions.ValueError: Expecting object: line 1 column 1 (char 0)

Can anyone see what the issue is here?

Are you expecting match1.group(1) to be a JSON string? Try teamid = json.loads(playerdata1)[0] instead.
– shaktimaan, Commented Sep 6, 2014 at 19:05
It would help if you could at least give us a sample URL to test against, one with the DataStore.prime text in it.
– Martijn Pieters, Commented Sep 6, 2014 at 19:17
@MartijnPieters Yes, no problem. Here is a link: view-source:whoscored.com/Teams/32/… In this example I want the value of the variable 'teamid' to equal '32', which is the ID for the team on this page. Thanks.
– gdogg371, Commented Sep 6, 2014 at 19:19

Martijn Pieters · Accepted Answer · 2014-09-06 19:21:09Z

2

match1.group(1) returns one string. You then index that string:

teamid = json.loads(playerdata1[0])

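Here, [0] will give you just the first character of that string. Because the regex captures text starting at the opening bracket, that character is "[", which is not a complete JSON document on its own, hence the ValueError. Remove the indexing expression there to use the whole string: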

teamid = json.loads(playerdata1)
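To make the difference concrete, here is a minimal sketch using a made-up, shortened stand-in for the captured string (the real value of match1.group(1) is far longer, and the names first_char, error and players are only illustrative):

```python
import json

# Hypothetical stand-in for match1.group(1); the real string is much longer.
playerdata1 = '[{"TeamId": 32, "Name": "Example Player"}]'

# Indexing a string returns a single character, here the opening bracket.
first_char = playerdata1[0]

# Decoding that lone character fails, as in the traceback in the question.
try:
    json.loads(first_char)
    error = None
except ValueError as exc:  # on Python 3, json.JSONDecodeError subclasses ValueError
    error = exc

# Decoding the whole string succeeds and yields a list of dictionaries.
players = json.loads(playerdata1)
```

With the whole string decoded, the result can be indexed like any other list.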

Now teamid is a list with player objects:

>>> len(teamid)
22
>>> teamid[0].keys()
[u'FirstName', u'LastName', u'KnownName', u'Field', u'GameStarted', u'AerialWon', u'TeamRegionCode', u'SecondYellow', u'ShotsBlocked', u'TotalShots', u'Assists', u'Red', u'Name', u'PositionText', u'Ranking', u'PositionLong', u'PlayerId', u'SubOff', u'Dispossesed', u'TeamId', u'TotalTackles', u'TotalLongBalls', u'Goals', u'SubOn', u'WasDribbled', u'AerialLost', u'Turnovers', u'ShotsOnTarget', u'WSName', u'Fouls', u'ManOfTheMatch', u'Height', u'TeamName', u'RegionCode', u'TotalPasses', u'TotalThroughBalls', u'Dribbles', u'DateOfBirth', u'OwnGoals', u'WasFouled', u'TotalClearances', u'Rating', u'PlayedPositionsRaw', u'Weight', u'AccurateLongBalls', u'OffsidesWon', u'AccuratePasses', u'Yellow', u'KeyPasses', u'TotalCrosses', u'AccurateCrosses', u'IsCurrentPlayer', u'Age', u'PositionShort', u'AccurateThroughBalls', u'Interceptions', u'Offsides']
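Putting the pieces together, the following is a rough end-to-end sketch under stated assumptions: body is a made-up, heavily shortened stand-in for response.body, and the player names in it are invented. Only the DataStore.prime prefix and the regex come from the question.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for response.body.
body = (
    "DataStore.prime('stage-player-stat', "
    "defaultTeamPlayerStatsConfigParams.defaultParams , "
    '[{"TeamId": 32, "Name": "Player A"}, {"TeamId": 32, "Name": "Player B"}]);'
)

# Same pattern as in the question: literal prefix, captured JSON array, literal suffix.
prefix = ("DataStore.prime('stage-player-stat', "
          "defaultTeamPlayerStatsConfigParams.defaultParams , ")
pattern = re.escape(prefix) + r"(\[.*\])" + re.escape(");")

match = re.search(pattern, body)
team_id = None
if match is not None:
    # Decode the whole captured string, then index the resulting list.
    players = json.loads(match.group(1))
    if players:
        team_id = players[0]["TeamId"]

print(team_id)
```

The indexing happens after json.loads, on the decoded list, rather than on the raw string. The check that players is non-empty is an extra guard that the original code does not have.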

edited Sep 6, 2014 at 19:21

answered Sep 6, 2014 at 19:13


4 Comments

gdogg371 Over a year ago

Hi, I'm not sure what you mean by 'Remove the indexing expression there to use the whole string.' Thanks.

Martijn Pieters Over a year ago

@user3045351: Given your URL, the above works; it gives you a list of dictionaries, one per player.

gdogg371 Over a year ago

I'm still confused as to how I can get 'TeamId' to resolve to '32' with the above code in the above example. Thanks.

Martijn Pieters Over a year ago

@user3045351: teamid[0]['TeamId'].

Frederic Bazin · Answer · 2014-09-07 01:54:49Z

0

When I need to query small parts of a complex JSON document, I often use ObjectPath.

It has a query language that looks like CSS selectors. See the examples at http://adriank.github.io/ObjectPath/

answered Sep 7, 2014 at 1:54

