I am using Scrapy and a regex to parse some non-standard web source code. I then wish to parse the first element of the dictionary returned:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        sel = Selector(response)
        titles = sel.xpath("normalize-space(//title)")
        print '-' * 170
        myheader = titles.extract()[0]
        print '********** Page Title:', myheader.encode('utf-8'), '**********'
        print '-' * 170

        match1 = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                     + '(\[.*\])' + re.escape(");"), response.body)


        if match1 is not None:
            playerdata1 = match1.group(1)

            teamid = json.loads(playerdata1[0])
            print teamid

The key for the first element of 'playerdata1' is called 'TeamId'. I assumed the above method would work, but I am getting the following error:

    teamid = json.loads(playerdata1[0])
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
exceptions.ValueError: Expecting object: line 1 column 1 (char 0)

Can anyone see what the issue is here?

  • Are you expecting match1.group(1) to be a JSON string? Try teamid = json.loads(playerdata1)[0] instead. Commented Sep 6, 2014 at 19:05
  • It would help if you could at least give us a sample URL to test against, one with the DataStore.prime text in it. Commented Sep 6, 2014 at 19:17
  • @MartijnPieters Yes, no problem. Here is a link: view-source:whoscored.com/Teams/32/… In this example I want the value of the variable 'teamid' to equal '32', which is the ID of the team on this page. Thanks. Commented Sep 6, 2014 at 19:19

2 Answers

match1.group(1) returns one string. You then index that string:

teamid = json.loads(playerdata1[0])

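Here, [0] gives you just the first character of that string. Because the regex captures text starting at the opening bracket, that character is "[", which is not valid JSON on its own, hence the ValueError. Remove the indexing expression there to use the whole string: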

teamid = json.loads(playerdata1)

Now teamid is a list with player objects:

>>> len(teamid)
22
>>> teamid[0].keys()
[u'FirstName', u'LastName', u'KnownName', u'Field', u'GameStarted', u'AerialWon', u'TeamRegionCode', u'SecondYellow', u'ShotsBlocked', u'TotalShots', u'Assists', u'Red', u'Name', u'PositionText', u'Ranking', u'PositionLong', u'PlayerId', u'SubOff', u'Dispossesed', u'TeamId', u'TotalTackles', u'TotalLongBalls', u'Goals', u'SubOn', u'WasDribbled', u'AerialLost', u'Turnovers', u'ShotsOnTarget', u'WSName', u'Fouls', u'ManOfTheMatch', u'Height', u'TeamName', u'RegionCode', u'TotalPasses', u'TotalThroughBalls', u'Dribbles', u'DateOfBirth', u'OwnGoals', u'WasFouled', u'TotalClearances', u'Rating', u'PlayedPositionsRaw', u'Weight', u'AccurateLongBalls', u'OffsidesWon', u'AccuratePasses', u'Yellow', u'KeyPasses', u'TotalCrosses', u'AccurateCrosses', u'IsCurrentPlayer', u'Age', u'PositionShort', u'AccurateThroughBalls', u'Interceptions', u'Offsides']
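
To tie the steps together, here is a minimal sketch of the extraction and lookup. It assumes a made-up stand-in for response.body: the two player records in SAMPLE_BODY are hypothetical, not real data from the site, and extract_players is an illustrative helper name rather than part of Scrapy. The regex is the same one used in the question.

```python
import json
import re

# Hypothetical stand-in for response.body; the real page embeds a much longer list.
SAMPLE_BODY = (
    "DataStore.prime('stage-player-stat', "
    "defaultTeamPlayerStatsConfigParams.defaultParams , "
    '[{"TeamId": 32, "Name": "Player A"}, {"TeamId": 32, "Name": "Player B"}]);'
)

# Same pattern as in the question: literal prefix, captured JSON list, literal suffix.
PATTERN = (
    re.escape("DataStore.prime('stage-player-stat', "
              "defaultTeamPlayerStatsConfigParams.defaultParams , ")
    + r"(\[.*\])"
    + re.escape(");")
)


def extract_players(body):
    """Return the decoded list of player dicts, or None if the marker is absent."""
    match = re.search(PATTERN, body)
    if match is None:
        return None
    # group(1) is one JSON string; decode all of it, then index the result.
    return json.loads(match.group(1))


raw = re.search(PATTERN, SAMPLE_BODY).group(1)
try:
    json.loads(raw[0])  # raw[0] is just "[", so decoding fails
    first_char_decodes = True
except ValueError:
    first_char_decodes = False

players = extract_players(SAMPLE_BODY)
team_id = players[0]["TeamId"]  # index the decoded list, then look up the key
print(team_id)
```

The key point is the order of operations: decode the full captured string first, then index into the resulting list and dictionary.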
4 Comments

Hi, I'm not sure what you mean by 'Remove the indexing expression there to use the whole string'. Thanks.
@user3045351: given your URL the above works, it gives you a list with dictionaries, each a player.
I'm still confused as to how I can get 'TeamId' to resolve to '32' with the above code in the above example. Thanks.
@user3045351: teamid[0]['TeamId'].

When I need to query small parts of a complex JSON document, I often use ObjectPath.

It has a query language that looks like CSS selectors. Check the examples at http://adriank.github.io/ObjectPath/
