Obtaining correct data using Regex/List

Question

I am parsing the following code using a regex (not ideal I know, but that is a story for another day):

data:{
            url: 'stage-team-stat'
        },
        defaultParams: {
            stageId : 9155,
            field: 2,
            teamId: 32
        }
    };

This is being parsed using the following code (where var is the above code):

import re

stagematch = re.compile(r"data:\s*{\s*url:\s*'stage-team-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},", re.S)

stagematch2 = re.search(stagematch, var)

if stagematch2 is not None:
    stagematch3 = stagematch2.group(1)

    stageid = int(stagematch3.split(':', 1)[1])
    stageid = str(stageid)

    teamid = int(stagematch3.split(':', 3)[1])
    teamid = str(teamid)

    print stageid
    print teamid

In this example I would expect stageid to be '9155' and teamid to be '32'; however, both are coming back as '9155'.

Can anyone see what I am doing wrong?

Thanks

Instead of dumping all of your code, can you give us an MCVE that has just the few lines that actually matter, and includes the relevant input data instead of making us crawl an entire website to get it?
– abarnert, Commented Sep 21, 2014 at 1:08
@user3045351 I guess @abarnert meant to have a code snippet that clearly demonstrates the problem without even scrapy being involved here. Because, strictly speaking, the question is not scrapy-specific, but more about "how to extract certain fields from a string that is a snippet of javascript code".
– alecxe, Commented Sep 21, 2014 at 1:31
It's better now… but still not very good. The code isn't runnable with indentation errors all over the place. And why not just put the variable into the code instead of describing how to do it?
– abarnert, Commented Sep 21, 2014 at 1:37
As a side note, if you use re.compile, you get back a regex object that you can use directly: stagematch.search(var), not re.search(stagematch, var).
– abarnert, Commented Sep 21, 2014 at 1:38
Meanwhile, when I run this code against that data, the re.search returns None. So, it's not actually demonstrating your problem at all.
– abarnert, Commented Sep 21, 2014 at 1:41
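
For readers following along, here is a minimal sketch (Python 3, standard library only) of why both values can come back identical. It assumes a shortened pattern and a trimmed input snippet rather than the full page, so it is an illustration of the mechanism, not a reproduction of the original run. The lazy group stops at the first comma, so only the first pair is captured, and the maxsplit argument of split() makes no difference when the captured text contains a single colon.

```python
import re

# Trimmed input, assumed for illustration only.
snippet = """
        defaultParams: {
            stageId : 9155,
            field: 2,
            teamId: 32
        }
"""

# Compile once, then call .search() on the compiled pattern object.
# The lazy (.*?) stops at the first comma, so only the first pair is captured.
pattern = re.compile(r"defaultParams:\s*\{\s*(.*?),", re.S)
match = pattern.search(snippet)
captured = match.group(1)
print(repr(captured))  # 'stageId : 9155'

# maxsplit only caps the number of splits. With one ':' in the captured
# text, both calls return the same two-element list.
first = captured.split(':', 1)
second = captured.split(':', 3)
print(first, second)
```

Since teamId never makes it into the captured group, no choice of split index can recover it; the pattern itself would have to capture more of the block.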

alecxe · Accepted Answer · 2014-09-21 01:37:31Z

An alternative solution would be to skip regexes altogether and parse the JavaScript code with a JavaScript parser. Example using slimit:

SlimIt is a JavaScript minifier written in Python. It compiles JavaScript into more compact code so that it downloads and runs faster.

SlimIt also provides a library that includes a JavaScript parser, lexer, pretty printer and a tree visitor.

from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """
var defaultTeamStatsConfigParams = {
        data:{
            url: 'stage-team-stat'
        },
        defaultParams: {
            stageId : 9155,
            field: 2,
            teamId: 32
        }
    };

    DataStore.prime('stage-team-stat', defaultTeamStatsConfigParams.defaultParams, [{"RegionId":252,"RegionCode":"gb-eng","TournamentName":"Premier League","TournamentId":2,"StageId":9155,"Field":{"Value":2,"DisplayName":"Overall"},"TeamName":"Manchester United","TeamId":32,"GamesPlayed":4,"Goals":6,"Yellow":7,"Red":0,"TotalPasses":2480,"Possession":247,"AccuratePasses":2167,"AerialWon":61,"AerialLost":49,"Rating":7.01,"DefensiveRating":7.01,"OffensiveRating":6.79,"ShotsConcededIBox":13,"ShotsConcededOBox":21,"TotalTackle":75,"Interceptions":71,"Fouls":54,"WasFouled":46,"TotalShots":49,"ShotsBlocked":9,"ShotsOnTarget":19,"Dribbles":44,"Offsides":3,"Corners":17,"Throws":73,"Dispossesed":36,"TotalClearance":78,"Turnover":0,"Ranking":0}]);

    var stageStatsConfig = {
        id: 'team-stage-stats',
        singular: true,
        filter: {
                instanceType: WS.Filter,
                id: 'team-stage-stats-filter',
                categories: { data: [{ value: 'field' }] },
                singular: true
        },
        params: defaultTeamStatsConfigParams,
        content: {
            instanceType: TeamStageStats,
            view: {
                renderTo: 'team-stage-stats-content'
            }
        }
    };

    var stageStats = new WS.Panel(stageStatsConfig);
    stageStats.load();
"""

parser = Parser()
tree = parser.parse(data)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print fields['stageId'], fields['field'], fields['teamId']

Prints 9155 2 32.

Here we are iterating over the syntax tree nodes and constructing a dictionary from all assignments. Among them we have stageId, field and teamId.
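
If installing slimit is not an option, the same idea of collecting key/value pairs into a dictionary can be approximated with the standard library. Note that this swaps the parser for a regex, so it is only a rough sketch: it assumes the defaultParams block is flat, contains only `name: integer` pairs, and has no nested braces. The function name extract_default_params is hypothetical, and the code is written for Python 3.

```python
import re

js = """
var defaultTeamStatsConfigParams = {
        data:{
            url: 'stage-team-stat'
        },
        defaultParams: {
            stageId : 9155,
            field: 2,
            teamId: 32
        }
    };
"""


def extract_default_params(source):
    """Return the integer key/value pairs found inside the defaultParams block."""
    # Grab everything between the braces that follow defaultParams.
    # Assumes no nested braces, so the lazy match ends at the right '}'.
    block = re.search(r"defaultParams\s*:\s*\{(.*?)\}", source, re.S)
    if block is None:
        return {}
    # Collect every "name: digits" pair in the block.
    pairs = re.findall(r"(\w+)\s*:\s*(\d+)", block.group(1))
    return {name: int(value) for name, value in pairs}


params = extract_default_params(js)
print(params["stageId"], params["field"], params["teamId"])  # 9155 2 32
```

A real parser remains the more robust choice, since this sketch breaks as soon as the block contains strings, nested objects, or non-integer values.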

Here is how you can apply the slimit solution to your Scrapy spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


def get_fields(data):
    parser = Parser()
    tree = parser.parse(data)
    return {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
            for node in nodevisitor.visit(tree)
            if isinstance(node, ast.Assign)}


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United'),deny=('/News', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        titles = sel.xpath("normalize-space(//title)")
        myheader = titles.extract()[0]

        script = sel.xpath('//div[@id="team-stage-stats"]/following-sibling::script/text()').extract()[0]
        script_fields = get_fields(script)
        print script_fields['stageId'], script_fields['field'], script_fields['teamId']

Thanks, that did what I wanted. I won't pretend I understood all of it straight away though, so I will have to do some further reading. Thanks
