How to use re() to extract data from javascript variable using scrapy?

Question

My items.py file goes like this:

from scrapy.item import Item, Field

class SpiItem(Item):
    title = Field()
    lat = Field()
    lng = Field()
    add = Field()

and the spider is:

import scrapy
import re

from spi.items import SpiItem

class HdfcSpider(scrapy.Spider):
    name = "hdfc"
    allowed_domains = ["hdfc.com"]
    start_urls = ["http://hdfc.com/branch-locator"]

    def parse(self,response):
        addresses = response.xpath('//script')
        for sel in addresses:
            item = SpiItem()
            item['title'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="title":).+(?=")')
            item['lat'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="latitude":).+(?=")')
            item['lng'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="longitude":).+(?=")')
            item['add'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="html":).+(?=")')
            yield item

The whole JavaScript code, when viewing the page source, sits inside: //html/body/table/tbody/tr[348]/td[2].

Why is my code not working? I want to extract just the four fields mentioned in the items file.

Please fix your indentation.
– kylieCatt, Commented Jun 1, 2015 at 13:00

docs.scrapy.org/en/latest/topics/…
– leo, Commented Jul 4, 2021 at 11:29

alecxe · Accepted Answer · 2015-06-01 13:20:58Z

16

Instead of extracting field by field using regular expressions, extract the complete locations object, load it via json.loads() and extract the desired data from the Python dictionary you'll get:

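import json
import re

def parse(self, response):
    # Capture the object literal assigned to "var locations".
    pattern = re.compile(r"var locations= ({.*?});", re.MULTILINE | re.DOTALL)
    locations = response.xpath('//script[contains(., "var locations")]/text()').re(pattern)[0]
    # Parse the captured text as JSON to get a Python dictionary.
    locations = json.loads(locations)
    for title, data in locations.items():
        print(title)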
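
To try the regex-plus-json.loads() idea without running a spider, here is a minimal standalone sketch using only the standard library. The SAMPLE_SCRIPT text, the branch names, and the assumption that each entry carries "latitude", "longitude" and "html" keys are hypothetical, modeled on the field names in the question; the real page may be structured differently. The helper names extract_locations and to_items are also made up for illustration.

```python
import json
import re

# Hypothetical stand-in for the page's inline script; the real page's
# structure and field names may differ.
SAMPLE_SCRIPT = '''
var locations= {
    "Branch A": {"latitude": "19.07", "longitude": "72.87", "html": "1 Example Road"},
    "Branch B": {"latitude": "28.61", "longitude": "77.20", "html": "2 Sample Street"}
};
var other = 1;
'''

# Same pattern as in the answer: capture everything from the first "{"
# after "var locations= " up to the first "};".
LOCATIONS_RE = re.compile(r"var locations= ({.*?});", re.DOTALL)


def extract_locations(script_text):
    """Return the locations object as a dict, or {} if it is not found."""
    match = LOCATIONS_RE.search(script_text)
    if match is None:
        return {}
    return json.loads(match.group(1))


def to_items(locations):
    """Map each title/data pair to the four fields from items.py."""
    return [
        {
            "title": title,
            "lat": data.get("latitude"),    # assumed key name
            "lng": data.get("longitude"),   # assumed key name
            "add": data.get("html"),        # assumed key name
        }
        for title, data in locations.items()
    ]


items = to_items(extract_locations(SAMPLE_SCRIPT))
for item in items:
    print(item)
```

In a spider, the same dictionaries could be used to fill SpiItem instances and yield them from parse(). Note that the non-greedy pattern stops at the first "};", so it would cut the object short if any value happened to contain that sequence.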

edited Jun 1, 2015 at 13:20

answered Jun 1, 2015 at 13:11


2 Comments

alecxe Over a year ago

@Aditya first of all, you don't need to loop over the scripts: there is only one script you need to locate. Plus, you are basically searching for a script tag inside every script tag you've found, which, logically, results in nothing being scraped.

alecxe Over a year ago

@Aditya anyway, I've provided a better and more reliable approach.
