My items.py file goes like this:

from scrapy.item import Item, Field

class SpiItem(Item):
    title = Field()
    lat = Field()
    lng = Field()
    add = Field()

and the spider is:

import scrapy
import re

from spi.items import SpiItem

class HdfcSpider(scrapy.Spider):
    name = "hdfc"
    allowed_domains = ["hdfc.com"]
    start_urls = ["http://hdfc.com/branch-locator"]

    def parse(self,response):
        addresses = response.xpath('//script')
        for sel in addresses:
            item = SpiItem()
            item['title'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="title":).+(?=")')
            item['lat'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="latitude":).+(?=")')
            item['lng'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="longitude":).+(?=")')
            item['add'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="html":).+(?=")')
            yield item

When I view the page source, the whole JavaScript code sits inside: //html/body/table/tbody/tr[348]/td[2].

Why is my code not working? I want to extract just the four fields mentioned in the items file.

1 Answer

Instead of extracting field by field using regular expressions, extract the complete locations object, load it via json.loads() and extract the desired data from the Python dictionary you'll get:

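import json
import re

def parse(self, response):
    # Capture the object literal assigned to "var locations", up to the first "};".
    pattern = re.compile(r"var locations= ({.*?});", re.MULTILINE | re.DOTALL)
    locations = response.xpath('//script[contains(., "var locations")]/text()').re(pattern)[0]
    # Assumes the captured literal is valid JSON.
    locations = json.loads(locations)
    for title, data in locations.items():  # use iteritems() on Python 2
        print(title)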
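
To try the same idea outside Scrapy, here is a minimal standalone sketch that maps each location onto the four fields from items.py. The sample script text and the per-location keys ("latitude", "longitude", "html") are assumptions taken from the regular expressions in the question, not from the real page, and extract_items is a hypothetical helper name.

```python
import json
import re

# Hypothetical stand-in for the text of the <script> element; the real
# page's structure may differ.
SAMPLE_SCRIPT = """
var map;
var locations= {"Branch A": {"title": "Branch A", "latitude": "19.0760",
  "longitude": "72.8777", "html": "1 Example Road, Mumbai"},
 "Branch B": {"title": "Branch B", "latitude": "28.6139",
  "longitude": "77.2090", "html": "2 Sample Street, Delhi"}};
var zoom = 5;
"""

# Same pattern as in the answer: capture the object literal up to the first "};".
LOCATIONS_RE = re.compile(r"var locations= ({.*?});", re.MULTILINE | re.DOTALL)


def extract_items(script_text):
    """Return one dict per location with the four fields from items.py."""
    match = LOCATIONS_RE.search(script_text)
    if match is None:
        return []  # no "var locations" assignment found
    locations = json.loads(match.group(1))
    items = []
    for title, data in locations.items():
        items.append({
            "title": title,
            "lat": data.get("latitude"),   # assumed key name
            "lng": data.get("longitude"),  # assumed key name
            "add": data.get("html"),       # assumed key name
        })
    return items


items = extract_items(SAMPLE_SCRIPT)
```

In a spider, the same loop could fill a SpiItem and yield it instead of appending a dict; using data.get() keeps a missing key from raising if the real object is shaped differently.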

2 Comments

@Aditya First of all, you don't need to loop over the scripts at all: there is only one script you need to locate. Also, you are basically searching for a script tag inside every script tag you've found, which, logically, results in nothing being scraped.
@Aditya Anyway, I've provided a better and more reliable approach.
