I'm trying to extract the content of a script tag on a store locator using Scrapy, but I'm a bit stuck.

Within view source, the script content looks like this:

<script>
    var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE","col_latitude":"53.6825556","col_longitude":"-0.438675","col_address1":"9a Market Lane","col_name":"XX","col_website":"https:\/\/branches.XX.co.uk\/barton-upon-humber\/9a-market-lane.html?type=0&stores=DN18+5DE?utm_source=directories&utm_medium=local&utm_campaign=yext&utm_content=1444","col_facebook":"https:\/\/www.facebook.com\/XXDN185DE\/","col_city":"Barton-Upon-Humber","col_state":"North Lincolnshire","col_yextid":"1444"}...
</script>

I copied the XPath and used response.xpath('/html/body/script[1]/text()') to retrieve it in the terminal.

Now I want to parse the information in the script into separate columns, which I'll eventually load into a CSV file.

How should I go about parsing that information? For example, how would I get col_postcode? I've read other posts where people use regex and json.

  • So, are you looking for an alternative solution other than loading it using regex & json? Commented May 24, 2020 at 11:38
  • No, I'm looking for any solution! It would be great to know how I should approach this. Commented May 24, 2020 at 11:49

1 Answer

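The pattern \[(.*)\] matches from the first [ to the last ] in the script, with .* greedily capturing zero or more characters in between. search_.group() returns the whole match, brackets included, which is a JSON array that json.loads can parse.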

import re
import json

# response.xpath() returns a list of Selector objects; .extract() returns
# their contents as a list of strings.
for script in response.xpath("/html/body/script[1]/text()").extract():

    # A raw string keeps the backslashes literal; re.DOTALL lets .* span newlines.
    search_ = re.search(r"\[(.*)\]", script, re.DOTALL)
    # If multiple script tags exist, only process the ones that match.
    if search_:
        # group() is the whole match, brackets included, i.e. a JSON array.
        for doc in json.loads(search_.group()):
            print(doc['col_postcode'])

Output

DN18 5DE
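
Since the question mentions eventually loading the data into CSV, here is a minimal sketch of that step using only the standard library. SCRIPT_TEXT is a hypothetical, trimmed stand-in for the script content (the real array is longer and truncated in the question), and parse_locations and locations_to_csv are illustrative helper names, not part of Scrapy.

```python
import csv
import io
import json
import re

# Hypothetical stand-in for the <script> text, trimmed to two fields per store.
SCRIPT_TEXT = '''
    var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE","col_city":"Barton-Upon-Humber"}];
'''


def parse_locations(script_text):
    """Return the list of store dicts embedded in the script, or [] if none."""
    match = re.search(r"\[.*\]", script_text, re.DOTALL)
    return json.loads(match.group()) if match else []


def locations_to_csv(locations, fieldnames, out):
    """Write the chosen fields of each store as one CSV row."""
    # extrasaction="ignore" drops keys that are not listed in fieldnames.
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(locations)


locations = parse_locations(SCRIPT_TEXT)
# An in-memory buffer keeps the sketch self-contained; a real file opened
# with open("stores.csv", "w", newline="") would work the same way.
buffer = io.StringIO()
locations_to_csv(locations, ["col_postcode", "col_city"], buffer)
print(buffer.getvalue())
```

When running a spider, Scrapy's built-in feed export (for example scrapy crawl stores -o stores.csv) can write yielded items to CSV directly, so a manual writer like this is mainly useful outside a spider.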

3 Comments

Hi, when entering this I get: TypeError: expected string or bytes-like object
Thanks Sushanth, this works! One question I have is around putting this information back into my Spider with Scrapy. How should this be coded?

import scrapy

class bb_spider(scrapy.Spider):
    name = "stores"
    start_urls = [ 'xx.co.uk/store-locator' ]

    def parse(self, response):
        for script in response.xpath("/html/body/script[1]/text()").extract():
            yield { 'post_code' : stores.xpath }

I'm stuck on the yield part! Thanks
Great, happy to help.
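
For the yield question in the comments, one possible approach is to move the regex and json logic into a small generator and have parse() yield from it. The sketch below is an assumption about how this could be wired up, not the answerer's code: iter_store_items is a hypothetical helper name, and the spider part is shown as a comment so the block runs without Scrapy installed. The helper also skips anything that is not a string, because re.search raises "TypeError: expected string or bytes-like object" when given None or a list; whether that was the cause of the error in the first comment is only a guess.

```python
import json
import re


def iter_store_items(script_texts):
    """Yield one item dict per store found in the given script strings."""
    for text in script_texts:
        if not isinstance(text, str):
            continue  # re.search raises TypeError on None or other non-strings
        match = re.search(r"\[.*\]", text, re.DOTALL)
        if not match:
            continue  # this script does not contain a bracketed array
        try:
            stores = json.loads(match.group())
        except json.JSONDecodeError:
            continue  # the brackets enclosed something that is not valid JSON
        for store in stores:
            if isinstance(store, dict):
                yield {"post_code": store.get("col_postcode")}


# In the spider, parse() could delegate to the helper:
#
#     def parse(self, response):
#         scripts = response.xpath("/html/body/script[1]/text()").extract()
#         yield from iter_store_items(scripts)

# Hypothetical sample input: one matching script, one None, one non-matching.
sample = ['var map_locations = [{"col_postcode":"DN18 5DE"}];', None, "var x = 1;"]
print(list(iter_store_items(sample)))
```

Each yielded dict becomes one Scrapy item, so further columns can be added by including more keys such as col_city or col_address1 in the dict.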
