Using Scrapy to extract script data with regex

Question

I'm trying to extract the content of a script tag on a store locator using Scrapy, but I'm a bit stuck.

Within view source, the script content looks like this:

<script>
    var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE","col_latitude":"53.6825556","col_longitude":"-0.438675","col_address1":"9a Market Lane","col_name":"XX","col_website":"https:\/\/branches.XX.co.uk\/barton-upon-humber\/9a-market-lane.html?type=0&stores=DN18+5DE?utm_source=directories&utm_medium=local&utm_campaign=yext&utm_content=1444","col_facebook":"https:\/\/www.facebook.com\/XXDN185DE\/","col_city":"Barton-Upon-Humber","col_state":"North Lincolnshire","col_yextid":"1444"}...
</script>

I copied the XPath and used response.xpath('/html/body/script[1]/text()') to retrieve the script text in the terminal.

Now I want to parse the information in the script into separate columns, which I'll eventually load into a CSV file.

How should I go about parsing that information? For example, how would I get col_postcode? I've read other posts where people use regex and json.

So, are you looking for an alternative to loading the data with regex and json?
– sushanth, Commented May 24, 2020 at 11:38
No, I'm looking for any solution! It would be great to know how I should approach this.
– Will Fletcher, Commented May 24, 2020 at 11:49

sushanth · Accepted Answer · 2020-05-25 02:32:44Z

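The pattern \[(.*)\] matches everything from the first [ to the last ] in the script text. The .* part is greedy and matches zero or more characters (any character except a newline), so the whole array between the brackets is captured.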

import json
import re

# response.xpath() returns a list of Selector objects; .extract() turns them into plain strings.
for script in response.xpath("/html/body/script[1]/text()").extract():
    # Raw string so that \[ and \] reach the regex engine unchanged.
    search_ = re.search(r"\[(.*)\]", script)
    # Several script tags may match the XPath; skip any without a [...] array.
    if search_:
        # group() is the whole match, brackets included, so it parses as a JSON array.
        for doc in json.loads(search_.group()):
            print(doc['col_postcode'])

Output

DN18 5DE
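
To try the same approach outside a Scrapy session, here is a minimal sketch that runs on a plain string. The sample text is a shortened stand-in for the script in the question, the second record is made up for illustration, and parse_locations is a hypothetical helper name. It also shows one way to turn the parsed records into CSV columns with the standard csv module.

```python
import csv
import io
import json
import re

# Shortened stand-in for the script text in the question.
# The second record is invented purely for illustration.
SCRIPT_TEXT = (
    'var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE",'
    '"col_city":"Barton-Upon-Humber"},'
    '{"col_id":"2","col_postcode":"AB1 2CD","col_city":"Exampletown"}];'
)


def parse_locations(script_text):
    """Return the list of dicts held in the [...] literal, or [] if none is found."""
    # Greedy .* runs from the first "[" to the last "]".
    # re.DOTALL lets it cross line breaks if the array is pretty-printed.
    match = re.search(r"\[.*\]", script_text, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group())


locations = parse_locations(SCRIPT_TEXT)
postcodes = [loc["col_postcode"] for loc in locations]
print(postcodes)

# Write two chosen columns as CSV. A string buffer keeps the sketch file-free;
# extrasaction="ignore" skips keys that are not listed in fieldnames.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["col_postcode", "col_city"], extrasaction="ignore"
)
writer.writeheader()
writer.writerows(locations)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the match is greedy, this assumes the array is the only bracketed literal in the script. If other [...] text follows it, the match would run too far and json.loads would raise an error, so a tighter pattern anchored on the variable name may be needed.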

edited May 25, 2020 at 2:32

answered May 24, 2020 at 12:14

sushanth


3 Comments

Will Fletcher Over a year ago

Hi, when entering this I get: TypeError: expected string or bytes-like object
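
That TypeError is what re.search raises when its second argument is not a str or bytes object. The comment does not show the code that was run, so the cause is a guess: a common one is passing a Selector, or a None from a selector that matched nothing, straight to re.search instead of the extracted string. A minimal sketch of a guard, using a hypothetical helper named find_array:

```python
import re

ARRAY_PATTERN = re.compile(r"\[(.*)\]")


def find_array(script_text):
    """Return the bracketed array text, or None if the input is unusable."""
    # re.search raises TypeError for None or other non-string input, so check first.
    if not isinstance(script_text, str):
        return None
    match = ARRAY_PATTERN.search(script_text)
    return match.group() if match else None


# Passing None (for example, the result of a selector that matched nothing)
# reproduces the error from the comment.
try:
    re.search(r"\[(.*)\]", None)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

print(find_array(None))
print(find_array('var x = [{"a": 1}];'))
```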

Will Fletcher Over a year ago

Thanks Sushanth, this works! One question I have is around putting this information back into my Spider with Scrapy. How should this be coded?

import scrapy

class bb_spider(scrapy.Spider):
    name = "stores"
    start_urls = [
        'xx.co.uk/store-locator'
    ]

    def parse(self, response):
        for script in response.xpath("/html/body/script[1]/text()").extract():
            yield {
                'post_code' : stores.xpath
            }

I'm stuck on the yield part! Thanks

sushanth Over a year ago

Great, happy to help.
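
The yield question in the comment above was left unanswered in the thread. One possible way to wire the accepted answer's regex and json steps into a spider is sketched below. It is not the answerer's code: iter_stores and StoresSpider are hypothetical names, the item keys are illustrative, and the URL is the placeholder from the comment with an assumed https:// scheme added, since Scrapy rejects start URLs without a scheme.

```python
import json
import re

try:
    import scrapy
    BaseSpider = scrapy.Spider
except ImportError:
    # Fallback so the helper below can be tried without Scrapy installed.
    BaseSpider = object


def iter_stores(script_text):
    """Yield one item dict per location found in the script text."""
    match = re.search(r"\[.*\]", script_text, re.DOTALL)
    if match is None:
        return
    for doc in json.loads(match.group()):
        # .get() returns None rather than raising if a record lacks a key.
        yield {
            "post_code": doc.get("col_postcode"),
            "city": doc.get("col_city"),
        }


class StoresSpider(BaseSpider):
    name = "stores"
    # Placeholder URL from the comment; the scheme is an assumption.
    start_urls = ["https://xx.co.uk/store-locator"]

    def parse(self, response):
        # Each yielded dict becomes one scraped item.
        for script in response.xpath("/html/body/script[1]/text()").extract():
            yield from iter_stores(script)
```

Inside a Scrapy project, running scrapy crawl stores -o stores.csv should then write each yielded item as one CSV row through Scrapy's feed exports.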
