I am using Scrapy to get the integer values of field10 and field12 for a given ID from the following script:

<script>

    Autoslave.jQuery(function ($) {
        "use strict";
        var map = initMap([

        {"field1": "operational",
        "field2": "operational",
        "field3": "operational",
        "ID": 2,
        "field4": "some text",
        "field5": 48.8732135,
        "field6": 2.3903853,
        "field7": 1,
        "field8": "SPACE",
        "field9": "some text",
        "field10": 4,
        "field11": false,
        "field12": 0}, 

        {"field1": "operational",
        "field2": "operational",
        "field3": "operational",
        "ID": 3,
        "field4": "some text",
        "field5": 48.8592806,
        "field6": 2.3773563,
        "field7": 0,
        "field8": "SPACE",
        "field9": "some text",
        "field10": 2,
        "field11": false,
        "field12": 3},

...

</script>

In the Scrapy shell, I've managed to get the script text with response.xpath('//script[14]/text()').extract(), but I don't know how to select my values within that text for a given ID. Any ideas on how to do this (maybe using a regex)?

  • Try this stackoverflow.com/questions/29163395/… Commented Feb 22, 2016 at 14:19
  • Do you know what would be the regex pattern in my case? Thanks! Commented Feb 22, 2016 at 14:21
  • I'm not sure what you're trying to extract. What does your xpath return exactly? And how do you want it to look? Commented Feb 22, 2016 at 14:24
  • My current xpath returns the text above. For a given ID, say 2, I want to get the associated "field10" and/or "field12" values, which are 4 and 0 in this case. Commented Feb 22, 2016 at 14:33

1 Answer

This solution doesn't use a regex. Since the script contains JSON, I would use Python's json module to get the required fields. I am assuming the script defines no variable other than var map.

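import json

# Join the text nodes of the target <script> element into one string
script = ''.join(response.xpath('//script[14]/text()').extract())
# Keep what follows "initMap(" and drop the last character
# (this assumes the last character is the call's closing parenthesis)
json_data = script.split("initMap(")[1].replace("</script>", "")[:-1]
# json.loads expects a string, so build one that wraps the array
data = json.loads('{"data":' + json_data + '}')
fields = data["data"]
for f in fields:
    record_id = f["ID"]  # renamed from id to avoid shadowing the built-in
    field10 = f["field10"]
    field12 = f["field12"]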
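
If the script has extra text after the array (the question elides how it ends), slicing off a fixed number of characters may not leave valid JSON. One possible variant, sketched here under assumptions, is to let json.JSONDecoder().raw_decode parse the first JSON value that follows initMap( and ignore everything after it. The helper names records_from_script and fields_for_id, and the closing "]);" lines in the sample text, are made up for illustration and are not taken from the page.

```python
import json

# Sample script body. The two records are shortened from the question;
# the closing "]);" and "});" lines are an assumption about the elided part.
SAMPLE_SCRIPT = '''
    Autoslave.jQuery(function ($) {
        "use strict";
        var map = initMap([
        {"ID": 2, "field10": 4, "field11": false, "field12": 0},
        {"ID": 3, "field10": 2, "field11": false, "field12": 3}
        ]);
    });
'''


def records_from_script(script_text, marker="initMap("):
    """Return the JSON array passed to `marker` as a list of dicts."""
    marker_pos = script_text.index(marker)
    # Start decoding at the first "[" after the marker.
    start = script_text.index("[", marker_pos)
    # raw_decode parses one JSON value beginning at `start` and
    # ignores whatever JavaScript follows it.
    records, _end = json.JSONDecoder().raw_decode(script_text, start)
    return records


def fields_for_id(records, wanted_id):
    """Return (field10, field12) for the record with this ID, or None."""
    for record in records:
        if record.get("ID") == wanted_id:
            return record.get("field10"), record.get("field12")
    return None


records = records_from_script(SAMPLE_SCRIPT)
print(fields_for_id(records, 2))
```

In a spider or the Scrapy shell, the string passed to records_from_script would be the joined result of response.xpath('//script[14]/text()').extract(), as in the code above.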

3 Comments

Thanks! I think using JSON is a good solution. However, I got the following error: TypeError: replace() takes at least 2 arguments (1 given)
Thanks, I managed to get the JSON data! But the line data = json.loads({"data":json_data}) gives me: TypeError: expected string or buffer
Updated it. That happened because json.loads takes a string or buffer as its argument and we were passing a dict. Try it now; it should work.
