
I want to use the output from a spider inside a Python script. To accomplish this, I wrote the following code based on another thread.

The issue I'm facing is that the function spider_results() only returns a list containing the last item over and over again instead of a list with all the found items. When I run the same spider manually with the scrapy crawl command, I get the desired output. The output of the script, the manual JSON output, and the spider itself are below.

What's wrong with my code?

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from circus.spiders.circus import MySpider

from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)


    dispatcher.connect(crawler_results, signal=signals.item_passed)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

Script output:

[{'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}, {'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}, {'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}]

JSON output with scrapy crawl:

[
{"home_team": "Los Angeles Angels", "away_team": "Seattle Mariners", "event_time": "2019-06-08 02:07:00", "home_odds": 1.58, "away_odds": 2.4, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Diego Padres", "away_team": "Washington Nationals", "event_time": "2019-06-08 02:10:00", "home_odds": 1.87, "away_odds": 1.97, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Francisco Giants", "away_team": "Los Angeles Dodgers", "event_time": "2019-06-08 02:15:00", "home_odds": 2.85, "away_odds": 1.44, "last_update": "2019-06-06 20:48:16", "league": "MLB"}
]

MySpider:

from scrapy.spiders import Spider
from ..items import MatchItem
import json
import datetime
import dateutil.parser

class MySpider(Spider):
    name = 'first_spider'

    start_urls = ["https://websiteXYZ.com"]

    def parse(self, response):
        item = MatchItem()

        timestamp = datetime.datetime.utcnow()

        response_json = json.loads(response.body)

        for event in response_json["el"]:
            for team in event["epl"]:
                if team["so"] == 1: item["home_team"] = team["pn"]
                if team["so"] == 2: item["away_team"] = team["pn"]

            for market in event["ml"]:
                if market["mn"] == "Match result":
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]

                if market["mn"] == 'Moneyline':
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        #if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]


            item["last_update"] = timestamp
            item["league"] = event["scn"]

            yield item

Edit:

Based on the answer below, I tried the following script:

controller.py

import json
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from betsson_controlled.spiders.betsson import Betsson_Spider
from scrapy.utils.project import get_project_settings


class MyCrawlerRunner(CrawlerRunner):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (Same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

    # add callback - when crawl is done, call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items

def return_spider_output(output):
    return json.dumps([dict(item) for item in output])

settings = get_project_settings()
runner = MyCrawlerRunner(settings)
spider = Betsson_Spider()
deferred = runner.crawl(spider)
deferred.addCallback(return_spider_output)


reactor.run()
print(deferred)

When I execute controller.py, I get:

<Deferred at 0x7fb046e652b0 current result: '[{"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}]'>
  • This is a shot in the dark, but they've refactored how the crawler runner works in the newly released Scrapy. See the changes made here in the docs and decide if it may help your cause. Your result indicates that your deferred is working, but somehow the spider is either not finishing or not closing. docs.scrapy.org/en/1.7/news.html Commented Jul 18, 2019 at 23:34
  • Thanks for thinking of me. I'll look into it. Not sure if I'll keep using Scrapy for this project at all, if it's that complicated to implement such simple functionality. Commented Jul 20, 2019 at 11:46
  • I know that my answer is the correct approach; we are just missing something. I have this code running in production on an API endpoint, but I know the feeling when trying to figure something like this out. Building a requests-based implementation with all the items and features of Scrapy running concurrently would probably be as difficult as resolving this issue. We at least know that the deferred is working as a callback, so you should be able to troubleshoot the problem from here. Commented Jul 20, 2019 at 18:06
  • Try to run your code in a crawl function like I did in the last piece of code with the defer callbacks decorator and see if that does anything. I think you may have to stop the reactor for the code to finish executing. reactor.run() is supposed to block until the script is done but it's never finishing. Once it's done, all your items should be in the deferred variable.... (A sketch along these lines follows these comments.) Commented Jul 20, 2019 at 18:08
  • Updated the answer with another stab at it... try CrawlerProcess instead of the runner; it seems more like what you need, whereas I needed the runner. Commented Jul 24, 2019 at 9:30
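Picking up on the "stop the reactor" comments above, here is an untested sketch of how controller.py might be adjusted so that reactor.run() actually returns: stop the reactor once the crawl's Deferred has fired, and stash the callback result so it can be used afterwards (all names are taken from controller.py).

output = []  # will hold the JSON string produced by return_spider_output

deferred = runner.crawl(spider)
deferred.addCallback(return_spider_output)   # items -> JSON string
deferred.addCallback(output.append)          # keep the result for after reactor.run()
deferred.addBoth(lambda _: reactor.stop())   # unblock reactor.run(), even on failure

reactor.run()                                # blocks until reactor.stop() is called
print(output[0] if output else "crawl failed")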

1 Answer


RECENT EDITS: After reading CrawlerProcess vs CrawlerRunner I realized that you probably want CrawlerProcess. I had to use the runner since I needed Klein to be able to use the deferred object. CrawlerProcess expects only Scrapy, whereas CrawlerRunner expects other scripts/programs to interact with it. Hope this helps.

You need to modify CrawlerRunner/CrawlerProcess and use signals and/or callbacks to pass each item from the CrawlerRunner into your script.

How to integrate Flask & Scrapy? If you look at the options in the top answer, the one with Twisted Klein and Scrapy is an example of what you are looking for, since it does the same thing except send the results to a Klein HTTP server after the crawl. You can set up a similar method with CrawlerRunner to send each item to your script as it is crawling. NOTE: that particular question sends the results to a Klein web server after the items are collected. Its answer builds an API which collects the results, waits until crawling is done, and dumps them to JSON, but you can apply the same method to your situation. The main thing to look at is how CrawlerRunner was subclassed and extended to add the extra functionality.

What you want is a separate script that imports your Spider and extends CrawlerRunner. When you execute this script, it starts your Twisted reactor and runs the crawl using your customized runner.

That said, this problem could probably also be solved with an item pipeline: create a custom item pipeline that passes each item into your script before returning it.
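For illustration, a minimal sketch of that pipeline idea could look like the following (module and class names are made up; hook it in through the standard ITEM_PIPELINES setting):

# myproject/pipelines.py  (hypothetical module and class name)

class ItemCollectorPipeline:
    """Collects a copy of every scraped item into a class-level list."""
    items = []

    def open_spider(self, spider):
        # start with a fresh list for each crawl
        type(self).items = []

    def process_item(self, item, spider):
        # store a copy so later changes to the item object don't affect the list
        type(self).items.append(dict(item))
        return item

Enable it with ITEM_PIPELINES = {"myproject.pipelines.ItemCollectorPipeline": 100} in your settings, then read ItemCollectorPipeline.items from your script once process.start() has returned. This only works because the pipeline lives in the same process as the calling script.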

# main.py

import json
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor, defer # import we missed
from myproject.spiders.mymodule import MySpiderName
from scrapy.utils.project import get_project_settings


class MyCrawlerProcess(CrawlerProcess):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        crawler = self.create_crawler(crawler_or_spidercls)

        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        dfd = self._crawl(crawler, *args, **kwargs)

        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    return json.dumps([dict(item) for item in output])


process = MyCrawlerProcess(get_project_settings())
deferred = process.crawl(MySpiderName)
deferred.addCallback(return_spider_output)


process.start()  # Script should block here again, but I'm not sure it will work right without using reactor.run()
print(deferred)

Again, this code is a guess I haven't tested. I hope it sets you in a better direction.


Comments:

I think using the item pipeline wouldn't help, since I need all items in one list inside the other script and I don't want to run a particular script on each item. I would think there must be a simpler solution than the one proposed inside the mentioned thread. Wanting to use the scraped data inside a script without first writing it to a database shouldn't be that uncommon.
So you want to pass them all at the same time when it's done? If that's the case, the solution I showed you with the Klein API is what you need. You could also chain the two commands together: scrapy crawl foo -o bar.csv && python foobar bar.csv. Scrapy is async, so it makes things different. My answer about creating your own CrawlerRunner to gather up the items I believe is correct. If you want to pass them one at a time, the item pipeline would work fine.
I want to have one main_script from which I can: 1) run different crawlers, which return a list with all the found items back to main_script when they are done, 2) process the data, 3) repeat. (A sketch of such a loop is shown after these comments.)
I edited my post above. It would be nice if you could take a look.
See the changes I made. This should work okay. It should run, collect all items, and then they should be available in deferred.
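As a rough, untested sketch of that main_script loop, the sequential-crawl pattern from the Scrapy documentation can be combined with the MyCrawlerRunner subclass from the answer (SpiderA, SpiderB, and process_data are placeholders for your own spiders and post-processing):

from twisted.internet import reactor, defer
from scrapy.utils.project import get_project_settings


@defer.inlineCallbacks
def main():
    runner = MyCrawlerRunner(get_project_settings())
    for spider_cls in (SpiderA, SpiderB):       # placeholder spider classes
        items = yield runner.crawl(spider_cls)  # list built by return_items
        process_data(items)                     # placeholder post-processing step
    reactor.stop()                              # error handling omitted for brevity


main()
reactor.run()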