16

I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results in that script file, in some function. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())


d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

def spider_output(output):
    # do something with that output
    pass

How can I get the spider output in the 'spider_output' method? Is it possible to get the output/results?

5 Answers

31

Here is a solution that collects all output/results in a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider is your spider class, imported from your project
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

3 Comments

This doesn't seem to work for me; do you have a pipeline working or anything?
It works for me; just note that MySpider is your spider class. It's very helpful for beginners.
FYI, as of Scrapy 0.14, item_passed was renamed to item_scraped. Source: docs.scrapy.org/en/latest/news.html. Older item_passed docs: docs.scrapy.org/en/0.9/topics/signals.html#item-passed. New item_scraped docs: docs.scrapy.org/en/latest/topics/signals.html#item-scraped
8

This is an old question, but for future reference: if you are working with Python 3.6+, I recommend using scrapyscript, which allows you to run your spiders and get the results in a very simple way:

from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json

# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}

# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')

# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)

# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])

# Print the consolidated results
print(json.dumps(data, indent=4))

# Output:
[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform \u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]


1

AFAIK there is no way to do this, since crawl():

Returns a deferred that is fired when the crawling is finished.

And the crawler doesn't store results anywhere other than outputting them to the logger.

However, returning output would conflict with the asynchronous nature and structure of Scrapy, so saving to a file and then reading it is the preferred approach here.
You can devise a pipeline that saves your items to a file and then read that file in your spider_output, as sketched below. You will receive your results, since reactor.run() blocks your script until the output file is complete anyway.
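A minimal sketch of that idea, assuming the pipeline is enabled in your project's ITEM_PIPELINES setting and that items are JSON-serializable; the file name and class name are illustrative, not part of the original answer:

import json

class JsonLinesExportPipeline:
    """Illustrative pipeline: writes each scraped item to items.jl, one JSON object per line."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # assumes the item can be converted to a plain dict and serialized as JSON
        self.file.write(json.dumps(dict(item)) + '\n')
        return item


def spider_output():
    # safe to call after reactor.run() has returned and the file is complete
    with open('items.jl') as f:
        return [json.loads(line) for line in f]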

2 Comments

Yes, you are right that the crawler doesn't store results, but using signals we can get the results.
@SheikhJames Oh right, forgot about signals completely. That's very clever!
0

My advice is to use the Python subprocess module to run the spider from the script, rather than using the method provided in the Scrapy docs to run a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs and even statements that you print from inside the spider.

In Python 3, execute the spider with the run method. For example:

import subprocess

# 'command' is defined below
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    error = process.stderr.decode('utf-8')  # inspect the error output here

Setting stdout/stderr to subprocess.PIPE allows capturing the output, so it is very important to set this flag. Here command should be a sequence or a string (if it's a string, call the run method with one more parameter: shell=True). For example:

command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah' # with shell=True
# or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah') # without shell=True

Also, process.stdout will contain the output from the script, but it will be of type bytes; you need to convert it to str using decode('utf-8').
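Putting these pieces together, a minimal sketch; the spider name 'website' and the customArg argument are the placeholders used above, not real project values:

import shlex
import subprocess

# placeholder spider name and argument; replace with your own
command = shlex.split('scrapy crawl website -a customArg=blahblah')
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

if process.returncode == 0:
    result = process.stdout.decode('utf-8')  # logs and anything the spider prints
    print(result)
else:
    print(process.stderr.decode('utf-8'))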


0

This will return all the results of a spider in a list (using scrapyscript, as in the answer above):

from scrapyscript import Job, Processor
from scrapy.utils.project import get_project_settings


def get_spider_output(spider, **kwargs):
    job = Job(spider, **kwargs)
    processor = Processor(settings=get_project_settings())
    return processor.run([job])
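For example, a hypothetical call; MySpider, the import path, and the url argument are placeholders for your own spider and its arguments:

from myproject.spiders import MySpider  # your own spider class

results = get_spider_output(MySpider, url='https://example.com')
print(results)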

