16

I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results in that script file, in some function. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())


d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

def spider_output(output):
    # do something with that output
    pass

How can I get the spider output in the 'spider_output' method? Is it possible to get the output/results?

5 Answers

31

Here is a solution that collects all output/results in a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider is your spider class, imported from your project
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

3 Comments

This doesn't seem to work for me; do you have a pipeline working or anything?
It works for me; just note that MySpider is your spider class. It's very helpful for beginners.
FYI, as of Scrapy 0.14, item_passed was renamed to item_scraped. Source: docs.scrapy.org/en/latest/news.html. Older item_passed docs: docs.scrapy.org/en/0.9/topics/signals.html#item-passed. New item_scraped docs: docs.scrapy.org/en/latest/topics/signals.html#item-scraped
8

This is an old question, but for future reference: if you are working with Python 3.6+, I recommend using scrapyscript, which allows you to run your spiders and get the results in a very simple way:

from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json

# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}

# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')

# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)

# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])

# Print the consolidated results
print(json.dumps(data, indent=4))

# Output:
[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform \u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]


1

AFAIK there is no way to do this, since crawl():

Returns a deferred that is fired when the crawling is finished.

And the crawler doesn't store results anywhere other than outputting them to the logger.

However, returning output would conflict with the asynchronous nature and structure of Scrapy, so saving to a file and then reading it is the preferred approach here.
You can devise a pipeline that saves your items to a file and then read that file in your spider_output, as sketched below. You will receive your results, since reactor.run() blocks your script until the output file is complete anyway.
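A minimal sketch of that idea, assuming the pipeline is enabled in your project's ITEM_PIPELINES setting and that items are JSON-serializable; the file name and class name are illustrative, not part of the original answer:

import json

class JsonLinesExportPipeline:
    """Illustrative pipeline: writes each scraped item to items.jl, one JSON object per line."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # assumes the item can be converted to a plain dict and serialized as JSON
        self.file.write(json.dumps(dict(item)) + '\n')
        return item


def spider_output():
    # safe to call after reactor.run() has returned and the file is complete
    with open('items.jl') as f:
        return [json.loads(line) for line in f]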

2 Comments

Yes, you are right that the crawler doesn't store results, but using signals we can get the results.
@SheikhJames Oh right, forgot about signals completely. That's very clever!
0

My advice is to use the Python subprocess module to run the spider from the script, rather than using the method provided in the Scrapy docs to run a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs and even statements that you print from inside the spider.

In Python 3, execute the spider with the run method. For example:

import subprocess

# 'command' is defined below
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    error = process.stderr.decode('utf-8')  # inspect the error output here

Setting stdout/stderr to subprocess.PIPE allows capturing the output, so it is very important to set this flag. Here command should be a sequence or a string (if it's a string, call the run method with one more parameter: shell=True). For example:

command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah' # with shell=True
# or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah') # without shell=True

Also, process.stdout will contain the output from the script, but it will be of type bytes; you need to convert it to str using decode('utf-8').
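Putting these pieces together, a minimal sketch; the spider name 'website' and the customArg argument are the placeholders used above, not real project values:

import shlex
import subprocess

# placeholder spider name and argument; replace with your own
command = shlex.split('scrapy crawl website -a customArg=blahblah')
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

if process.returncode == 0:
    result = process.stdout.decode('utf-8')  # logs and anything the spider prints
    print(result)
else:
    print(process.stderr.decode('utf-8'))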


0

This will return all the results of a spider in a list (using scrapyscript, as in the answer above):

from scrapyscript import Job, Processor
from scrapy.utils.project import get_project_settings


def get_spider_output(spider, **kwargs):
    job = Job(spider, **kwargs)
    processor = Processor(settings=get_project_settings())
    return processor.run([job])
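For example, a hypothetical call; MySpider, the import path, and the url argument are placeholders for your own spider and its arguments:

from myproject.spiders import MySpider  # your own spider class

results = get_spider_output(MySpider, url='https://example.com')
print(results)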

