
I am trying to save scraped data in a MySQL database. My script.py is:

# -*- coding: utf-8 -*-
import scrapy
import unidecode
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html


class ElementSpider(scrapy.Spider):
    name = 'books'
    download_delay = 3
    allowed_domains = ["goodreads.com"]
    start_urls = ["https://www.goodreads.com/list/show/19793.I_Marked_My_Calendar_For_This_Book_s_Release",]

    # NOTE: rules are only honoured by CrawlSpider; scrapy.Spider ignores them
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="next_page"]',)), callback="parse", follow=True),)

    def parse(self, response):
        for href in response.xpath('//div[@id="all_votes"]/table[@class="tableList js-dataTooltip"]/tr/td[2]/div[@class="js-tooltipTrigger tooltipTrigger"]/a/@href'):
            full_url = response.urljoin(href.extract())
            print full_url
            yield scrapy.Request(full_url, callback=self.parse_books)
            break  # only the first book link on each page is followed


        next_page = response.xpath('.//a[@class="next_page"]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = 'https://www.goodreads.com' + next_href
            print next_page_url
            request = scrapy.Request(next_page_url, self.parse)
            yield request

    def parse_books(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//div[@id="metacol"]/h1[@class="bookTitle"]/text()').extract(),
            'link': response.xpath('//div[@id="metacol"]/h1[@class="bookTitle"]/a/@href').extract(),
        }

And my pipeline.py is:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


import MySQLdb
import hashlib
from scrapy.exceptions import DropItem

from scrapy.http import Request
import sys

class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect("localhost", "root", "", "books")
        self.cursor = self.conn.cursor()
        print "connected to DB"

    def process_item(self, item, spider):
        print "hi"

        try:
            self.cursor.execute("""INSERT INTO books_data(next_page_url) VALUES (%s)""", (item['url']))
            self.conn.commit()

        except Exception as e:
            print e

When I run the script there is no error and the spider runs fine, but I think process_item is never called; it doesn't even print "hi".

1 Answer
Your method signature is wrong; it should take item and spider parameters:

process_item(self, item, spider)

Also, you need to have the pipeline set up in your settings.py file. ITEM_PIPELINES is a dict mapping the pipeline's import path to an order number (0-1000, lower runs first):

    ITEM_PIPELINES = {
        "project_name.path.SQLStore": 300,
    }

Your execute syntax is also incorrect: the query parameters must be passed as a tuple:

    self.cursor.execute("""INSERT INTO books_data(next_page_url) VALUES (%s)""",
                        (item['url'],))  # <- add the trailing comma
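
Putting the fixes together, a minimal working pipeline might look like this (a sketch, assuming the project is named test1 as in the comments below, and that the books database and books_data table already exist):

    # test1/pipelines.py
    import MySQLdb


    class SQLStore(object):
        def __init__(self):
            self.conn = MySQLdb.connect("localhost", "root", "", "books")
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # parameters are passed as a tuple -- note the trailing comma
            self.cursor.execute(
                "INSERT INTO books_data(next_page_url) VALUES (%s)",
                (item['url'],))
            self.conn.commit()
            return item  # return the item so any later pipelines still see it

with the matching registration in settings.py:

    # test1/settings.py
    ITEM_PIPELINES = {
        'test1.pipelines.SQLStore': 300,
    }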

Comments

I have already tried this but it is not working. I added the pipeline in settings.py like this: ITEM_PIPELINES = { 'test1.pipelines.SQLStore': 300, }
What is in the __init__.py file in your pipelines directory? Also, do you have process_item(self, item, spider)?
Then how does Scrapy find your SQLStore pipeline?
I mean, is your file actually called pipeline.py or pipelines.py? You have pipeline in your question and pipelines above. Also, where do you yield any items?
Well, look at my pipeline.py: the print inside def __init__(self) works, but when I print inside def process_item(self, item, spider) nothing is printed, which means process_item is never called.
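
If process_item really is never called, one quick way to narrow it down (a sketch that bypasses Scrapy entirely; it assumes the module really is importable as test1.pipelines) is to instantiate the pipeline by hand and feed it a fake item:

    from test1.pipelines import SQLStore  # assumed module path

    store = SQLStore()
    store.process_item({'url': 'https://www.goodreads.com/book/show/1'}, spider=None)

If that inserts a row, the MySQL side works and the problem is pipeline registration. It also helps to check the very start of the crawl log: Scrapy prints the list of enabled item pipelines at startup, so if SQLStore is missing there, the ITEM_PIPELINES setting (or the module path in it) is not being picked up.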
