
I have combined some web scraping tutorials and built a simple web crawler that scrapes newly posted questions here on SO. I want to load them into my PostgreSQL database, but my crawler is raising an error that I can't decipher.

Error:

2015-06-09 06:07:10+0200 [stack] ERROR: Error processing {'title': u'Laravel 5 Confused when implements ShoudlQueue',
     'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue'}
    Traceback (most recent call last):
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain
        d.callback(input)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback
        self._startRunCallbacks(result)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/petarp/Documents/PyScraping/RealPython/WebScraping/stack/stack/pipelines.py", line 27, in process_item
        session.commit()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 790, in commit
        self.transaction.commit()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 392, in commit
        self._prepare_impl()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 372, in _prepare_impl
        self.session.flush()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush
        self._flush(objects)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2122, in _flush
        transaction.rollback(_capture_exception=True)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
        compat.reraise(exc_type, exc_value, exc_tb)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2086, in _flush
        flush_context.execute()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
        rec.execute(self)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
        uow
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj
        mapper, table, insert)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 761, in _emit_insert_statements
        execute(statement, params)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
        return meth(self, multiparams, params)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
        return connection._execute_clauseelement(self, multiparams, params)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
        compiled_sql, distilled_params
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
        context)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
        exc_info
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
        reraise(type(exception), exception, tb=exc_tb)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
        context)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
        cursor.execute(statement, parameters)
    sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "url" of relation "reals" does not exist
    LINE 1: INSERT INTO reals (title, url) VALUES ('Laravel 5 Confused w...
                                      ^
     [SQL: 'INSERT INTO reals (title, url) VALUES (%(title)s, %(url)s) RETURNING reals.id'] [parameters: {'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue', 'title': u'Laravel 5 Confused when implements ShoudlQueue'}]

I have used SQLAlchemy to define the connection between the crawler and PostgreSQL. Here are settings.py, models.py and pipelines.py.

settings.py:

BOT_NAME = 'stack'

SPIDER_MODULES = ['stack.spiders']
NEWSPIDER_MODULE = 'stack.spiders'
ITEM_PIPELINES = ['stack.pipelines.StackPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'stack (+http://www.yourdomain.com)'
DATABASE = {
    'drivername': 'postgres',
    'host': 'localhost',
    'port': '5432',
    'username': '********',
    'password': '********',
    'database': '********'
}

models.py:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings


DeclarativeBase = declarative_base()


def db_connect():
    """ Performs database connections using database settings from settings.py
        Returns sqlalchemy engine instance
    """
    return create_engine(URL(**settings.DATABASE))


def create_reals_table(engine):
    """"""
    DeclarativeBase.metadata.create_all(engine)


class Reals(DeclarativeBase):
    """SQLAlchemy Reals Model"""
    __tablename__ = 'reals'

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    url = Column('url', String, nullable=True)
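
For reference, db_connect() simply expands the DATABASE dict from settings.py into a SQLAlchemy connection URL and hands it to create_engine(). Roughly equivalent, with placeholder credentials and database name (the real values are redacted above; newer SQLAlchemy releases build URLs via URL.create() instead):

from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

# What db_connect() effectively does, spelled out with placeholders:
url = URL(
    drivername='postgres',
    username='someuser', password='somepass',  # placeholders
    host='localhost', port='5432',
    database='somedb',                         # placeholder
)
engine = create_engine(url)
# str(url) -> 'postgres://someuser:somepass@localhost:5432/somedb'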

pipelines.py:

from sqlalchemy.orm import sessionmaker
from models import Reals, db_connect, create_reals_table


class StackPipeline(object):
    """ Stack Exchange pipeline for storing scraped items in the database """
    def __init__(self):
        """ Initialize database connection and sessionmaker """
        engine = db_connect()
        create_reals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save reals in database.
        This method is called for every item pipeline componenet."""
        session = self.Session()
        real = Reals(**item)

        try:
            session.add(real)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item

Schema of the reals table:

realpython=# select * from reals limit 5;
 id | title | link 
----+-------+------
(0 rows)

Can someone help me understand what is going on here and how to decode this error?

5 Comments

  • Could you check that your reals table actually contains all the required columns? Commented Jun 9, 2015 at 4:49
  • Provide the schema of reals; it looks like there is no column url in it. Commented Jun 9, 2015 at 4:54
  • Yes, you are right, there is no column url in it. Commented Jun 9, 2015 at 4:55
  • I edited my question; should I drop and recreate the table? Commented Jun 9, 2015 at 4:57
  • No, just give the url input to the link column. Commented Jun 9, 2015 at 5:27

2 Answers


The error message is actually self-explanatory; you just have to look at the last few lines:

sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "url" of relation "reals" does not exist

So you either need to make the generated INSERT target the existing link column (that is, use link instead of url in your model and item), or rename the column in the table: ALTER TABLE reals RENAME COLUMN link TO url;
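
A minimal sketch of both options, reusing db_connect() from the models.py above (the table and column names are taken from the traceback; treat this as an illustration rather than the only way to run the DDL):

from sqlalchemy import text

from models import db_connect

engine = db_connect()

# Option 1: rename the existing column so it matches what the ORM inserts.
# engine.begin() wraps the statement in a transaction and commits it on exit.
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE reals RENAME COLUMN link TO url"))

# Option 2: leave the table alone and rename the attribute in models.py instead,
#     link = Column('link', String, nullable=True)
# so that SQLAlchemy emits INSERT INTO reals (title, link) VALUES (...).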


I have found the solution.

The problem was the url field definition in my Items.py. I had defined it like this, while in my models I create the table schema with a link column, so I just replaced url with link and the data now loads successfully into PostgreSQL.

from scrapy import Item, Field


class StackItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    url = Field()

New Items.py:

from scrapy import Item, Field


class StackItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    link = Field()

The desired result:

 id |                                 title                                  |                                          link                                          
----+------------------------------------------------------------------------+----------------------------------------------------------------------------------------
  1 | pointcut execution for specific class constructor                      | /questions/30723494/pointcut-execution-for-specific-class-constructor
  2 | PWX-00001 Error opening repository “dtlmsg.txt”. RCs = 268/150/2       | /questions/30723493/pwx-00001-error-opening-repository-dtlmsg-txt-rcs-268-150-2
  3 | Can anyone share a sample c++ program, that reads ASCII stl type file? | /questions/30723491/can-anyone-share-a-sample-c-program-that-reads-ascii-stl-type-file
  4 | Where should I do the core logic code in express js?                   | /questions/30723487/where-should-i-do-the-core-logic-code-in-express-js
  5 | configuring rails application to make ui router work                   | /questions/30723485/configuring-rails-application-to-make-ui-router-work
(5 rows)
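
For the generated INSERT to target the link column, the Reals model presumably has to match as well; a minimal sketch of the adjusted models.py, assuming only the column attribute changes (see also the comment below):

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

DeclarativeBase = declarative_base()


class Reals(DeclarativeBase):
    """SQLAlchemy Reals model, using link instead of url to match the table."""
    __tablename__ = 'reals'

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    link = Column('link', String, nullable=True)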

1 Comment

And yes, I needed to change the url column in the schema to link.
