
I have a Scrapy spider that scrapes a website and writes the results to MySQL:

import MySQLdb.cursors
from twisted.enterprise import adbapi

def __init__(self,stats):
    self.dbpool = adbapi.ConnectionPool(<dbnam>,host=<host>,user=<user>,port=<port>,passwd=<pwd>, db=<dbname>, cursorclass=MySQLdb.cursors.DictCursor, charset='utf8', use_unicode=True)

def process_item(self, item, spider):
    query = self.dbpool.runInteraction(self._conditional_insert, item)
    query.addErrback(self.handle_error)

The spider extracts a list of numbers from a table:

item['numbers'] = sites.xpath('//*[@id="numbers-0"]/tbody/tr/td/text()').extract()

I'm scraping the following content: 10″ 11″ 12″ etc. My code returns the following:

'numbers': [u'10\u2033', u'11\u2033', u'12\u2033'],

Inserting this into a MySQL database raises an error, which I'm guessing is due to a Unicode issue:

tx.execute("""INSERT INTO numbers ('{0}')""".format(", ".join(item['numbers'])))

Could you please help me get the insert to succeed? Better still, how can I remove the special character '\u2033' from the list?

Thanks in advance!

  • Are you using Python 2 or 3? Commented Mar 12, 2016 at 21:48
  • 2.7.11 Thanks Bernard for looking into this! Commented Mar 12, 2016 at 22:00
  • No worries, would you mind trying to use PyMySQL as opposed to the MySQL connector? Commented Mar 12, 2016 at 22:01
  • No problem at all with moving from MYSQL connector. I am new to Python and Scrapy. Just need to figure out how to use PyMySQL Commented Mar 12, 2016 at 22:04
  • Do exactly the same as you are with the connector, just put pymysql in place. And to install it run sudo pip install PyMySQL. Commented Mar 12, 2016 at 22:06

1 Answer


You're probably getting a UnicodeEncodeError because you are formatting unicode strings containing non-ASCII characters into a byte string, which makes Python 2 try to encode them as ASCII.
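As a rough illustration of that failure mode, the sketch below performs the ASCII encoding explicitly, which is approximately what Python 2 does implicitly when a byte-string template is formatted with unicode arguments. The variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# Sample data taken from the question: numbers followed by U+2033 (double prime).
numbers = [u'10\u2033', u'11\u2033', u'12\u2033']
joined = u", ".join(numbers)

error = None
try:
    # The double prime has no ASCII representation, so this raises.
    joined.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# Encoding with a codec that covers the character (UTF-8 here) succeeds.
encoded = joined.encode('utf-8')
```

Here `error` ends up holding the UnicodeEncodeError, while the UTF-8 encoding goes through without complaint.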

To fix that, make sure your query string has a u prefix:

tx.execute(u"""INSERT INTO numbers ('{0}')""".format(", ".join(item['numbers'])))
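A possibly more robust alternative is to let the database driver bind the values instead of formatting them into the SQL text. The sketch below is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in so that it runs anywhere, the column name `value` and the helper `insert_numbers` are hypothetical, and MySQLdb expects `%s` placeholders where sqlite3 expects `?`.

```python
# -*- coding: utf-8 -*-
import sqlite3


def insert_numbers(cursor, numbers):
    """Insert one row per scraped value using bound parameters.

    `value` is a hypothetical column name; adapt it to the real schema.
    With MySQLdb the placeholder would be %s rather than ?.
    """
    cursor.executemany(
        u"INSERT INTO numbers (value) VALUES (?)",
        [(n,) for n in numbers],
    )


# In-memory database standing in for the MySQL connection.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(u"CREATE TABLE numbers (value TEXT)")

insert_numbers(cur, [u'10\u2033', u'11\u2033', u'12\u2033'])
conn.commit()

# Read the rows back in insertion order.
rows = [r[0] for r in cur.execute(u"SELECT value FROM numbers ORDER BY rowid")]
```

Binding parameters this way keeps the unicode values out of the query string entirely, so no manual quoting or encoding of the values is needed.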

If you really want to get rid of those double-prime characters, I suppose you could just replace them with double quotes:

item['numbers'] = [s.replace(u'\u2033', u'"') for s in item['numbers']]
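If the goal is to store plain numbers, a variation on the same idea is to drop the character altogether and convert the remainder to integers. This is a minimal sketch that assumes every value is a whole number followed by the double prime, as in the sample data from the question.

```python
# -*- coding: utf-8 -*-
# Sample data from the question.
numbers = [u'10\u2033', u'11\u2033', u'12\u2033']

# Remove the double prime (U+2033) instead of replacing it.
stripped = [s.replace(u'\u2033', u'') for s in numbers]

# Convert to integers; this raises ValueError if anything else is left over.
as_ints = [int(s) for s in stripped]
```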

But I think it's better to ensure your code can handle whatever unicode characters are thrown at it - which is to say, you should always use unicode strings within your program.
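One common way to follow that advice is to decode bytes to unicode as soon as they enter the program and to encode only at the boundary where they leave it. The helper name `to_text` below is hypothetical, and UTF-8 is assumed as the input encoding.

```python
# -*- coding: utf-8 -*-


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes as they might arrive from a file or socket (UTF-8 encoded 10 + U+2033).
raw = b'10\xe2\x80\xb3'
text = to_text(raw)

# Values that are already unicode pass through unchanged.
same = to_text(u'11\u2033')
```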


2 Comments

I can't upvote your answer because I'm new to Stack Overflow. Will be back once I earn some credibility! :)
@user6055239. Thanks :) NB: you can always accept answers, which will also earn you a little bit of rep.
