
I retrieved a number of text records from my PostgreSQL database and intend to preprocess these text documents before analyzing them.

I want to tokenize the documents, but I ran into a problem during tokenization:

    # ... some other regex replacements above ...
    # toTokens is the text string being tokenized
    toTokens = self.regexClitics1.sub(" \\1", toTokens)
    toTokens = self.regexClitics2.sub(" \\1 \\2", toTokens)

    toTokens = str.strip(toTokens)   # this line raises the TypeError

The error is: TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'. I'm curious why this error occurs when the encoding of the database is UTF-8.
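
A minimal reproduction (Python 2), independent of the database:

    s = u"  some text  "
    str.strip(s)   # TypeError: descriptor 'strip' requires a 'str'
                   # object but received a 'unicode'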

1 Answer


Why don't you use toTokens.strip()? There is no need to call it through str.

There are two string types in Python 2, str and unicode. See this for an explanation.
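
For example (Python 2), a minimal sketch assuming toTokens is the unicode string coming back from the database:

    toTokens = u"  don't stop  "
    toTokens = toTokens.strip()   # method lookup happens on the object's
                                  # own type, so this works for both str
                                  # and unicode values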


3 Comments

+1. A shorter explanation can be found on StackOverflow: stackoverflow.com/questions/4545661/… (shameless plug). :)
Does that mean that the strings I get from my queries are unicode? Why is that so?
@amateur It seems so. It's strange, because AFAIK psycopg returns str objects unless instructed to do otherwise, but I can't know without more information about your setup.
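
For what it's worth, one way a setup ends up returning unicode is by registering psycopg2's unicode typecasters. A minimal sketch assuming psycopg2, with a placeholder connection string, table, and column:

    import psycopg2
    import psycopg2.extensions

    # Registering these typecasters makes psycopg2 decode text columns
    # to unicode instead of str (Python 2).
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

    conn = psycopg2.connect("dbname=mydb user=me")   # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT doc FROM documents")          # placeholder query
    rows = cur.fetchall()   # text values now come back as unicode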
