
I retrieved a number of text records from my PostgreSQL database and intend to preprocess these text documents before analyzing them.

I want to tokenize the documents, but I ran into a problem during tokenization:

    # ... some other regex replacements above ...
    # toTokens is the text string being tokenized
    toTokens = self.regexClitics1.sub(" \\1", toTokens)
    toTokens = self.regexClitics2.sub(" \\1 \\2", toTokens)

    toTokens = str.strip(toTokens)   # this line raises the TypeError

The error is: TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'. I'm curious why this error occurs when the encoding of the database is UTF-8.
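
A minimal reproduction (Python 2), independent of the database:

    s = u"  some text  "
    str.strip(s)   # TypeError: descriptor 'strip' requires a 'str'
                   # object but received a 'unicode'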

1 Answer


Why don't you use toTokens.strip()? There is no need to call it through str.

There are two string types in Python 2, str and unicode. See this for an explanation.
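
For example (Python 2), a minimal sketch assuming toTokens is the unicode string coming back from the database:

    toTokens = u"  don't stop  "
    toTokens = toTokens.strip()   # method lookup happens on the object's
                                  # own type, so this works for both str
                                  # and unicode values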


3 Comments

+1. A shorter explanation can be found on StackOverflow: stackoverflow.com/questions/4545661/… (shameless plug). :)
Does that mean that the strings I get from my queries are unicode? Why is that so?
@amateur It seems so. It's strange, because AFAIK psycopg returns str objects unless instructed to do otherwise, but I can't know without more information about your setup.
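
For what it's worth, one way a setup ends up returning unicode is by registering psycopg2's unicode typecasters. A minimal sketch assuming psycopg2, with a placeholder connection string, table, and column:

    import psycopg2
    import psycopg2.extensions

    # Registering these typecasters makes psycopg2 decode text columns
    # to unicode instead of str (Python 2).
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

    conn = psycopg2.connect("dbname=mydb user=me")   # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT doc FROM documents")          # placeholder query
    rows = cur.fetchall()   # text values now come back as unicode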
