String processing error: UnicodeDecodeError: 'utf8' codec can't decode

Question

I'm trying to analyze a series of passwords for frequency. My script works with other input files, but it appears that there are some bad characters in my current data set. How can I get around the "bad" data?

import re
import collections
words = re.findall(r'\w+', open('rockyou.txt').read().lower())
a = collections.Counter(words).most_common(50)
for word in a:
    print(word)

I then get the error:

Traceback (most recent call last):
  File "shakecount.py", line 3, in <module>
    words = re.findall('\w+', open('rockyou.txt').read().lower().ASCII)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 5079963: invalid continuation byte

Any ideas?

agf · Accepted Answer · 2012-04-11 21:31:55Z

5

Your code doesn't exactly match your error (I assume an attempt at debugging?), but your text file isn't UTF-8.

You need to manually specify an encoding, with my best guess being latin-1:

words = re.findall(r'\w+', open('rockyou.txt', encoding='latin-1').read().lower())

If you want to continue despite errors, you can pass errors='ignore' or errors='replace' to open.
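As a rough sketch of how these options differ, the snippet below writes a small temporary file containing the byte 0xf1 and reads it back three ways. The sample words and the count_words helper are made up for illustration; they are not from the question's data.

```python
import collections
import os
import re
import tempfile

# Hypothetical sample: "contrase\u00f1a" (with n-tilde) encoded as Latin-1,
# so the accented letter is stored as the single byte 0xf1.
sample = "password\ncontrase\u00f1a\npassword\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

def count_words(path, **open_kwargs):
    """Count lowercase word tokens in a file opened with the given options."""
    with open(path, **open_kwargs) as f:
        return collections.Counter(re.findall(r"\w+", f.read().lower()))

# Strict UTF-8 decoding fails on the 0xf1 byte, as in the question.
try:
    count_words(path, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 maps every byte to a character, so it never raises.
latin1_counts = count_words(path, encoding="latin-1")

# errors='replace' substitutes U+FFFD for the bad byte;
# errors='ignore' silently drops it.
replaced_counts = count_words(path, encoding="utf-8", errors="replace")
ignored_counts = count_words(path, encoding="utf-8", errors="ignore")

os.remove(path)
print(utf8_failed, latin1_counts, replaced_counts, ignored_counts)
```

Note that 'replace' and 'ignore' both alter the affected word (it gets split or shortened), so the frequency counts for such entries will be off. Picking the correct encoding avoids that.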


2 Comments

AlphaTested Over a year ago

The above was helpful but didn't ultimately solve the issue; I ran into more greek errors (I'm new to programming). I ended up opening the word list in a text editor and re-saving it as UTF-8, which then worked. Thanks to agf for your help!

agf Over a year ago

@AlphaTested If you don't know the encoding, another way would be to use chardet to detect it.
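A minimal sketch of that idea, assuming the third-party chardet package is installed (pip install chardet). The guess_encoding helper, its latin-1 fallback, and the sample bytes are illustrative assumptions, not part of the original answer. chardet only returns a statistical guess, so it is worth sanity-checking the result.

```python
import codecs

try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None  # assumption: fall back to a fixed encoding if unavailable

def guess_encoding(raw, fallback="latin-1"):
    """Return a codec name for raw bytes, or the fallback if there is no usable guess."""
    if chardet is None:
        return fallback
    guess = chardet.detect(raw)["encoding"]  # may be None for undetectable input
    if not guess:
        return fallback
    try:
        codecs.lookup(guess)  # make sure Python knows this codec name
    except LookupError:
        return fallback
    return guess

# Hypothetical sample bytes; for a real file, read it with open(path, 'rb').
raw = b"password\ncontrase\xf1a\n" * 50
encoding = guess_encoding(raw)
text = raw.decode(encoding, errors="replace")
print(encoding)
```

For a large word list, passing only the first few hundred kilobytes to chardet.detect is usually much faster than scanning the whole file, at the cost of possibly missing rare bytes later on.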
