I'm trying to analyze a series of passwords for frequency. My script works with other input files, but it appears that there are some bad characters in my current data set. How can I get around the "bad" data?

import re
import collections
words = re.findall(r'\w+', open('rockyou.txt').read().lower())
a = collections.Counter(words).most_common(50)
for word in a:
    print(word)

I then get the error:

Traceback (most recent call last):
  File "shakecount.py", line 3, in <module>
    words = re.findall('\w+', open('rockyou.txt').read().lower().ASCII)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 5079963: invalid continuation byte

Any ideas?

1 Answer

The code you posted doesn't exactly match the line in your traceback (I assume the .ASCII was an attempt at debugging?), but either way the underlying problem is that your text file isn't UTF-8.

You need to manually specify an encoding, with my best guess being latin-1:

words = re.findall(r'\w+', open('rockyou.txt', encoding='latin-1').read().lower())

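If you want to continue despite decoding errors, you can pass errors='ignore' or errors='replace' to open. Note that 'ignore' silently drops the undecodable bytes and 'replace' substitutes U+FFFD for them, so either one can change the words being counted.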
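To make the difference between these options concrete, here is a minimal sketch. It does not use rockyou.txt; the sample bytes, the temporary file, and the helper name count_words are assumptions made up for illustration. The sample contains the byte 0xf1 (ñ in Latin-1), which is the same byte reported in the traceback.

```python
import collections
import os
import re
import tempfile

# Hypothetical sample data: "contraseña" encoded as Latin-1, so the ñ is the
# single byte 0xf1. Followed by an ASCII letter, that byte is invalid UTF-8.
sample = b"password\ncontrase\xf1a\nPassword\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name


def count_words(path, encoding="utf-8", errors="strict"):
    """Return a Counter of lowercased word tokens read from path."""
    with open(path, encoding=encoding, errors=errors) as handle:
        return collections.Counter(re.findall(r"\w+", handle.read().lower()))


# Strict UTF-8 decoding fails on the 0xf1 byte.
try:
    count_words(path)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 maps every byte to a character, so the word survives intact.
latin1_counts = count_words(path, encoding="latin-1")
# errors="ignore" drops the bad byte; errors="replace" inserts U+FFFD,
# which \w does not match, so the word is split in two.
ignored_counts = count_words(path, errors="ignore")
replaced_counts = count_words(path, errors="replace")

os.remove(path)

for word, count in latin1_counts.most_common(50):
    print(word, count)
```

With 'ignore' and 'replace' the script runs to completion, but the affected password is counted in a mangled form, which is worth keeping in mind for a frequency analysis.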

2 Comments

The above was helpful but didn't ultimately solve the issue; I ran into more errors that were Greek to me (I'm new to programming). I ended up opening the word list in a text editor and re-saving it as UTF-8, which then worked. Thanks to agf for your help!
@AlphaTested If you don't know the encoding, another way would be to use chardet to detect it.
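chardet is a third-party package, so it is not shown here. As a rough standard-library alternative (a different technique from chardet's statistical detection), one could try a short list of candidate encodings in order and then re-save the file as UTF-8 in Python rather than in a text editor. The candidate list, the function names decode_with_fallback and resave_as_utf8, and the sample file below are all assumptions for illustration.

```python
import os
import tempfile

# Assumed candidate list. latin-1 goes last because it maps every byte to a
# character and therefore never raises, even when it is the wrong guess.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def resave_as_utf8(source_path, target_path):
    """Rewrite source_path as UTF-8 at target_path; return the encoding used."""
    with open(source_path, "rb") as source:
        text, encoding = decode_with_fallback(source.read())
    # newline="" writes line endings exactly as they were read.
    with open(target_path, "w", encoding="utf-8", newline="") as target:
        target.write(text)
    return encoding


# Hypothetical sample file containing the 0xf1 byte from the traceback.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "sample_legacy.txt")
target_path = os.path.join(workdir, "sample_utf8.txt")
with open(source_path, "wb") as handle:
    handle.write(b"contrase\xf1a\n")

guessed = resave_as_utf8(source_path, target_path)
with open(target_path, encoding="utf-8") as handle:
    converted = handle.read()
print(guessed, converted.strip())
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, so the result is worth spot-checking by eye.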
