9

I'm new to NLTK and I'm getting the error below. I've searched for encoding/decoding issues, and for UnicodeDecodeError specifically, but this error seems to originate inside the NLTK source code.

Here's the error:

Traceback (most recent call last):
  File "A:\Python\Projects\Test\main.py", line 2, in <module>
    print(pos_tag(word_tokenize("John's big idea isn't all that bad.")))
  File "A:\Python\Python\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
    tagger = load(_POS_TAGGER)
  File "A:\Python\Python\lib\site-packages\nltk\data.py", line 779, in load
    resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

How do I go about fixing this error?

Here's what causes the error:

from nltk import pos_tag, word_tokenize
print(pos_tag(word_tokenize("John's big idea isn't all that bad.")))
4
  • It's the pos_tag function causing the error. Commented Aug 25, 2014 at 20:46
  • There's nothing in the code you show here that would generate the error. Print the repr of the string you're passing. Commented Aug 25, 2014 at 21:08
  • It returns the string, but with ' characters surrounding it. Commented Aug 25, 2014 at 21:13
  • @MarkRansom I don't know what you mean; the function pos_tag is causing the error. I think the encoding error is raised by the pickle.load call. I'm not sure what to do. Commented Aug 25, 2014 at 21:22

4 Answers

5

Try this (NLTK 3.0.1 with Python 2.7.x):

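import io
# txtFile is the path to your input text file. Text mode already handles
# universal newlines, so the deprecated 'U' flag is dropped.
f = io.open(txtFile, 'r', encoding='utf-8')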
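For context, a minimal sketch of what `io.open` with an explicit `encoding` does: it decodes the file's bytes to Unicode text as they are read, on both Python 2.7 and Python 3. The temporary file and sample sentence are made up for illustration, and this applies to text files you open yourself.

```python
import io
import os
import tempfile

# Hypothetical sample file containing non-ASCII characters, written as UTF-8 bytes.
sample = u"John\u2019s big idea isn\u2019t all that bad."
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-8"))

# io.open decodes the bytes to Unicode text while reading.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(text == sample)
```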

2 Comments

Worked like a charm! I use NLTK 3.1 and Python 2.7.x.
This is great! Can you also explain why using io solves the problem?
4

I had the same problem as you. I use Python 3.4 on Windows 7.

I had installed "nltk-3.0.0.win32.exe" (from here). When I installed "nltk-3.0a4.win32.exe" (from here) instead, my problem with nltk.pos_tag was solved. Give it a try.

EDIT: If the second link doesn't work, you can look here.

1 Comment

The second link seems to be broken. Do you have any alternate links?
-2

Duplicate: NLTK 3 POS_TAG throws UnicodeDecodeError

Long story short: NLTK 2 isn't compatible with Python 3. You have to use NLTK 3, which sounds a bit experimental at this point.
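One plausible reading of the traceback in the question, assuming the tagger pickle was written by Python 2 and is being loaded under Python 3: Python 3's `pickle` decodes Python 2 byte strings as ASCII by default, so any byte above 0x7F fails. The hand-built pickle bytes below are illustrative and are not taken from NLTK's data.

```python
import pickle

# A protocol-1 pickle of the Python 2 byte string '\xcb', built by hand:
# 'U' = SHORT_BINSTRING, '\x01' = length 1, '\xcb' = payload, '.' = STOP.
py2_pickle = b"U\x01\xcb."

try:
    pickle.loads(py2_pickle)  # default encoding is ASCII
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Telling pickle how to decode legacy 8-bit strings avoids the error.
decoded = pickle.loads(py2_pickle, encoding="latin-1")

print(error_message)
print(repr(decoded))
```

In practice the `pickle.load` call sits inside `nltk/data.py` (per the traceback), so the practical fix is likely an NLTK release and data files that match your Python version.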

1 Comment

I am using NLTK 3 and Python 3.4 and still get this error.
-2

Try using the module "textclean":

pip install textclean

Python code

from nltk import pos_tag, word_tokenize
from textclean.textclean import textclean

text = textclean.clean("John's big idea isn't all that bad.")
print(pos_tag(word_tokenize(text)))

1 Comment

This module sounds like a horrible idea. It's particularly bad here, because the error occurs while decoding a pickle, a structured data format that you will irreparably destroy if you try to blindly "clean" it.
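To illustrate the point above, a small sketch (Python 3) of what happens if non-ASCII bytes are stripped from a pickle before loading it. The dictionary is a made-up stand-in, not NLTK's tagger.

```python
import pickle

# Hypothetical stand-in for a pickled model.
original = {"word": "idea", "tag": "NN"}
data = pickle.dumps(original, protocol=2)

# "Clean" the raw bytes by dropping everything outside the ASCII range.
cleaned = bytes(b for b in data if b < 128)

try:
    pickle.loads(cleaned)
    survived = True
except Exception:
    # The stream no longer starts with a valid opcode, so loading fails.
    survived = False

print(survived)
```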
