
I am trying to write a document-cleaning function that removes stopwords, applies POS tagging, and stems the remaining words. Below is my code:

import re
import nltk
from nltk.corpus import stopwords

def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    # Convert to lower case and split into separate tokens (tags, words, @mentions, #hashtags)
    tokens = re.findall(r"<a.*?/a>|<[^\>]*>|[\w'@#]+", doc.lower(), flags=re.UNICODE | re.LOCALE)
    # Remove stopwords and tokens of two characters or fewer
    clean = [token for token in tokens if token not in stopset and len(token) > 2]
    # POS tagging
    pos = nltk.pos_tag(clean)
    # Stemming (this is the line that raises the error below)
    final = [stemmer.stem(word) for word in pos]
    return final

I got this error:

Traceback (most recent call last):
  File "C:\Users\USer\Desktop\tutorial\main.py", line 38, in <module>
    final = cleanDoc(doc)
  File "C:\Users\USer\Desktop\tutorial\main.py", line 30, in cleanDoc
    final = [stemmer.stem(word) for word in pos]
  File "C:\Python27\lib\site-packages\nltk\stem\porter.py", line 556, in stem
    stem = self.stem_word(word.lower(), 0, len(word) - 1)
AttributeError: 'tuple' object has no attribute 'lower'
    Did you try any debugging to find out why word is a tuple and not a string? Or look at the documentation for nltk.pos_tag() to see what it returns instead of a list of strings? Commented Apr 17, 2013 at 13:28

2 Answers


In this line:

pos = nltk.pos_tag(clean)

nltk.pos_tag() returns a list of tuples (word, tag), not strings. Use this to get the words:

pos = nltk.pos_tag(clean)
final = [stemmer.stem(tagged_word[0]) for tagged_word in pos]
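
To see why the original loop fails, it can help to look at the shape of the data. The sketch below uses a hand-written list in the (word, tag) form that nltk.pos_tag() returns, so it runs without NLTK installed. The sample tags and the toy_stem function are illustrative stand-ins only; they are not real tagger output and not the Porter algorithm.

```python
# Hand-written sample in the (word, tag) shape that nltk.pos_tag() returns.
# The tags are illustrative, not real tagger output.
tagged = [("running", "VBG"), ("dogs", "NNS"), ("quickly", "RB")]


def toy_stem(word):
    """Hypothetical stand-in for stemmer.stem(); strips a few common suffixes."""
    for suffix in ("ing", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Passing the whole tuple reproduces the kind of error in the question:
# a tuple has no string methods such as .lower() or .endswith().
try:
    [toy_stem(item) for item in tagged]
except AttributeError as exc:
    error_message = str(exc)

# Indexing each tuple with [0] hands the stemmer a plain string.
stems = [toy_stem(item[0]) for item in tagged]
```

With the real stemmer, the same indexing applies: each element of pos is a tuple, and element [0] is the word.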

nltk.pos_tag returns a list of (word, tag) tuples, not a list of strings. Perhaps you want to unpack each tuple and stem only the word:

final = [stemmer.stem(word) for word, _ in pos]
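
If the tags are still needed after stemming, the same tuple unpacking can keep them paired with each stem instead of discarding them. This is only a sketch: stem_tagged is a hypothetical helper, the tagged list is hand-written sample data, and the lambda is a placeholder where a real stemmer's stem method (for example, the one on nltk.PorterStemmer) would be passed.

```python
# Hand-written (word, tag) pairs standing in for nltk.pos_tag() output.
tagged = [("running", "VBG"), ("dogs", "NNS")]


def stem_tagged(tagged_words, stem):
    """Hypothetical helper: apply `stem` to each word, keeping its tag."""
    return [(stem(word), tag) for word, tag in tagged_words]


# Placeholder stemmer that only strips trailing "s" characters;
# with NLTK you would pass stemmer.stem here instead.
result = stem_tagged(tagged, lambda w: w.rstrip("s"))
```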
