I have the following data:

[['The',
  'Fulton',
  'County',
  'Grand',
  'Jury',
  'said',
  'Friday',
  'an',
  'investigation',
  'of',
  "Atlanta's",
  'recent',
  'primary',
  'election',
  'produced',
  '``',
  'no',
  'evidence',
  "''",
  'that',
  'any',
  'irregularities',
  'took',
  'place',
  '.'],
 ['The',
  'jury',
  'further',
  'said',
  'in',
  'term-end',
  'presentments',
  'that',
  'the',
  'City',
  'Executive',
  'Committee',
  ',',
  'which',
  'had',
  'over-all',
  'charge',
  'of',
  'the',
  'election',
  ',',
  '``',
  'deserves',
  'the',
  'praise',
  'and',
  'thanks',
  'of',
  'the',
  'City',
  'of',
  'Atlanta',
  "''",
  'for',
  'the',
  'manner',
  'in',
  'which',
  'the',
  'election',
  'was',
  'conducted',
  '.']]

So I have a list that consists of 2 other lists (in my real data, the outer list holds 50000 lists). I want to remove all punctuation and stopwords such as "the", "a", "of", etc.

Here is what I have coded:

import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

punct = list(string.punctuation)
punct.append("``")
punct.append("''")
stops = set(stopwords.words("english"))

res = [[word.lower() for word in sentence if word not in punct or word.lower() not in stops] for sentence in dataset]

But it returns the same list of lists that I started with. What is wrong with my code?

3 Answers

2

You should use and instead of or:

res = [[word.lower() for word in sentence if word not in punct and word.lower() not in stops] for sentence in dataset]

Otherwise every element is kept, because each one is missing from at least one of stops or punct.
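
To make the difference concrete, here is a minimal sketch that runs both conditions on a few tokens. The small stops set is a hand-written stand-in (an assumption for illustration), not NLTK's actual stopword list, so it runs without any download.

```python
import string

# Stand-in stopword set, assumed for illustration (the question uses NLTK's list)
stops = {"the", "of", "an"}
punct = set(string.punctuation) | {"``", "''"}

sentence = ["The", "jury", "of", "Atlanta", ".", "``"]

# With `or`, a token is dropped only if it is in BOTH collections, which never happens
kept_or = [w.lower() for w in sentence if w not in punct or w.lower() not in stops]

# With `and`, a token is dropped if it is in EITHER collection
kept_and = [w.lower() for w in sentence if w not in punct and w.lower() not in stops]

print(kept_or)   # ['the', 'jury', 'of', 'atlanta', '.', '``']
print(kept_and)  # ['jury', 'atlanta']
```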

2

Since punct and stops do not overlap, every word is missing from at least one of them (or from both), so the or condition is always true. You want to keep only the words that are in neither.
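
Put differently, "in neither" is the negation of "in either", which is De Morgan's law. A short sketch of the two equivalent spellings follows; the function names keep and keep_de_morgan and the tiny stops set are hypothetical, chosen only for this example.

```python
import string

# Stand-in stopword set, assumed for illustration (not NLTK's actual list)
stops = {"the", "of", "and"}
punct = set(string.punctuation) | {"``", "''"}


def keep(word):
    """Keep a word only if it is in neither collection."""
    return word not in punct and word.lower() not in stops


def keep_de_morgan(word):
    """Same test, written as the negation of 'is in either collection'."""
    return not (word in punct or word.lower() in stops)


tokens = ["The", "praise", "and", "thanks", ",", "''"]

# Both spellings agree on every token
assert all(keep(t) == keep_de_morgan(t) for t in tokens)

print([t for t in tokens if keep(t)])  # ['praise', 'thanks']
```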

0

Assuming it is OK to update stops in place, you can merge punct into it and test each word only once. The second variant in the code also avoids writing an explicit 2-level comprehension:

import string
import nltk
from nltk.corpus import stopwords


dataset = [
  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
   'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
   'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
   'took', 'place', '.'],
  ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments',
   'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had',
   'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves',
   'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta',
   "''", 'for', 'the', 'manner',
   'in', 'which', 'the', 'election', 'was', 'conducted', '.']
  ]

nltk.download('stopwords')

punct = list(string.punctuation)
punct.append("``")
punct.append("''")

stops = set(stopwords.words("english"))

# Union of punct and stops
stops.update(punct)
res1 = [[word for word in sentence if word.lower() not in stops]
        for sentence in dataset]

# Alternative solution that avoids an explicit 2-level list comprehension
def filter_the(sentence, stops):
    return [word for word in sentence if word.lower() not in stops]


res2 = [filter_the(sentence, stops) for sentence in dataset]


print(res1 == res2)
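
Note that res1 and res2 keep the original casing, whereas the code in the question lowercases every kept word. If lowercase output is wanted, a variant along these lines should work. It is only a sketch: filter_lower is a hypothetical helper name, and the small stops set stands in for NLTK's list so the example runs without a download.

```python
import string

# Hypothetical stand-in for set(stopwords.words("english")); not NLTK's actual list
stops = {"the", "of", "in", "which", "was"}
stops.update(string.punctuation)   # adds each punctuation character to the set
stops.update(["``", "''"])         # the two quote tokens used in the dataset


def filter_lower(sentence, stops):
    """Lowercase each token once, then drop stopwords and punctuation."""
    lowered = (word.lower() for word in sentence)
    return [word for word in lowered if word not in stops]


sentence = ["The", "manner", "in", "which", "the", "election", "was", "conducted", "."]
print(filter_lower(sentence, stops))  # ['manner', 'election', 'conducted']
```

Lowercasing once per token keeps the membership test and the output consistent, and the single set lookup stays cheap for an outer list with 50000 sentences.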
