python return duplicates in list

Question

How do I find the list of duplicate words in a list of strings? The clean_up function is given:

def clean_up(s):
    """ (str) -> str

    Return a new string based on s in which all letters have been
    converted to lowercase and punctuation characters have been stripped
    from both ends. Inner punctuation is left untouched.

    >>> clean_up('Happy Birthday!!!')
    'happy birthday'
    >>> clean_up("-> It's on your left-hand side.")
    " it's on your left-hand side"
    """

    punctuation = """!"',;:.-?)([]<>*#\n\t\r"""
    result = s.lower().strip(punctuation)
    return result

Here is my duplicate function.

def duplicate(text):
    """ (list of str) -> list of str

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n',
    ... 'James Gosling\n']
    >>> duplicate(text)
    ['james']
    """

    cleaned = ''
    non_duplicate = []
    unique = []
    for word in text:
        cleaned += clean_up(word).replace(",", " ") + " "
        words = cleaned.split()
        for word in words:
            if word in unique:

I am stuck here. I can't use a dictionary or any other technique that keeps a count of the frequency of each word in the text. Please help.

jonrsharpe · Accepted Answer · 2014-03-08 09:33:50Z


You have a problem here:

cleaned += clean_up(word).replace(",", " ") + " "

This line adds the new "word" to a growing string of all words so far. Therefore each time through the for loop, you recheck all words you have seen so far.
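To see the effect, here is a small sketch that runs only the accumulation step from the question and records what words contains after each pass. The snapshots list is added purely for illustration, and the input is shortened to two lines.

```python
def clean_up(s):
    # Same behaviour as the clean_up given in the question
    punctuation = """!"',;:.-?)([]<>*#\n\t\r"""
    return s.lower().strip(punctuation)


text = ['James Fennimore Cooper\n', 'James Gosling\n']

cleaned = ''
snapshots = []  # illustration only: the word list as it stands after each pass
for word in text:
    cleaned += clean_up(word).replace(",", " ") + " "
    snapshots.append(cleaned.split())

for words in snapshots:
    print(words)
# ['james', 'fennimore', 'cooper']
# ['james', 'fennimore', 'cooper', 'james', 'gosling']
```

On the second pass the inner loop would visit the first line's words again, so any membership test there would wrongly treat them as repeats.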

Instead, you need to do:

for phrase in text:
    for word in phrase.split(" "):
        word = clean_up(word)

This means you only process each word once. You may then need to add it to one of your lists, depending on whether it's already in either of them. I suggest you call your lists seen and duplicates, to make it clearer what is going on.
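Putting this together, one possible completion might look like the sketch below. It uses only lists, as the question requires, and the names seen and duplicates follow the suggestion above. Two details are my own assumptions rather than part of the answer: split() is called without an argument so that repeated spaces do not produce empty strings, and tokens that clean_up reduces to an empty string are skipped.

```python
def clean_up(s):
    """Lowercase s and strip punctuation from both ends (as in the question)."""
    punctuation = """!"',;:.-?)([]<>*#\n\t\r"""
    return s.lower().strip(punctuation)


def duplicate(text):
    """Return the words occurring more than once, in order of first repeat."""
    seen = []        # every distinct word encountered so far
    duplicates = []  # words encountered at least twice
    for phrase in text:
        # split() with no argument splits on any run of whitespace
        for word in phrase.split():
            word = clean_up(word)
            if not word:
                continue  # the token consisted of punctuation only
            if word in seen:
                if word not in duplicates:
                    duplicates.append(word)
            else:
                seen.append(word)
    return duplicates


text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'James Gosling\n']
print(duplicate(text))  # ['james']
```

Each word is checked exactly once, and no count of occurrences is kept anywhere: a word is either in seen, in both lists, or in neither.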

edited Mar 8, 2014 at 9:33

answered Mar 8, 2014 at 8:19


1 Comment

Jon Clements Over a year ago

Good points. To be fair to the OP though - clean_up only removes leading/trailing commas (and other punctuation)... It seems that re.findall(r'\w+', text) would be a more appropriate tokenizer, but that's up to the OP. The OP may also wish to consider using sets if possible, rather than lists.
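As a rough sketch of what this comment describes, assuming sets are permitted by the assignment: the function name duplicate_words is made up for this example, and re.findall is applied to each string in turn because text is a list. Note that \w+ tokenizes differently from clean_up, since apostrophes and hyphens split a word.

```python
import re


def duplicate_words(text):
    """Return a sorted list of the words that appear more than once in text."""
    seen = set()
    duplicates = set()
    for phrase in text:
        # r"\w+" matches runs of letters, digits and underscores,
        # so "it's" is tokenized as "it" and "s"
        for word in re.findall(r"\w+", phrase.lower()):
            if word in seen:
                duplicates.add(word)
            else:
                seen.add(word)
    return sorted(duplicates)


text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'James Gosling\n']
print(duplicate_words(text))  # ['james']
```

Membership tests on a set take constant time on average, whereas `word in some_list` scans the whole list, which may matter for long texts. The result is sorted because sets have no defined order.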
