I need to remove every duplicate item from a list of more than 100 million things. I tried converting the list to a set and back again with set(), but it is far too slow and memory-intensive. Are there any other effective solutions?
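Here is roughly what I tried (items is just a placeholder for my real list):

items = list(set(items))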
You are probably not going to find a faster method. – juanpa.arrivillaga, Jan 9, 2023
100M+ "things": how many GB is that? Do you have enough RAM? Swapping may slow everything down, and if that turns out to be the case, it should probably be addressed first. – VPfB, Jan 9, 2023
1 Answer
If you're willing to sort your list, then this is fairly trivial. Sort it first, then take the unique items. This is the same approach as sort | uniq in shell, and it can be quite memory-efficient if the sort is done on disk (Python's built-in sort, of course, is in-memory).
import operator
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBcCAD', str.lower) --> A B c A D
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))
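For example (a minimal sketch, assuming your data is an in-memory list named items):

data = sorted(items)                   # Timsort, O(n log n)
deduped = list(unique_justseen(data))  # adjacent duplicates collapse to one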
Is there a reason you care that this is sluggish? If you need to do this operation often, then something is wrong with how you are handling your data.
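If the list does not fit comfortably in RAM, the on-disk variant is an external sort: sort chunks in memory, spill each sorted run to a temporary file, then merge the runs with heapq.merge. A minimal sketch, not a drop-in implementation; the chunk size and the assumption that items are newline-terminated strings are illustrative:

import heapq
import tempfile

def external_sorted(lines, chunk_size=1_000_000):
    # Yield the input in sorted order, spilling sorted runs to temp files.
    # Assumes each item is a newline-terminated string so the runs
    # round-trip cleanly through text files.
    runs = []
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    return heapq.merge(*runs)

def _spill(sorted_chunk):
    # Write one sorted run to a temp file and rewind it for merging.
    run = tempfile.TemporaryFile(mode="w+t")
    run.writelines(sorted_chunk)
    run.seek(0)
    return run

# Feeding the merged stream through the recipe above deduplicates it
# without ever holding the whole dataset in memory, e.g.:
# unique = unique_justseen(external_sorted(open("data.txt")))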
7 Comments
wjandrea
Ugh, itertools soup. It's easier to read like this:
for _k, g in groupby(iterable, key): yield next(g)
mozway
sorting is almost certainly less efficient than using a set
wjandrea
@mozway I suppose since sets involve O(n) space in addition to O(n) time, memory access could be OP's bottleneck. Meanwhile sorting is constant space and O(n log n) time, ofc.
juanpa.arrivillaga
@wjandrea sorting is O(N) space in Python
wjandrea
@juanpa Oh, huh, I didn't realize that. I think I was thinking of a different sorting algo, sorry, so I deleted my comment as soon as I started reading about Timsort.