I need to remove every duplicate item from a list of more than 100 million things. I tried converting the list to a set and back again with set(), but that is far too slow and memory-intensive. Are there any other effective ways to do this?

  • You are probably not going to find a faster method. Commented Jan 9, 2023 at 19:52
  • 100M+ "things": how many GB is that? Do you have enough RAM? Swapping may slow everything down, and if that is what is happening, it should probably be addressed first. Commented Jan 9, 2023 at 20:23

1 Answer

If you're willing to sort your list, this is fairly trivial: sort it first, then keep only the first item of each run of equal items. This is the same approach as sort | uniq in the shell, which can be quite memory-efficient because it can sort on disk. Python's built-in sort, of course, works entirely in memory.

From the Itertools Recipes:

import operator
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBcCAD', str.lower) --> A B c A D
    # groupby yields (key, group) pairs; take the first element of each group.
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))
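As a rough illustration of how the sort step and the recipe fit together, here is a minimal sketch. The name unique_sorted and the sample data are made up for this example, and it assumes the items are mutually comparable and that losing the original order is acceptable.

```python
from itertools import groupby


def unique_sorted(items):
    """Yield each distinct value once, in sorted order.

    Sorts the list in place, so no second full-size list is built
    just for the sort step.
    """
    items.sort()
    # After sorting, equal items are adjacent, so each groupby group
    # is one run of duplicates; keep only its first element.
    for _key, group in groupby(items):
        yield next(group)


# Hypothetical sample data standing in for the 100M-item list.
data = [3, 1, 2, 3, 1, 2, 2]
result = list(unique_sorted(data))
```

Because unique_sorted is a generator, the deduplicated values can also be consumed one at a time (written to a file, for instance) instead of being collected into another list.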

Is there a reason you care if this is sluggish? If you need to do this operation often then something is wrong in the way you are handling data.
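If the data really does not fit in RAM, the disk-based idea behind sort | uniq can be approximated in Python. The following is only a sketch under stated assumptions: the items are strings with no line breaks in them, the names external_unique, _spill, and chunk_size are invented for this example, and the chunk size is a guess that would need tuning.

```python
import heapq
import tempfile
from itertools import groupby, islice


def _spill(sorted_chunk):
    """Write one sorted chunk to a temporary file, one item per line."""
    f = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    f.writelines(item + "\n" for item in sorted_chunk)
    f.seek(0)
    return f


def external_unique(lines, chunk_size=1_000_000):
    """Yield distinct strings in sorted order.

    Assumes items are str and contain no line breaks. Only one chunk
    is held in memory at a time; sorted runs live in temp files.
    """
    it = iter(lines)
    runs = []
    try:
        while True:
            # Deduplicate and sort each chunk in memory, then spill it.
            chunk = sorted(set(islice(it, chunk_size)))
            if not chunk:
                break
            runs.append(_spill(chunk))
        # Lazily merge the sorted runs; duplicates across runs become adjacent.
        streams = [(line.rstrip("\n") for line in f) for f in runs]
        for _key, group in groupby(heapq.merge(*streams)):
            yield next(group)
    finally:
        for f in runs:
            f.close()
```

Each sorted run keeps one file open during the merge, so a very small chunk_size on a very large input could run into the operating system's open-file limit.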

7 Comments

Ugh, itertools soup. It's easier to read like this: for _k, g in groupby(iterable, key): yield next(g)
sorting is almost certainly less efficient than using a set
@mozway I suppose since sets involve O(n) space in addition to O(n) time, memory access could be OP's bottleneck. Meanwhile sorting is constant space and O(n log n) time, ofc.
@wjandrea sorting is O(N) space in Python
@juanpa Oh, huh, I didn't realize that. I think I was thinking of a different sorting algo, sorry, so I deleted my comment as soon as I started reading about Timsort.