I need to remove every duplicate item from a list of more than 100 million things. I tried converting the list to a set and back again with set(), but it is far too slow and memory-intensive. Are there any other effective solutions?
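Here is roughly what I tried (items is just a placeholder for my real list):

items = list(set(items))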
You are probably not going to find a faster method. – juanpa.arrivillaga, Jan 9, 2023
100M+ "things": how many GB is that? Do you have enough RAM? Swapping may slow everything down, and if that turns out to be the case, it should probably be addressed first. – VPfB, Jan 9, 2023
1 Answer
If you're willing to sort your list, then this is fairly trivial. Sort it first, then take the unique items. This is the same approach as sort | uniq in shell, and it can be quite memory-efficient if the sort is done on disk (Python's built-in sort, of course, is in-memory).
import operator
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBcCAD', str.lower) --> A B c A D
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))
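For example (a minimal sketch, assuming your data is an in-memory list named items):

data = sorted(items)                   # Timsort, O(n log n)
deduped = list(unique_justseen(data))  # adjacent duplicates collapse to one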
Is there a reason you care that this is sluggish? If you need to do this operation often, then something is wrong with how you are handling your data.
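If the list does not fit comfortably in RAM, the on-disk variant is an external sort: sort chunks in memory, spill each sorted run to a temporary file, then merge the runs with heapq.merge. A minimal sketch, not a drop-in implementation; the chunk size and the assumption that items are newline-terminated strings are illustrative:

import heapq
import tempfile

def external_sorted(lines, chunk_size=1_000_000):
    # Yield the input in sorted order, spilling sorted runs to temp files.
    # Assumes each item is a newline-terminated string so the runs
    # round-trip cleanly through text files.
    runs = []
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    return heapq.merge(*runs)

def _spill(sorted_chunk):
    # Write one sorted run to a temp file and rewind it for merging.
    run = tempfile.TemporaryFile(mode="w+t")
    run.writelines(sorted_chunk)
    run.seek(0)
    return run

# Feeding the merged stream through the recipe above deduplicates it
# without ever holding the whole dataset in memory, e.g.:
# unique = unique_justseen(external_sorted(open("data.txt")))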
7 Comments
wjandrea
Ugh, itertools soup. It's easier to read like this:
for _k, g in groupby(iterable, key): yield next(g)
mozway
sorting is almost certainly less efficient than using a set
wjandrea
@mozway I suppose since sets involve O(n) space in addition to O(n) time, memory access could be OP's bottleneck. Meanwhile sorting is constant space and O(n log n) time, ofc.
juanpa.arrivillaga
@wjandrea sorting is O(N) space in Python
wjandrea
@juanpa Oh, huh, I didn't realize that. I think I was thinking of a different sorting algo, sorry, so I deleted my comment as soon as I started reading about Timsort.