
I want to remove duplicates from a list of lists. The first element is NOT always unique for each second element in the nested lists, but it is unique across the whole list of lists: each number occurs only once in the entire list of lists, and the numbers are not ordered.

my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]

Removing the duplicates is based on the second element in the nested list. I need the minimum value for each unique second element, like:

my_unique_list = [[1, 'A'], [3, 'B'], [4, 'C']]

It doesn't matter what order the output is in.

So, pick 1 for 'A' (as 1 is lower than 2 from [2, 'A']), 3 for 'B' (there are no other values for 'B'), and 4 for 'C' (as 4 is lower than 5, from [5, 'C']).

  • Are the first values always unique for each second value?

3 Answers


Use a dictionary to map unique letters (second values) to the minimum value for each letter, then simply take the [value, key] pairs from that dictionary as your output:

minimi = {}
inf = float('inf')
for val, key in my_list:
    # float('inf') as default value is always larger, so val is picked
    # if the key isn't present yet.
    minimi[key] = min(val, minimi.get(key, inf))

my_unique_list = [[v, k] for k, v in minimi.items()]

By using a dictionary as an intermediary, you can filter the input in linear time.

Demo:

>>> my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5,'C']]
>>> minimi, inf = {}, float('inf')
>>> for val, key in my_list:
...     minimi[key] = min(val, minimi.get(key, inf))
...
>>> minimi
{'C': 4, 'A': 1, 'B': 3}
>>> my_unique_list = [[v, k] for k, v in minimi.items()]
>>> my_unique_list
[[4, 'C'], [1, 'A'], [3, 'B']]
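
As an aside, an equivalent sketch that avoids the float('inf') sentinel by testing key membership explicitly (shown for illustration only; this variant is not the one benchmarked below):

minimi = {}
for val, key in my_list:
    # Store val if this key hasn't been seen yet, or if val is smaller
    # than the minimum recorded so far.
    if key not in minimi or val < minimi[key]:
        minimi[key] = val

my_unique_list = [[v, k] for k, v in minimi.items()]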

Why should you care about running time? Because as your input grows, so does your running time. For approaches that take O(N^2) (quadratic) time, as you go from 1000 items to 1 million (1000 times the size), your running time would increase by a factor of 1 million! For O(N log N) approaches (those that use sorting), the running time would increase by a factor of roughly 2000, while a linear approach as above would only take 1000 times as long, scaling linearly as your inputs scale.

For large inputs, that can make the difference between 'takes an hour or two' to 'takes millions of years'.
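
To put rough numbers on those growth factors, here is a small back-of-the-envelope sketch (illustrative only; real-world constants depend on the implementation):

import math

n_small, n_large = 1_000, 1_000_000

# Quadratic time grows with the square of the input size.
print(n_large ** 2 / n_small ** 2)            # 1,000,000x more work

# Sorting-based (N log N) time.
print((n_large * math.log2(n_large))
      / (n_small * math.log2(n_small)))       # roughly 2,000x more work

# Linear time grows directly with N.
print(n_large / n_small)                      # 1,000x more work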

Here is a time-trial comparison between this approach and zamir's sorting-and-set approach (O(N log N)) as well as TJC World's Pandas approach (also O(N log N)):

from string import ascii_uppercase
from functools import partial
from timeit import Timer
import random
import pandas as pd

def gen_data(N):
    return [[random.randrange(1_000_000), random.choice(ascii_uppercase)] for _ in range(N)]

def with_dict(l, _min=min, _inf=float('inf')):
    minimi = {}
    m_get = minimi.get
    for val, key in l:
        minimi[key] = _min(val, m_get(key, _inf))
    return [[v, k] for k, v in minimi.items()]

def with_set_and_sort(l):
    already_encountered = set()
    ae_add = already_encountered.add
    return [i for i in sorted(l) if i[1] not in already_encountered and not ae_add(i[1])]

def with_pandas(l):
    return (
        pd.DataFrame(l)
        .sort_values(by=0)
        .drop_duplicates(1)
        .to_numpy()
        .tolist()
    )

for n in (100, 1000, 10_000, 100_000, 1_000_000):
    testdata = gen_data(n)
    print(f"{n:,} entries:")
    for test in (with_dict, with_set_and_sort, with_pandas):
        count, total = Timer(partial(test, testdata)).autorange()
        print(f"{test.__name__:>20}: {total/count * 1000:8.3f}ms")
    print()

I've used all the little performance tricks I know of in there: avoiding repeated lookups of globals and attributes by caching them in local names outside of the loops.
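
For illustration only, here is the same caching idiom in isolation (a made-up helper, not part of the benchmark): binding a builtin to a default parameter and a bound method to a local name turns repeated global and attribute lookups into cheaper local lookups.

def running_minimum(numbers, _min=min):  # bind the builtin min once, at definition time
    result = []
    append = result.append              # cache the bound method outside the loop
    current = float('inf')
    for n in numbers:
        current = _min(current, n)      # local name lookup instead of a global lookup
        append(current)                 # local name lookup instead of an attribute lookup
    return result

# e.g. running_minimum([4, 1, 3, 2, 5]) -> [4, 1, 1, 1, 1]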

Running the time trial above produces:

100 entries:
           with_dict:    0.028ms
   with_set_and_sort:    0.032ms
         with_pandas:    2.070ms

1,000 entries:
           with_dict:    0.242ms
   with_set_and_sort:    0.369ms
         with_pandas:    2.312ms

10,000 entries:
           with_dict:    2.331ms
   with_set_and_sort:    5.639ms
         with_pandas:    5.476ms

100,000 entries:
           with_dict:   23.105ms
   with_set_and_sort:  127.772ms
         with_pandas:   40.330ms

1,000,000 entries:
           with_dict:  245.982ms
   with_set_and_sort: 2494.305ms
         with_pandas:  578.952ms

So, with only 100 inputs, the sorting approach may appear to be just as fast (the difference is only a few microseconds either way), but as the inputs grow, that approach loses ground at an accelerating pace.

Pandas loses out on all fronts here. Dataframes are a great tool, but the wrong tool for this problem. They are hefty data structures, so for small inputs their high overhead puts them into the millisecond range, way behind the other two options. At 10k entries the Pandas approach starts to beat the sorting-and-set approach, but even though dataframe operations are highly optimised, the growth of sorting runtime with larger inputs still can't beat a linear approach.



already_encountered = set()
my_new_list = [i for i in sorted(my_list) if i[1] not in already_encountered and not already_encountered.add(i[1])]

Output:

[[1, 'A'], [3, 'B'], [4, 'C']]
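
The filtering trick here relies on set.add() returning None: the "not already_encountered.add(i[1])" part is always true, so it never rejects anything by itself; it just records the letter as a side effect so that later duplicates fail the "not in" test. Sorting first ensures the row that survives for each letter is the one with the smallest number. A minimal sketch of the same idiom (names invented for illustration):

seen = set()
items = ['a', 'b', 'a', 'c', 'b']
# set.add() returns None (falsy), so "not seen.add(x)" is always True; its only
# job is to record x as a side effect while "x not in seen" does the filtering.
unique = [x for x in items if x not in seen and not seen.add(x)]
# unique is now ['a', 'b', 'c']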

5 Comments

Yes, so now that you added sorting, it takes O(N log N) time instead of linear time.
@MartijnPieters Your solution is taking more time on my machine: 10.7 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each), while mine takes 4.68 µs ± 570 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each).
How many inputs? Have you tested this with, say, 1000 items, then 1 million items? On small inputs, it is easy to call a few microseconds a win.
@MartijnPieters Actually, I have also tried it on a list of 1 million items; your execution time was 7.04 s while mine was 6.76 s, which is not a big performance boost (only a few hundred milliseconds).
I'm not talking about repetitions. I'm talking about the number of inputs. I've added a time trial to my answer; at 1 million items your approach takes nearly 10 times as long.

Using pandas:

>>> import pandas as pd
>>> my_list = [[1, 'A'], [2, 'A'], [3, 'B'], [4, 'C'], [5,'C']]
>>> df = pd.DataFrame(my_list)
>>> df.sort_values(by = 0).drop_duplicates(1).to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]
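
This works because drop_duplicates() keeps the first row for each duplicate key by default, and sorting by column 0 first makes that first row the minimum. The same call with the defaults spelled out as keyword arguments (behaviour should be identical):

>>> df.sort_values(by=0).drop_duplicates(subset=1, keep='first').to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]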

5 Comments

Sorting makes this an O(N log N) approach; as the number of inputs grows, linear-time approaches (such as using a dictionary to pick out the unique values) will win.
@MartijnPieters So a dictionary uses less time for sorting?
A dictionary doesn't need to sort. Adding or updating keys takes constant time. You have N inputs, so after N steps, each taking constant time, we have a dictionary with the minima. We then take another N steps to build the output list. Total time: 2N, but when talking about asymptotic runtime growth, the lower-order constants are dropped.
I've added your pandas approach to my time trial too. It shows that Pandas is way, way too heavy a hammer for this problem.
You can avoid sorting in Pandas too, by using groupby(): pd.DataFrame(my_list).groupby(1).min().reset_index()[[0, 1]].to_numpy().tolist() has linear performance (once you get past 1000 entries). Its fixed costs are still about 50% higher than simply using a dictionary, however; see the runnable sketch below.
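
For reference, a runnable sketch of that groupby() variant, expanded from the one-liner in the comment above (the output order comes from the grouping, which sorts the keys):

>>> import pandas as pd
>>> my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]
>>> pd.DataFrame(my_list).groupby(1).min().reset_index()[[0, 1]].to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]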
