
I want to remove duplicates from a list of lists. The first element is NOT always unique for each second element in the nested lists, but it is unique across the whole list of lists: each number occurs only once in the entire list of lists, and the numbers are not ordered.

my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]

Removing the duplicates is based on the second element in the nested list. I need the minimum value for each unique second element, like:

my_unique_list = [[1, 'A'], [3, 'B'], [4, 'C']]

It doesn't matter what order the output is in.

So, pick 1 for 'A' (as 1 is lower than 2 from [2, 'A']), 3 for 'B' (there are no other values for 'B'), and 4 for 'C' (as 4 is lower than 5, from [5, 'C']).

  • Are the first values always unique for each second value?

3 Answers


Use a dictionary to map unique letters (second values) to the minimum value for each letter, then simply take the [value, key] pairs from that dictionary as your output:

minimi = {}
inf = float('inf')
for val, key in my_list:
    # float('inf') as default value is always larger, so val is picked
    # if the key isn't present yet.
    minimi[key] = min(val, minimi.get(key, inf))

my_unique_list = [[v, k] for k, v in minimi.items()]

By using a dictionary as an intermediary, you can filter the input in linear time.

Demo:

>>> my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5,'C']]
>>> minimi, inf = {}, float('inf')
>>> for val, key in my_list:
...     minimi[key] = min(val, minimi.get(key, inf))
...
>>> minimi
{'C': 4, 'A': 1, 'B': 3}
>>> my_unique_list = [[v, k] for k, v in minimi.items()]
>>> my_unique_list
[[4, 'C'], [1, 'A'], [3, 'B']]
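
As an aside, an equivalent sketch that avoids the float('inf') sentinel by testing key membership explicitly (shown for illustration only; this variant is not the one benchmarked below):

minimi = {}
for val, key in my_list:
    # Store val if this key hasn't been seen yet, or if val is smaller
    # than the minimum recorded so far.
    if key not in minimi or val < minimi[key]:
        minimi[key] = val

my_unique_list = [[v, k] for k, v in minimi.items()]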

Why should you care about running time? Because as your input grows, so does your running time. For approaches that take O(N^2) (quadratic) time, as you go from 1000 items to 1 million (1000 times the size), your running time would increase by a factor of 1 million! For O(N log N) approaches (those that use sorting), the running time would increase by a factor of roughly 2000, while a linear approach as above would only take 1000 times as long, scaling linearly as your inputs scale.

For large inputs, that can make the difference between 'takes an hour or two' to 'takes millions of years'.
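
To put rough numbers on those growth factors, here is a small back-of-the-envelope sketch (illustrative only; real-world constants depend on the implementation):

import math

n_small, n_large = 1_000, 1_000_000

# Quadratic time grows with the square of the input size.
print(n_large ** 2 / n_small ** 2)            # 1,000,000x more work

# Sorting-based (N log N) time.
print((n_large * math.log2(n_large))
      / (n_small * math.log2(n_small)))       # roughly 2,000x more work

# Linear time grows directly with N.
print(n_large / n_small)                      # 1,000x more work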

Here is a time-trial comparison between this approach and zamir's sorting-and-set approach (O(N log N)) as well as TJC World's Pandas approach (also O(N log N)):

from string import ascii_uppercase
from functools import partial
from timeit import Timer
import random
import pandas as pd

def gen_data(N):
    return [[random.randrange(1_000_000), random.choice(ascii_uppercase)] for _ in range(N)]

def with_dict(l, _min=min, _inf=float('inf')):
    minimi = {}
    m_get = minimi.get
    for val, key in l:
        minimi[key] = _min(val, m_get(key, _inf))
    return [[v, k] for k, v in minimi.items()]

def with_set_and_sort(l):
    already_encountered = set()
    ae_add = already_encountered.add
    return [i for i in sorted(l) if i[1] not in already_encountered and not ae_add(i[1])]

def with_pandas(l):
    return (
        pd.DataFrame(l)
        .sort_values(by=0)
        .drop_duplicates(1)
        .to_numpy()
        .tolist()
    )

for n in (100, 1000, 10_000, 100_000, 1_000_000):
    testdata = gen_data(n)
    print(f"{n:,} entries:")
    for test in (with_dict, with_set_and_sort, with_pandas):
        count, total = Timer(partial(test, testdata)).autorange()
        print(f"{test.__name__:>20}: {total/count * 1000:8.3f}ms")
    print()

I've used all the little performance tricks I know of in there: avoiding repeated lookups of globals and attributes by caching them in local names outside of the loops.
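
For illustration only, here is the same caching idiom in isolation (a made-up helper, not part of the benchmark): binding a builtin to a default parameter and a bound method to a local name turns repeated global and attribute lookups into cheaper local lookups.

def running_minimum(numbers, _min=min):  # bind the builtin min once, at definition time
    result = []
    append = result.append              # cache the bound method outside the loop
    current = float('inf')
    for n in numbers:
        current = _min(current, n)      # local name lookup instead of a global lookup
        append(current)                 # local name lookup instead of an attribute lookup
    return result

# e.g. running_minimum([4, 1, 3, 2, 5]) -> [4, 1, 1, 1, 1]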

Running the time trial above produces:

100 entries:
           with_dict:    0.028ms
   with_set_and_sort:    0.032ms
         with_pandas:    2.070ms

1,000 entries:
           with_dict:    0.242ms
   with_set_and_sort:    0.369ms
         with_pandas:    2.312ms

10,000 entries:
           with_dict:    2.331ms
   with_set_and_sort:    5.639ms
         with_pandas:    5.476ms

100,000 entries:
           with_dict:   23.105ms
   with_set_and_sort:  127.772ms
         with_pandas:   40.330ms

1,000,000 entries:
           with_dict:  245.982ms
   with_set_and_sort: 2494.305ms
         with_pandas:  578.952ms

So, with only 100 inputs, the sorting approach may appear to be just as fast (the difference is only a few microseconds either way), but as the inputs grow, that approach loses ground at an accelerating pace.

Pandas loses out on all fronts here. Dataframes are a great tool, but the wrong tool for this problem. They are hefty data structures, so for small inputs their high overhead puts them into the millisecond range, way behind the other two options. At 10k entries the Pandas approach starts to beat the sorting-and-set approach, but even though dataframe operations are highly optimised, the growth of sorting runtime with larger inputs still can't beat a linear approach.



already_encountered = set()
my_new_list = [i for i in sorted(my_list) if i[1] not in already_encountered and not already_encountered.add(i[1])]

Output:

[[1, 'A'], [3, 'B'], [4, 'C']]
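
The filtering trick here relies on set.add() returning None: the "not already_encountered.add(i[1])" part is always true, so it never rejects anything by itself; it just records the letter as a side effect so that later duplicates fail the "not in" test. Sorting first ensures the row that survives for each letter is the one with the smallest number. A minimal sketch of the same idiom (names invented for illustration):

seen = set()
items = ['a', 'b', 'a', 'c', 'b']
# set.add() returns None (falsy), so "not seen.add(x)" is always True; its only
# job is to record x as a side effect while "x not in seen" does the filtering.
unique = [x for x in items if x not in seen and not seen.add(x)]
# unique is now ['a', 'b', 'c']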

5 Comments

Yes, so now that you added sorting, it takes O(N log N) time instead of linear time.
@MartijnPieters Your solution is taking more time on my machine: 10.7 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each), while mine takes 4.68 µs ± 570 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each).
How many inputs? Have you tested this with, say, 1000 items, then 1 million items? On small inputs, it is easy to call a few microseconds a win.
@MartijnPieters Actually, I have also tried it on a list of 1 million items; your execution time was 7.04 s while mine was 6.76 s, which is not a big performance boost (only a few hundred milliseconds).
I'm not talking about repetitions. I'm talking about the number of inputs. I've added a time trial to my answer; at 1 million items your approach takes nearly 10 times as long.

Using pandas:

>>> import pandas as pd
>>> my_list = [[1, 'A'], [2, 'A'], [3, 'B'], [4, 'C'], [5,'C']]
>>> df = pd.DataFrame(my_list)
>>> df.sort_values(by = 0).drop_duplicates(1).to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]
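
This works because drop_duplicates() keeps the first row for each duplicate key by default, and sorting by column 0 first makes that first row the minimum. The same call with the defaults spelled out as keyword arguments (behaviour should be identical):

>>> df.sort_values(by=0).drop_duplicates(subset=1, keep='first').to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]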

5 Comments

Sorting makes this an O(N log N) approach; as the number of inputs grows, linear-time approaches (such as using a dictionary to pick out the unique values) will win.
@MartijnPieters So a dictionary uses less time for sorting?
A dictionary doesn't need to sort. Adding or updating keys takes constant time. You have N inputs, so after N steps, each taking constant time, we have a dictionary with the minima. We then take another N steps to build the output list. Total time: 2N, but when talking about asymptotic runtime growth, the lower-order constants are dropped.
I've added your pandas approach to my time trial too. It shows that Pandas is way, way too heavy a hammer for this problem.
You can avoid sorting in Pandas too, by using groupby(): pd.DataFrame(my_list).groupby(1).min().reset_index()[[0, 1]].to_numpy().tolist() has linear performance (once you get past 1000 entries). Its fixed costs are still about 50% higher than simply using a dictionary, however; see the runnable sketch below.
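
For reference, a runnable sketch of that groupby() variant, expanded from the one-liner in the comment above (the output order comes from the grouping, which sorts the keys):

>>> import pandas as pd
>>> my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]
>>> pd.DataFrame(my_list).groupby(1).min().reset_index()[[0, 1]].to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]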
