
I have a list of lists with an item name in element 0 and an associated value in element 1. Each item can appear more than once. I would like to create a list of unique items, each paired with the maximum value associated with it. My code accomplishes this, but it seems very inefficient. Also, this is a simplified example; mylist could be 100,000 rows. Any suggestions for improving efficiency?

mylist = [['Item 1', 12],['Item 1', 10], ['Item 3', 12],['Item 4', 10], ['Item 3', 14]]

# get unique items
my_unique_items = list(set(x[0] for x in mylist))

# make it a list of list
my_unique_items = [[x] for x in my_unique_items]

# iterate over list items
for item in my_unique_items:

    # do list comp to get max value and append
    item.append(max([x[1] for x in mylist if x[0] == item[0]]))

print my_unique_items

3 Answers


It would be more efficient to only loop through mylist once. If you only care about the max value for each item key, just keep a mapping of items and their max values and compare them as you go through the list.

This has a worst case of O(n), whereas your original had a worst case of O(n^2).

item_maxes = {}
for item in mylist:
    max_value = item_maxes.setdefault(item[0], None)
    if max_value is None or item[1] > max_value:
        item_maxes[item[0]] = item[1]

Edit: I think ShadowRanger's version of this method is much cleaner looking:

max_vals = {}
for item, value in mylist:
    max_vals[item] = max(max_vals.get(item, value), value)

3 Comments

Note that the last line should assign item[1] to the item_maxes dict, not item[0].
I tested all the loops over 1,000,000 rows. Mine took 18 seconds; yours was the fastest at 0.4 seconds, so I chose it as the best answer. @ShadowRanger's was 0.6 seconds.
@user2242044: Not wholly surprising; the costs to use .get and max every run of the loop are a lot higher than you might expect. Amusingly, stupid hacks like caching bound methods and built-in functions to more local scope will end up changing the performance quite a bit, e.g. adding: max_vals_get = max_vals.get and _max = max just outside the loop, then doing: max_vals[item] = _max(max_vals_get(item, value), value) would likely reduce runtime for large inputs.
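A minimal sketch of the caching trick described in the comment above, using the question's sample data (the only change versus the answer's loop is the two cached names):

```python
mylist = [['Item 1', 12], ['Item 1', 10], ['Item 3', 12],
          ['Item 4', 10], ['Item 3', 14]]

max_vals = {}
max_vals_get = max_vals.get  # cache the bound method in local scope
_max = max                   # cache the built-in in local scope
for item, value in mylist:
    # identical logic to the answer, minus the repeated name lookups
    max_vals[item] = _max(max_vals_get(item, value), value)

print(max_vals)  # {'Item 1': 12, 'Item 3': 14, 'Item 4': 10}
```

Whether this is worth the readability cost is a judgment call; it only pays off on large inputs, where the per-iteration attribute and global lookups add up.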

If the inputs are already sorted (or you want the outputs sorted), a nice way to do this is with itertools.groupby:

from future_builtins import map  # On Python 2.x only, to get generator based map

from itertools import groupby
from operator import itemgetter

# Nicer names, and avoid recreating getvalue on each loop
getitem, getvalue = itemgetter(0), itemgetter(1)

# If not already sorted, must sort by same key we're grouping on:
mylist.sort(key=getitem)

max_vals = [(k, max(map(getvalue, g))) for k, g in groupby(mylist, key=getitem)]
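For reference, here is a self-contained Python 3 run of the same groupby approach on the question's sample data (on Python 3 the built-in map is already lazy, so the future_builtins import is unnecessary):

```python
from itertools import groupby
from operator import itemgetter

mylist = [['Item 1', 12], ['Item 1', 10], ['Item 3', 12],
          ['Item 4', 10], ['Item 3', 14]]

getitem, getvalue = itemgetter(0), itemgetter(1)

# groupby only groups consecutive equal keys, so sort by the same key first
mylist.sort(key=getitem)

max_vals = [(k, max(map(getvalue, g))) for k, g in groupby(mylist, key=getitem)]
print(max_vals)  # [('Item 1', 12), ('Item 3', 14), ('Item 4', 10)]
```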

If you don't care about order, and your items are hashable, a dict is generally going to be faster (it might use slightly more memory if most items are unique):

max_vals = {}
for item, value in mylist:
    max_vals[item] = max(max_vals.get(item, value), value)

Comments


Using groupby from the itertools module and itemgetter from the operator module:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> d = {}
>>> for g, data in groupby(sorted(mylist, key=itemgetter(0)), key=itemgetter(0)):
...     d[g] = max(list(zip(*data))[1])
... 
>>> d
{'Item 1': 12, 'Item 3': 14, 'Item 4': 10}

You can also use itertools.islice instead of the list constructor and a normal slice operation:

>>> from itertools import islice
>>> for g, data in groupby(sorted(mylist, key=itemgetter(0)), key=itemgetter(0)):
...     d[g] = max(*islice(zip(*data), 1, None))
... 
>>> d
{'Item 1': 12, 'Item 3': 14, 'Item 4': 10}

Comments
