1

I have a Pandas DataFrame containing a 2D array as a column looking something like the following:

Name    2DValueList
item 1  [ [ 0.0, 1.0 ], [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2  [ [ 0.0, 2.0 ], [ 0.0, 1.0 ], [ 0.0, 1.0 ] ]
item 3  [ [ 0.0, 1.0 ], [ 0.0, 3.0 ], [ 0.0, 5.0 ], [ 0.0, 1.0 ] ]
item 4  
item 5  [ [ 0.0, 4.0 ], [ 0.0, 1.0 ], [ 0.0, 2.0 ] ]

The first value isn't relative to this question so I've just made them all 0. I'm only interested in the second values. Also notice the amount of pairs can vary or be empty.

I want to be able to make a new dataframe that just contains the top (largest) n elements from the array.

It would look like this for the top 2 elements:

Name    2DValueList
item 1  [ [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2  [ [ 0.0, 2.0 ], [ 0.0, 1.0 ] ]
item 3  [ [ 0.0, 5.0 ], [ 0.0, 3.0 ] ]
item 4
item 5  [ [ 0.0, 4.0 ], [ 0.0, 2.0 ] ]

I would use pandas nlargest, but I'm not sure how to make it accept a column that is a 2D array.

In reality, the 2D array holds thousands of value pairs and there are tens of thousands of rows. I'm open to better ways to hold this data that would be more versatile.

2
  • Has each element of 2DValueList the same number of elements? Or could each 2D array have its own individual number of pairs? Commented Dec 2, 2019 at 10:01
  • each row will have a varying number of pairs if that makes sense Commented Dec 3, 2019 at 2:41

2 Answers 2

1

If every cell of 2DValueList is list of lists, the efficient way is using heapq.nlargest with itemgetter together with list comprehension

from heapq import nlargest
from operator import itemgetter

df['new_list'] = [nlargest(2, x, key=itemgetter(1)) for x in df['2DValueList']]

Out[119]:
     Name               2DValueList            new_list
0  item 1  [[0, 1], [0, 6], [0, 2]]    [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1], [0, 1]]    [[0, 2], [0, 1]]
2  item 3  [[0, 1], [0, 3], [0, 5]]    [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 1], [0, 2]]    [[0, 4], [0, 2]]

If each cell is a numpy 2darray, the above method still works fine. However, I think using numpy argsort would be better

df['new_list'] = [x[np.argsort(-x, axis=0)[:2,1]] for x in df['2DValueList']]

Out[128]:
     Name               2DValueList            new_list
0  item 1  [[0, 1], [0, 6], [0, 2]]    [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1], [0, 1]]    [[0, 2], [0, 1]]
2  item 3  [[0, 1], [0, 3], [0, 5]]    [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 1], [0, 2]]    [[0, 4], [0, 2]]

Lastly, if you don't need the top n largest sub-array in sorted order, argpartition would be faster than argsort

Sign up to request clarification or add additional context in comments.

1 Comment

Each cell is a list of lists, so i'll try your first method. Thanks!
0
import ast

df['2DValueList'] = df['2DValueList'].apply(ast.literal_eval).apply(lambda x: sorted(x,reverse=True)[:2])

     Name       2DValueList
0  item 1  [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1]]
2  item 3  [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 2]]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.