Pandas working with 2D array inside dataframe

Question

I have a Pandas DataFrame containing a 2D array as a column looking something like the following:

Name    2DValueList
item 1  [ [ 0.0, 1.0 ], [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2  [ [ 0.0, 2.0 ], [ 0.0, 1.0 ], [ 0.0, 1.0 ] ]
item 3  [ [ 0.0, 1.0 ], [ 0.0, 3.0 ], [ 0.0, 5.0 ], [ 0.0, 1.0 ] ]
item 4  
item 5  [ [ 0.0, 4.0 ], [ 0.0, 1.0 ], [ 0.0, 2.0 ] ]

The first value isn't relative to this question so I've just made them all 0. I'm only interested in the second values. Also notice the amount of pairs can vary or be empty.

I want to be able to make a new dataframe that just contains the top (largest) n elements from the array.

It would look like this for the top 2 elements:

Name    2DValueList
item 1  [ [ 0.0, 6.0 ], [ 0.0, 2.0 ] ]
item 2  [ [ 0.0, 2.0 ], [ 0.0, 1.0 ] ]
item 3  [ [ 0.0, 5.0 ], [ 0.0, 3.0 ] ]
item 4
item 5  [ [ 0.0, 4.0 ], [ 0.0, 2.0 ] ]

I would use pandas nlargest, but I'm not sure how to make it accept a column that is a 2D array.

In reality, the 2D array holds thousands of value pairs and there are tens of thousands of rows. I'm open to better ways to hold this data that would be more versatile.

Has each element of 2DValueList the same number of elements? Or could each 2D array have its own individual number of pairs? — mrzo
– mrzo, Commented Dec 2, 2019 at 10:01
each row will have a varying number of pairs if that makes sense — Rory LM
– Rory LM, Commented Dec 3, 2019 at 2:41

Andy L. · Accepted Answer · 2019-12-02 13:14:09Z

1

If every cell of 2DValueList is list of lists, the efficient way is using heapq.nlargest with itemgetter together with list comprehension

from heapq import nlargest
from operator import itemgetter

df['new_list'] = [nlargest(2, x, key=itemgetter(1)) for x in df['2DValueList']]

Out[119]:
     Name               2DValueList            new_list
0  item 1  [[0, 1], [0, 6], [0, 2]]    [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1], [0, 1]]    [[0, 2], [0, 1]]
2  item 3  [[0, 1], [0, 3], [0, 5]]    [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 1], [0, 2]]    [[0, 4], [0, 2]]

If each cell is a numpy 2darray, the above method still works fine. However, I think using numpy argsort would be better

df['new_list'] = [x[np.argsort(-x, axis=0)[:2,1]] for x in df['2DValueList']]

Out[128]:
     Name               2DValueList            new_list
0  item 1  [[0, 1], [0, 6], [0, 2]]    [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1], [0, 1]]    [[0, 2], [0, 1]]
2  item 3  [[0, 1], [0, 3], [0, 5]]    [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 1], [0, 2]]    [[0, 4], [0, 2]]

Lastly, if you don't need the top n largest sub-array in sorted order, argpartition would be faster than argsort

edited Dec 2, 2019 at 13:14

answered Dec 2, 2019 at 12:34

Andy L.

25.3k4 gold badges20 silver badges30 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Rory LM Over a year ago

Each cell is a list of lists, so i'll try your first method. Thanks!

iamklaus · Accepted Answer · 2019-12-02 10:06:53Z

0

import ast

df['2DValueList'] = df['2DValueList'].apply(ast.literal_eval).apply(lambda x: sorted(x,reverse=True)[:2])

     Name       2DValueList
0  item 1  [[0, 6], [0, 2]]
1  item 2  [[0, 2], [0, 1]]
2  item 3  [[0, 5], [0, 3]]
3  item 4  [[0, 4], [0, 2]]

answered Dec 2, 2019 at 10:06

iamklaus

3,7682 gold badges14 silver badges21 bronze badges

Collectives™ on Stack Overflow

Pandas working with 2D array inside dataframe

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related