I'm trying to extract data from DataFrames as individual NumPy arrays to pass to SciPy stats methods.

Example DataFrame:

userId  numCol
147     1.3 
222     2.6
389     5.7 
443     1.2 
222     2.4
678     2.1
443     1.8
501     2.1
147     1.2
501     3.2
678     1.3
389     2.4

Of the 6 unique userIds, let's say I only want to extract 4 separate arrays containing the numCol values for userIds 147, 222, 389 and 443.

The output would look like this:

Array name 147: array([1.3, 1.2])
Array name 222: array([2.6, 2.4])
Array name 389: array([5.7, 2.4])
Array name 443: array([1.2, 1.8])

I'm wondering if the best approach would be to create a list of the userIds I want, then loop through the DataFrame using pandas' isin and NumPy's .values.
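
Roughly what I have in mind is something like the sketch below (wanted_ids is just a placeholder name), recreating the example frame above:

import pandas as pd

# the example data from above
df = pd.DataFrame({
    'userId': [147, 222, 389, 443, 222, 678, 443, 501, 147, 501, 678, 389],
    'numCol': [1.3, 2.6, 5.7, 1.2, 2.4, 2.1, 1.8, 2.1, 1.2, 3.2, 1.3, 2.4],
})

wanted_ids = [147, 222, 389, 443]
subset = df[df.userId.isin(wanted_ids)]    # keep only rows for the wanted userIds
arrays = {uid: grp.numCol.values for uid, grp in subset.groupby('userId')}
# arrays[147] -> array([1.3, 1.2]), arrays[222] -> array([2.6, 2.4]), ...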

I've looked at this similar question closely and it's not the same.

1 Answer

You can get the rows corresponding to a particular userId with something like df[df.userId == 147]. So if you have a list of userIds you want, you could do something like:

from scipy import stats

for userId in userIds_to_check:
    stats.anderson(df[df.userId == userId].numCol)

(or whatever function you want to call instead of anderson). Note that usually you don't need to get a plain numpy array; you can call most stats functions on a pandas Series and they'll work just fine. If you do want a plain numpy array for some reason, you can do df[df.userId == userId].numCol.values.
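
If you'd rather hold on to the arrays themselves, one option (just a sketch, using the same hypothetical userIds_to_check list as above) is to collect them into a dict keyed by userId:

userIds_to_check = [147, 222, 389, 443]
arrays = {userId: df[df.userId == userId].numCol.values
          for userId in userIds_to_check}
arrays[147]  # -> array([1.3, 1.2])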

Depending on what you're doing, you may want to just use groupby, which lets you map a function onto every userId group, something like:

>>> df.groupby('userId').numCol.apply(stats.skew)
userId
147    0.000000e+00
222    0.000000e+00
389    3.954380e-16
443    0.000000e+00
501   -1.251190e-15
678   -8.673617e-16
Name: numCol, dtype: float64

Here I computed the skewness of the numCol values for every userId all in one fell swoop by applying stats.skew to each group.
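
Not in the original answer, but as a further sketch along the same lines: applying np.asarray instead of a stats function should hand back the raw per-user arrays (the values shown come from the example data above):

>>> import numpy as np
>>> arrays_by_user = df.groupby('userId').numCol.apply(np.asarray)
>>> arrays_by_user[147]
array([1.3, 1.2])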

7 Comments

Brilliant answer, learnt so much! Would you mind just adding how to also extract the individual arrays, as I would like to have the flexibility depending on what I'm doing. Thanks.
@Jonathan: The individual array is the df[df.userId == blah].numCol, where blah is whatever userId you want the values of.
Thanks again - but the original idea was to extract the individual arrays through an iterative process, not just extract them one at a time. Sorry for any confusion.
@Jonathan: Then just do what I did in my example with the for loop but remove the call to stats.anderson. That is: for userId in userIds_to_check: do_whatever_you_want_with(df[df.userId == userId].numCol)
for userId in userIds_to_check: df[df.userId == userId].numCol.values. But you cannot "return" something from a for loop. You have to do something with each value in the loop. Do you want to accumulate the numpy arrays into a list? If you are having trouble, please edit your question to say what exactly you want to happen to each numpy array, show the code you're using, and say what it's not doing that you want it to do.