I'm trying to extract data from DataFrames as individual NumPy arrays to pass to SciPy stats methods.

Example DataFrame:

userId  numCol
147     1.3 
222     2.6
389     5.7 
443     1.2 
222     2.4
678     2.1
443     1.8
501     2.1
147     1.2
501     3.2
678     1.3
389     2.4

Of the 6 unique userIds, let's say I only want to extract 4 separate arrays containing the numCol values for userIds 147, 222, 389 and 443.

The output would look like this:

Array name 147: array([1.3, 1.2])
Array name 222: array([2.6, 2.4])
Array name 389: array([5.7, 2.4])
Array name 443: array([1.2, 1.8])

I'm wondering if the best approach would be to create a list of the userIds I want, then loop through the DataFrame using pandas' isin and NumPy's .values.
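
Roughly what I have in mind is something like the sketch below (wanted_ids is just a placeholder name), recreating the example frame above:

import pandas as pd

# the example data from above
df = pd.DataFrame({
    'userId': [147, 222, 389, 443, 222, 678, 443, 501, 147, 501, 678, 389],
    'numCol': [1.3, 2.6, 5.7, 1.2, 2.4, 2.1, 1.8, 2.1, 1.2, 3.2, 1.3, 2.4],
})

wanted_ids = [147, 222, 389, 443]
subset = df[df.userId.isin(wanted_ids)]    # keep only rows for the wanted userIds
arrays = {uid: grp.numCol.values for uid, grp in subset.groupby('userId')}
# arrays[147] -> array([1.3, 1.2]), arrays[222] -> array([2.6, 2.4]), ...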

I've looked at this similar question closely and it's not the same.

1 Answer

You can get the rows corresponding to a particular userId with something like df[df.userId == 147]. So if you have a list of userIds you want, you could do something like:

from scipy import stats

for userId in userIds_to_check:
    stats.anderson(df[df.userId == userId].numCol)

(or whatever function you want to call instead of anderson). Note that usually you don't need to get a plain numpy array; you can call most stats functions on a pandas Series and they'll work just fine. If you do want a plain numpy array for some reason, you can do df[df.userId == userId].numCol.values.
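
If you'd rather hold on to the arrays themselves, one option (just a sketch, using the same hypothetical userIds_to_check list as above) is to collect them into a dict keyed by userId:

userIds_to_check = [147, 222, 389, 443]
arrays = {userId: df[df.userId == userId].numCol.values
          for userId in userIds_to_check}
arrays[147]  # -> array([1.3, 1.2])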

Depending on what you're doing, you may want to just use groupby, which lets you map a function onto every userId group, something like:

>>> df.groupby('userId').numCol.apply(stats.skew)
userId
147    0.000000e+00
222    0.000000e+00
389    3.954380e-16
443    0.000000e+00
501   -1.251190e-15
678   -8.673617e-16
Name: numCol, dtype: float64

Here I computed the skewness of the numCol values for every userId all in one fell swoop by applying stats.skew to each group.
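
Not in the original answer, but as a further sketch along the same lines: applying np.asarray instead of a stats function should hand back the raw per-user arrays (the values shown come from the example data above):

>>> import numpy as np
>>> arrays_by_user = df.groupby('userId').numCol.apply(np.asarray)
>>> arrays_by_user[147]
array([1.3, 1.2])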

7 Comments

Brilliant answer, learnt so much! Would you mind just adding how to also extract the individual arrays, as I would like to have the flexibility depending on what I'm doing. Thanks.
@Jonathan: The individual array is the df[df.userId == blah].numCol, where blah is whatever userId you want the values of.
Thanks again - but the original idea was to extract the individual arrays through an iterative process, not just extract them one at a time. Sorry for any confusion.
@Jonathan: Then just do what I did in my example with the for loop but remove the call to stats.anderson. That is: for userId in userIds_to_check: do_whatever_you_want_with(df[df.userId == userId].numCol)
for userId in userIds_to_check: df[df.userId == userId].numCol.values. But you cannot "return" something from a for loop. You have to do something with each value in the loop. Do you want to accumulate the numpy arrays into a list? If you are having trouble, please edit your question to say what exactly you want to happen to each numpy array, show the code you're using, and say what it's not doing that you want it to do.