I have a data frame with multiple columns, where each column is a time series of some variable. I only want to keep the columns that are significant by some metric, i.e. the subset of columns such that, for each column, either

  1. the average (over all rows) is greater than x, or
  2. the max (over all rows) is greater than x

    i | col1 | col2 | col3 | ...
    0 | 0.10 | 0.50 | 0.30 | ...
    1 | 0.09 | 0.40 | 0.40 | ...
    2 | 0.08 | 0.45 | 0.36 | ...

For example, from the table above, the condition column_avg > 0.2 should select only [col2, col3], and the condition column_avg > 0.4 should select only col2.

Similarly, I want to be able to condition on each column's min or max instead of its average.

  • Is x the same for all columns? Commented Aug 19, 2019 at 20:46
  • Yes, the same condition applies to all columns. Commented Aug 19, 2019 at 20:47

2 Answers

Try this:

df2 = df[df.columns[df.mean(axis=0) > 0.2]]
df3 = df[df.columns[df.max(axis=0) > 0.4]]

df.min works the same way.
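
To make the one-liners above concrete, here is a minimal sketch that runs them on the three rows shown in the question. The variable names by_mean, by_max and by_min, and the 0.2 threshold used for the min case, are assumptions made for illustration only.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "col1": [0.10, 0.09, 0.08],
    "col2": [0.50, 0.40, 0.45],
    "col3": [0.30, 0.40, 0.36],
})

# df.mean(axis=0) is a Series with one value per column; comparing it
# with a threshold gives a boolean mask over the column labels.
by_mean = df[df.columns[df.mean(axis=0) > 0.2]]
by_max = df[df.columns[df.max(axis=0) > 0.4]]
by_min = df[df.columns[df.min(axis=0) > 0.2]]  # illustrative threshold

print(list(by_mean.columns))  # ['col2', 'col3']
print(list(by_max.columns))   # ['col2']
print(list(by_min.columns))   # ['col2', 'col3']
```

Note that col3 is dropped by the max condition because its maximum is exactly 0.4 and the comparison is strict.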

1 Comment

Why not just use df2 = df.loc[:, df.mean(axis=0) > 0.2]? I think it's more straightforward than going through df.columns.
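
The df.loc form with a boolean mask also makes it easy to combine the two conditions from the question (average greater than x, or max greater than x). The following is only a sketch: the threshold 0.38 and the names mean_mask, max_mask and either are assumptions chosen so that the two conditions give different results on the sample data.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "col1": [0.10, 0.09, 0.08],
    "col2": [0.50, 0.40, 0.45],
    "col3": [0.30, 0.40, 0.36],
})

x = 0.38  # illustrative threshold, not taken from the question

# One boolean per column, indexed by column name.
mean_mask = df.mean(axis=0) > x
max_mask = df.max(axis=0) > x

# Select all rows and only the columns where the mask is True.
by_mean = df.loc[:, mean_mask]

# Element-wise OR keeps a column if either condition holds.
either = df.loc[:, mean_mask | max_mask]

print(list(by_mean.columns))  # ['col2']
print(list(either.columns))   # ['col2', 'col3']
```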

If you want to get every column with a mean over .4:

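means = df.mean()  # per-column mean, indexed by column name
x = 0.4            # threshold
# keep the names of the columns whose mean exceeds the threshold
useful_cols = [col for col, m in means.items() if m > x]
df2 = df[useful_cols]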

For max, replace df.mean() with df.max().

Please tell me if there's something that needs explanation here.
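
Since only the statistic changes between the mean, min and max cases, the selection can be wrapped in a small helper. This is a sketch under assumptions: select_columns is a hypothetical name, not part of pandas, and it relies on DataFrame.agg accepting an aggregation name such as "mean", "min" or "max".

```python
import pandas as pd


def select_columns(df, how, x):
    """Return the columns of df whose per-column statistic exceeds x.

    `how` is an aggregation name understood by DataFrame.agg,
    for example "mean", "min" or "max". (Hypothetical helper.)
    """
    stats = df.agg(how)  # Series: one value per column
    return df[stats[stats > x].index]


# Sample data copied from the table in the question.
df = pd.DataFrame({
    "col1": [0.10, 0.09, 0.08],
    "col2": [0.50, 0.40, 0.45],
    "col3": [0.30, 0.40, 0.36],
})

print(list(select_columns(df, "mean", 0.4).columns))  # ['col2']
print(list(select_columns(df, "min", 0.2).columns))   # ['col2', 'col3']
```

Filtering the Series of statistics directly (stats[stats > x].index) does the same job as the list comprehension over names and values.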
