I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:

  1. check whether the i-th and (i-1)-th values are both in the range (between x and y);
  2. if #1 is satisfied, compute log(value[i] / value[i-1]) ** 2 for that row;
  3. if #1 is not satisfied, use 0 for that row;
  4. sum the results of #2 and #3 over each column.
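
Written as a formula, the steps above would read roughly as follows. This is a sketch under assumptions: v_i stands for the i-th value of a column with n rows, T is the column total, and the strict inequalities are taken from the loop shown further down rather than from the wording of step 1.

```latex
\[
T = \sum_{i=2}^{n} \mathbf{1}\left[\, x < v_i < y \ \text{and}\ x < v_{i-1} < y \,\right]
    \left( \log \frac{v_i}{v_{i-1}} \right)^{2}
\]
```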

Here is a dataframe with a single column:

import numpy as np
import pandas as pd

d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data=d)
df

x = 10 and y = 20

Here is what I can do for this single column:

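x, y = 10, 20
df["IsIn"] = 0  # row 0 has no previous value, so it contributes 0
for i in range(1, len(df)):
    # both the current and the previous value must lie strictly between x and y
    if (x < df.loc[i, "col1"] < y) and (x < df.loc[i - 1, "col1"] < y):
        df.loc[i, "IsIn"] = 1

df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()  # sum() skips the NaN in row 0
Total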

Ideally, I would have a (1 by n-cols) dataframe of Totals, one per column. How can I best achieve this? I would also appreciate it if you could supplement your answer with a detailed explanation.

1 Answer

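Yes, this is a case where apply works. You only need to wrap your logic in a function that takes a Series; apply then calls it once per column. You can also use between and shift on the condition to eliminate the loop. Note that between is inclusive at both ends by default, whereas the loop in the question uses strict inequalities, so a value equal to x or y counts as in range here; pass inclusive="neither" (pandas 1.3 or later) if you need the strict behaviour: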

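def func(s, x=10, y=20):
    '''
    Sum of squared log ratios of consecutive values in the Series `s`,
    counting only pairs where both values lie between x and y.
    '''

    # mask where values are between x and y (inclusive by default)
    valid = s.between(x, y)

    # keep a row only if the previous row is in range as well
    valid = valid & valid.shift(fill_value=False)

    # squared log ratio, zeroed where `valid` is False, then summed
    return (np.log(s / s.shift())**2 * valid).sum()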

# apply `func` on the columns
df.apply(func, x=10, y=20)

Output:

col1    0.222561
dtype: float64
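
With thousands of columns, it may also be worth skipping apply and operating on the whole frame at once, which gives the (1 by n-cols) layout asked for in the question. The following is only a minimal sketch, assuming every column is numeric and strictly positive; the function name masked_sq_log_totals and its strict flag are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def masked_sq_log_totals(df, x, y, strict=False):
    """Per-column sum of squared log ratios of consecutive rows, counting
    only rows where both the current and the previous value are in range."""
    prev = df.shift()  # previous row's values; the first row becomes NaN
    if strict:
        # strict bounds, as in the question's loop
        cur_ok = (df > x) & (df < y)
        prev_ok = (prev > x) & (prev < y)
    else:
        # inclusive bounds, matching the default of Series.between
        cur_ok = (df >= x) & (df <= y)
        prev_ok = (prev >= x) & (prev <= y)
    # comparisons against NaN are False, so the first row is never valid
    valid = cur_ok & prev_ok
    sq_log = np.log(df / prev) ** 2
    # replace invalid rows (including the NaN first row) with 0, then sum
    totals = sq_log.where(valid, 0.0).sum()
    return totals.to_frame().T  # one row, one column per input column


df = pd.DataFrame({"col1": [10, 15, 23, 16, 5, 14, 11, 4]})
result = masked_sq_log_totals(df, x=10, y=20)
print(result)
```

On the sample column this agrees with the 0.222561 from apply above; with strict=True only the 14 to 11 step qualifies, because 10 is no longer inside the range.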