I have a dataframe with 3 columns:

df:

x       y      z
334     290    3350.0
334     291    3350.5
334     292    3360.1
335     292    3360.1
335     292    3360.1
335     290    3351.0
335     290    3352.5
335     291    3333.1
335     291    3333.1
.
.

I'd like to check and parse values of each row from row = n to row = n+7 into a new dataframe based on a couple of conditions:

  1. df[n] != df[n+1]
  2. df[n] != df[n+3]
  3. df[n] != df[n+5]
  4. df['x'][n] < df['x'][n+2]
  5. df['x'][n] > df['x'][n+3]

If all of these are satisfied I want to write a new dataframe:

df_new = pd.concat([df[n], df[n+1], df[n+2], df[n+3], 
df[n+4], df[n+5], df[n+6], df[n+7]])

So the algorithm + output would look like:

for n = 0:
1) [334     290    3350.0] != [334     291    3350.5]  True
2) [334     290    3350.0] != [335     292    3360.1]  True
3) [334     290    3350.0] != [335     290    3351.0]  True
4) 334 < 334  False
5) 334 > 335  False

So in this case the first iteration is skipped, and the check moves on to the next row, continuing down the entire length of the dataframe and collecting matches.

df_new(first iteration) = df_new.concat([....]) = empty row values

Is there an easy way to do this with speed in Pandas?

  • Not sure what you are trying to achieve, but this looks quite complex and prone to errors. Isn't there another way to achieve what you are trying to do? Commented May 27, 2019 at 18:59
  • I think it looks a bit more complicated than it actually is. Really I'm just trying to compare X, Y, Z values (together as a single row of 3 columns) or just X values (as a single row of 1 column) and making conditions on them. Is there anything helpful I can provide to help understand better? Commented May 27, 2019 at 19:04
  • I understand what you're doing, but imo try to take a step back and think about solving it differently. But maybe im wrong Commented May 27, 2019 at 19:14

2 Answers

4

A. Get the appropriate shifts:

    n1 = df.shift(-1)
    n2 = df.shift(-2)
    n3 = df.shift(-3)
    n5 = df.shift(-5)

B. Satisfy conditions 1, 2 and 3:

cond = (df != n1) & (df != n3) & (df != n5)

C. Satisfy conditions 4, 5:

cond['holder'] = (df.x < n2.x) & (df.x > n3.x)

D. Get bool series (we want any row with all 'True'):

boolidx = cond.all(axis=1)

E. Use to get result:

df.loc[boolidx]
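
Note that `df != n1` compares element by element, so `cond.all(axis=1)` requires every column to differ. If two rows should count as different when at least one column differs, the comparison can be reduced per row with `any(axis=1)` first. The following is a minimal sketch under that assumption; the function name `matching_windows` and the `length` parameter are hypothetical, and the sample data is the question's data with one x value changed so that a match exists.

```python
import pandas as pd

def matching_windows(df, length=8):
    """Return rows n..n+length-1 for every start row n meeting all five conditions."""
    n1, n2, n3, n5 = (df.shift(-k) for k in (1, 2, 3, 5))

    def differs(other):
        # Two rows differ if at least one column differs
        return (df != other).any(axis=1)

    cond = (differs(n1) & differs(n3) & differs(n5)
            & (df['x'] < n2['x']) & (df['x'] > n3['x']))
    # A start row is only valid if a complete window of `length` rows follows it
    cond.iloc[max(len(df) - length + 1, 0):] = False
    starts = [i for i, ok in enumerate(cond) if ok]
    if not starts:
        return df.iloc[0:0]          # empty frame with the same columns
    return pd.concat([df.iloc[s:s + length] for s in starts])

sample = pd.DataFrame(
    [[334, 290, 3350.0], [334, 291, 3350.5], [334, 292, 3360.1],
     [335, 292, 3360.1], [335, 292, 3360.1], [333, 290, 3351.0],
     [335, 290, 3352.5], [335, 291, 3333.1], [335, 291, 3333.1]],
    columns=['x', 'y', 'z'])

result = matching_windows(sample, length=7)
print(result)
```

Rows shifted past the end of the frame become NaN, so comparisons against them are False for conditions 4 and 5; the explicit mask on the tail additionally guarantees that every returned window is complete. Overlapping matches are all kept in this sketch.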

1 Comment

cond.any(axis=1) or cond.all(axis=1)?
0

I slightly changed your sample data so that there is one positive result:

df = pd.DataFrame(data=[
    [ 334, 290, 3350.0 ],
    [ 334, 291, 3350.5 ],
    [ 334, 292, 3360.1 ],
    [ 335, 292, 3360.1 ],
    [ 335, 292, 3360.1 ],
    [ 333, 290, 3351.0 ],
    [ 335, 290, 3352.5 ],
    [ 335, 291, 3333.1 ],
    [ 335, 291, 3333.1 ]], columns=['x', 'y', 'z'])

Then, for efficiency reasons, I defined the following function:

def roll_win(a, win):
    shape = (a.shape[0] - win + 1, win, a.shape[1])
    strides = (a.strides[0],) + a.strides
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

It generates a 3-D table, where 2nd and 3rd dimension are "rolling windows" from the source Numpy array a. The size of window is win, sliding vertically. This way, processing of consecutive windows requires a loop running along the first axis of the generated table (see below).
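
To make the shape of the result concrete, here is a small sketch that applies roll_win to a toy 4x3 array (the toy array is an assumption for illustration). It also compares the result with `sliding_window_view`, which assumes NumPy 1.20 or later and is a bounds-checked way to get the same windows.

```python
import numpy as np

def roll_win(a, win):
    shape = (a.shape[0] - win + 1, win, a.shape[1])
    strides = (a.strides[0],) + a.strides
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(12, dtype=float).reshape(4, 3)   # toy data: 4 rows, 3 columns
tbl = roll_win(a, 2)
print(tbl.shape)   # (3, 2, 3): 3 windows, each 2 rows by 3 columns
print(tbl[1])      # the second window, i.e. rows 1 and 2 of a

# Assumption: NumPy >= 1.20. The window axis comes last, so it is moved
# to the middle to match the layout produced by roll_win.
alt = np.lib.stride_tricks.sliding_window_view(a, 2, axis=0).transpose(0, 2, 1)
print(np.array_equal(tbl, alt))
```

Both variants return views on `a`, so writing into a window would also modify the source array.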

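Because as_strided only creates a view on the existing array (no data is copied), building the windows is cheap, and the loop below is typically much quicker than an "ordinary" Python loop that slices the DataFrame row by row (compare execution time with other solutions).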

I couldn't use rolling windows provided by Pandas, because they were created to compute some statistics, not for calling any user function on the whole content of the current window.

Then I call this function:

tbl = roll_win(df.values, 7)

Note that Numpy array must have a single element type, so this type is "generalized" to float because one source column is of float type.
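
A quick way to see this type "generalization" is to compare the column dtypes with the dtype of the extracted array. The two-row frame below is a hypothetical stand-in for the real data.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'x': [334, 335], 'y': [290, 291], 'z': [3350.0, 3350.5]})
print(sample.dtypes)        # x and y are integer columns, z is a float column
arr = sample.to_numpy()     # same content as sample.values
print(arr.dtype)            # float64
```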

Then we have preparation steps for a loop processing each rolling window:

res = []    # Result container
idx = 0     # Rolling window index

The rest of the program is the loop:

while idx < len(tbl):
    tt = tbl[idx]  # Get the current rolling window (2-D)
    r0 = tt[0]     # Row 0
    # Condition: row 0 must differ from each of rows 1, 3 and 5,
    # and x of row 3 < x of row 0 < x of row 2
    cond = not((r0 == tt[1]).all() or (r0 == tt[3]).all()\
        or (r0 == tt[5]).all()) and tt[0][0] < tt[2][0]\
        and tt[0][0] > tt[3][0]
    if cond:   # OK
        # print(idx)
        # print(tt)
        res.extend(tt)  # Add to result
        idx += 7        # Skip the current result
    else:      # Failed
        idx += 1        # Next loop for the next window

In the "positive" case, I decided to start the next loop from the row following the current match (idx += 7), to avoid partially overlapping sets of source rows. If you don't want this feature, add 1 to idx in both cases.
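
The two steps of the loop (testing a window, then deciding where to continue) can also be separated, which makes the overlap choice explicit. This is only a sketch: the names `window_mask`, `pick_starts` and `allow_overlap` are hypothetical, the windows are built with a plain `np.stack` to keep the block self-contained, and rows are treated as different when at least one value differs.

```python
import numpy as np

def window_mask(tbl):
    """Evaluate the five conditions for all windows at once (tbl is 3-D)."""
    r0 = tbl[:, 0, :]                       # first row of every window

    def differs(k):
        # Row 0 differs from row k if at least one value differs
        return (r0 != tbl[:, k, :]).any(axis=1)

    return (differs(1) & differs(3) & differs(5)
            & (r0[:, 0] < tbl[:, 2, 0]) & (r0[:, 0] > tbl[:, 3, 0]))

def pick_starts(mask, win, allow_overlap=False):
    """Walk the mask; after a match, jump by win unless overlaps are allowed."""
    starts, idx = [], 0
    while idx < len(mask):
        if mask[idx]:
            starts.append(idx)
            idx += 1 if allow_overlap else win
        else:
            idx += 1
    return starts

data = np.array(
    [[334, 290, 3350.0], [334, 291, 3350.5], [334, 292, 3360.1],
     [335, 292, 3360.1], [335, 292, 3360.1], [333, 290, 3351.0],
     [335, 290, 3352.5], [335, 291, 3333.1], [335, 291, 3333.1]])
win = 7
tbl = np.stack([data[i:i + win] for i in range(len(data) - win + 1)])
mask = window_mask(tbl)
starts = pick_starts(mask, win)
print(mask, starts)
```

With `allow_overlap=True` every matching window is kept, which corresponds to adding 1 to idx in both branches of the loop.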

For demonstration purposes, you can uncomment the test printouts above.

The only remaining thing is to create the target DataFrame, from rows collected in res:

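df2 = pd.DataFrame(data=res, columns=['x', 'y', 'z']).astype({'x': int, 'y': int})

Note that only the x and y columns are converted to int, because values in the z column have a fractional part. Passing dtype=int to the DataFrame constructor is not reliable here: recent pandas versions may raise an error instead of leaving the z column as float.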
