I have the following DataFrame (the real one has 1.2 million rows):

df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})

Now I am trying to find sequences. Each "beginn" should be matched with the first "end" whose distance from it, based on column B, is at least 40. For the provided DataFrame that would mean: (image of the expected grouping, not available)

Your help is highly appreciated.

4 Comments

  • What is your expected output? Indexes? What have you tried/investigated and why didn't that fulfill your requirements? Commented Oct 6, 2018 at 12:46
  • Is there only begin and end in the dataframe or something else? Should extra begin/end entries be ignored? Please post what you have tried already. Commented Oct 6, 2018 at 12:57
  • Group number 2 only has a distance of 20, something is wrong with your example. Commented Oct 6, 2018 at 12:59
  • To be honest, I am absolutely clueless. This is part of a project which is due today, and I have been working for around 12 h (programming the UI etc.), so my head is pretty empty. Commented Oct 6, 2018 at 13:00

1 Answer

I will assume that the output you want is a list of sequences, each with its starting and ending value. The second sequence identified in your picture has a distance lower than 40, so I also assumed that it was an error.

import pandas as pd
from collections import namedtuple
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})

sequence_list = []
Sequence = namedtuple('Sequence', ['beginn', 'end'])

# beginn_flag is True while a sequence is open and waiting for its "end"
beginn_flag = False
beginn_value = 0
for i, row in df_test_2.iterrows():
    state = row['A']
    value = row['B']

    if not beginn_flag and state == 'beginn':
        # Open a new sequence; further "beginn" rows are ignored until it closes
        beginn_flag = True
        beginn_value = value
    elif beginn_flag and state == 'end':
        # Close the sequence at the first "end" that is at least 40 away
        if value >= beginn_value + 40:
            new_seq = Sequence(beginn_value, value)
            sequence_list.append(new_seq)
            beginn_flag = False

print(sequence_list)

This code outputs the following:

[Sequence(beginn=10, end=50), Sequence(beginn=70, end=110)]

Two sequences, one starting at 10 and ending at 50 and the other one starting at 70 and ending at 110.
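
Since the real DataFrame has 1.2 million rows, the same matching rule can be sketched without iterrows, which builds a Series object for every row and tends to be slow on large frames. The version below is only an illustration under that assumption: the function name find_sequences and the min_distance parameter are hypothetical, and the columns are passed in as plain lists (for a DataFrame, something like df_test_2["A"].tolist() and df_test_2["B"].tolist()).

```python
from collections import namedtuple

Sequence = namedtuple("Sequence", ["beginn", "end"])


def find_sequences(states, values, min_distance=40):
    """Pair each open "beginn" with the first "end" at least min_distance away.

    Hypothetical helper: states and values are the A and B columns as lists.
    """
    sequences = []
    beginn_value = None  # None means no sequence is currently open
    for state, value in zip(states, values):
        if beginn_value is None:
            if state == "beginn":
                beginn_value = value  # open a new sequence
        elif state == "end" and value - beginn_value >= min_distance:
            sequences.append(Sequence(beginn_value, value))
            beginn_value = None  # close it and wait for the next "beginn"
    return sequences


# Same data as df_test_2, written as plain lists
col_a = ["end", "beginn", "end", "end", "beginn", "beginn",
         "end", "end", "end", "beginn", "end"]
col_b = [1, 10, 50, 60, 70, 80, 90, 100, 110, 111, 112]

result = find_sequences(col_a, col_b)
print(result)
```

The logic is unchanged from the loop above; only the iteration differs, so it should return the same two sequences for the sample data.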

3 Comments

That is amazing. Thanks a lot
Thanks a lot again. Besides the solution to my problem, I learned about a new type, "namedtuple".
Yeah, named tuples are great for creating a quick printable object: docs.python.org/3.6/library/…. Later on you can access its fields as attributes, like on a class instance. Example: new_seq = Sequence(beginn=10, end=50); print(new_seq.beginn)
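
To expand slightly on the namedtuple comment above, here is a small sketch of the ways a Sequence value can be read. It only uses the standard-library collections.namedtuple and reuses the Sequence definition from the answer.

```python
from collections import namedtuple

Sequence = namedtuple("Sequence", ["beginn", "end"])
new_seq = Sequence(beginn=10, end=50)

print(new_seq.beginn)     # access a field by name
print(new_seq[1])         # access a field by position, like a regular tuple
beginn, end = new_seq     # unpack like a regular tuple
print(new_seq._asdict())  # convert to a mapping of field name to value
```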
