
I have a dataframe that consists of one column and several rows. Each of these rows is constructed in the same way: -timestamp- value1 value2 value3 -timestamp- value4 value5 value6 ...

The timestamps have this format: YYYY-MM-DD HH:MM:SS and the values are numbers with 2 decimals. I would like to make a new dataframe that has each individual timestamp in one row and the related values in the next row.

I managed to get the expected result for a single line with a regex, but not for the entire dataframe.

My code so far:

#input dataframe
data.head()

                  values
0   2020-05-12 10:00:00 12.07 13 11.56 ... 2020-05-12 10:00:01 11.49 17 5.67...
1   2020-05-12 10:01:00 11.49 17 5.67 ... 2020-05-12 10:01:01 12.07 13 11.56...
2   2020-05-12 10:02:00 14.29 18 11.28 ... 2020-05-12 10:02:01 13.77 18 7.43...


import re
import pandas as pd

test = data['values'].iloc[0]  # first row of data
row1 = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(test)
df_row1 = pd.DataFrame(row1)

df_row1.head()

             values 
0   2020-05-12 10:00:00
1   12.07 13.79 15.45 17.17 18.91 14.91 12.35 14....
2   2020-05-12 10:00:01
3   12.48 13.96 13.88 15.57 18.46 15.0 13.65 14.6...

#trying the same for the entire dataframe 
for row in data:  # note: iterating a DataFrame yields column names, not rows
    df_new = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(row)

print(df_new)
['values']

My question now is how can I loop through the rows of my dataframe and get the expected result?


2 Answers


If you want to first split the lines and then extract the values into columns, note that you can use str.extract. With named groups in your regular expression, it will automatically assign the column names of your dataframe:

split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
extract_values = r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<time>\d{2}:\d{2}:\d{2})\s(?P<value_one>.*?)\s(?P<value_two>.*?)\s(?P<value_three>.*?)$"

df = pd.DataFrame([{
    "value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67", 
},{
    "value": "2020-05-13 10:00:00 14.07 13 15.56 2020-05-16 10:00:02 11.51 18 5.69", 
}])
df = df["value"].str.split(split_line).explode().str.extract(extract_values, expand=True)
print(df)
#          date      time value_one value_two value_three
# 0  2020-05-12  10:00:00     12.07        13       11.56
# 0  2020-06-12  11:00:00     13.07        16       11.16
# 0  2020-05-12  10:00:01     11.49        17        5.67
# 1  2020-05-13  10:00:00     14.07        13       15.56
# 1  2020-05-16  10:00:02     11.51        18        5.69

If you do not know the number of values after the date and time, use a plain split rather than named groups. I would suggest something like this:

split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"

df = pd.DataFrame([{
    "value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67", 
},{
    "value": "2020-05-13 10:00:00 14.07 13 14 15 15.56 2020-05-16 10:00:02 11.51 18 5.69", 
}])
df = df["value"].str.split(split_line).explode().reset_index()

df = df['value'].str.split(" ").apply(pd.Series)
df.columns = [f"col_{col}" for col in df.columns]
print(df)
#         col_0     col_1  col_2 col_3  col_4 col_5  col_6
# 0  2020-05-12  10:00:00  12.07    13  11.56   NaN    NaN
# 1  2020-06-12  11:00:00  13.07    16  11.16   NaN    NaN
# 2  2020-05-12  10:00:01  11.49    17   5.67   NaN    NaN
# 3  2020-05-13  10:00:00  14.07    13     14    15  15.56
# 4  2020-05-16  10:00:02  11.51    18   5.69   NaN    NaN

1 Comment

Thank you! This helped a lot! Your idea with the automatically assigned columns is very nice, but in my real dataframe I have over 2000 values that would need >2000 columns and writing (?P<value_X>.*?) 2000 times is not very efficient. Can you think of a shorter version? It would be okay if the column names are a number array, e.g. 0-2000?
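A minimal sketch addressing the comment (assuming the values are whitespace-separated and the split/explode step from the answer has already produced one timestamp-plus-values string per row): `str.split` with `expand=True` spreads the pieces into integer-labelled columns automatically, so no named capture groups are needed, no matter how many values there are.

```python
import pandas as pd

# Hypothetical input: one string per timestamp block, as produced by the
# split/explode step in the answer above.
s = pd.Series([
    "2020-05-12 10:00:00 12.07 13 11.56",
    "2020-05-12 10:00:01 11.49 17 5.67",
])

# expand=True spreads the split pieces into columns labelled 0..N,
# so thousands of values need no named capture groups.
df = s.str.split(" ", expand=True)

# Recombine the first two pieces into a proper timestamp column
# and drop the raw date/time string columns.
df.insert(0, "timestamp", pd.to_datetime(df[0] + " " + df[1]))
df = df.drop(columns=[0, 1])

# Relabel the value columns 0..N-1 and convert them to numbers.
df.columns = ["timestamp"] + list(range(df.shape[1] - 1))
df[df.columns[1:]] = df[df.columns[1:]].astype(float)
print(df)
```

Rows with fewer values than the longest row simply get NaN in the trailing columns, same as in the `apply(pd.Series)` variant above.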

You don't need to loop through the rows to get the result. Instead, you can use Series.str.split to split the given series around a delimiter, where the delimiter in this case is a regular expression. Then you can use DataFrame.explode to transform each element of a list-like into separate rows.

Use:

data["values"] = data["values"].str.split(r'\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})')
data = data.explode("values")
data["values"] = data["values"].str.split(r'(?<=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s+')
data = data.explode("values").reset_index(drop=True)

print(data)

The resulting dataframe data should look like:

        values
0   2020-05-12 10:00:00
1        12.07 13 11.56
2   2020-05-12 10:00:01
3         11.49 17 5.67
4   2020-05-12 10:01:00
5         11.49 17 5.67
6   2020-05-12 10:01:01
7        12.07 13 11.56
8   2020-05-12 10:02:00
9        14.29 18 11.28
10  2020-05-12 10:02:01
11        13.77 18 7.43

