Regex text to pandas dataframe

Question

I have a text file that contains multiple lines in the format given below:

real    0m0.020s
user    0m0.000s
sys 0m0.000s
Round  1  completed. with matrix size of  1200 x 1200 with threads 8

real    0m0.022s
user    0m0.000s
sys 0m0.001s
Round  2  completed. with matrix size of  1200 x 1200 with threads 8

There are about 500 entries of this sort (the above shows two of them). I can't figure out how to get them into a pandas dataframe that might look something like this:

Matrix Size    Threads    Round    Real    User    Sys
1200 x 1200    8          1        0.0020  0.0000  0.0000
1200 x 1200    8          2        0.0022  0.0000  0.0001

Is there a way, using regex or something else, to convert the test output into a dataframe? Additionally, I don't know whether I interpreted the times correctly: I think the 0m means 0 minutes and the 0.020s means 0.02 seconds.

Are there always two newlines between blocks that will each form a row of the dataframe?
– gmds, Commented Apr 25, 2019 at 1:11
I bet the time you spend asking this question and waiting for an answer is enough to write and run a simple for-loop solution on those 500 entries :-)
– Quang Hoang, Commented Apr 25, 2019 at 1:12
Yeah, each block will form a record and there are two newlines between them.
– user9996043, Commented Apr 25, 2019 at 1:12

gmds · Accepted Answer · 2019-04-25 01:45:36Z

3

You can use a regex:

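import re
import pandas as pd

# `data` is the whole file as one string; 'times.txt' is a placeholder file name
with open('times.txt') as f:
    data = f.read()

# \d+ (rather than \d) also matches multi-digit minutes and seconds
regex = re.compile(r'real +(\d+m\d+\.\d+s)\nuser +(\d+m\d+\.\d+s)\nsys +(\d+m\d+\.\d+s)\nRound +(\d+).+of +(\d+ x \d+).+threads (\d+)')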

df = pd.DataFrame(regex.findall(data), columns=['real', 'user', 'sys', 'round', 'matrix size', 'threads'])

print(df)

Output:

       real      user       sys round  matrix size threads
0  0m0.020s  0m0.000s  0m0.000s     1  1200 x 1200       8
1  0m0.022s  0m0.000s  0m0.001s     2  1200 x 1200       8
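As a variation on the regex above, here is a minimal sketch that uses named groups and converts the round and thread counts to integers. The names `BLOCK`, `parse_blocks`, and `matrix_size` are illustrative choices, not part of the original answer, and the sample text is copied from the question.

```python
import re

# Sample text copied from the question; in practice this would be read from the file.
data = """real    0m0.020s
user    0m0.000s
sys 0m0.000s
Round  1  completed. with matrix size of  1200 x 1200 with threads 8

real    0m0.022s
user    0m0.000s
sys 0m0.001s
Round  2  completed. with matrix size of  1200 x 1200 with threads 8"""

# Named groups document what each capture means; \s+ tolerates spaces or tabs.
BLOCK = re.compile(
    r"real\s+(?P<real>\S+)\n"
    r"user\s+(?P<user>\S+)\n"
    r"sys\s+(?P<sys>\S+)\n"
    r"Round\s+(?P<round>\d+).*?of\s+(?P<matrix_size>\d+ x \d+).*?threads\s+(?P<threads>\d+)"
)


def parse_blocks(text):
    """Return one dict per timing block, with round and threads as ints."""
    records = []
    for match in BLOCK.finditer(text):
        row = match.groupdict()
        row["round"] = int(row["round"])
        row["threads"] = int(row["threads"])
        records.append(row)
    return records


records = parse_blocks(data)
print(records[0])
```

Since `pd.DataFrame` accepts a list of dicts, `pd.DataFrame(parse_blocks(data))` should give one row per block with the group names as column labels.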


2 Comments

user9996043 Over a year ago

Is there a way I could convert 0m0.020s to (0*60) [from the m] + (0.020) [from the s]?

gmds Over a year ago

@user9996043 How about df['real'].str.replace('s', '').str.split('m').map(lambda t: float(t[0]) * 60 + float(t[1]))?
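The same minutes-to-seconds arithmetic can be sketched as a small standalone function, which may be easier to test than a chained one-liner. The name `to_seconds` and its error handling are assumptions for illustration; the format it expects is the one shown in the question.

```python
import re

# Matches strings such as "0m0.020s" or "12m3.5s" (the format shown in the question).
_TIME = re.compile(r"(?P<minutes>\d+)m(?P<seconds>\d+(?:\.\d+)?)s")


def to_seconds(value):
    """Convert an 'XmY.YYYs' string to a float number of seconds."""
    match = _TIME.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {value!r}")
    # minutes * 60 + seconds, as described in the comment above
    return int(match["minutes"]) * 60 + float(match["seconds"])


print(to_seconds("0m0.020s"))
print(to_seconds("1m2.500s"))
```

With a dataframe holding the raw strings, this could be applied per column, for example `df['real'].map(to_seconds)`.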

It_is_Chris · Answer · 2019-04-25 02:23:52Z

If you want to solve the problem using only pandas you can use str.split():

import pandas as pd

# data
s = """real    0m0.020s
user    0m0.000s
sys 0m0.000s
Round  1  completed. with matrix size of  1200 x 1200 with threads 8

real    0m0.022s
user    0m0.000s
sys 0m0.001s
Round  2  completed. with matrix size of  1200 x 1200 with threads 8"""

# split on blank lines to get one string per row, then split each row on the label text
df = pd.DataFrame(s.split('\n\n'))[0].str.split('   |real | with |user    |sys |matrix size of  |threads |\n')\
                                  .apply(lambda parts: [p for p in parts if p]).apply(pd.Series)

# column 3 holds e.g. 'Round  1  completed.'; keep only the round number
# (str.strip removes a set of characters, not a phrase, so extract the digits instead)
df[3] = df[3].str.extract(r'(\d+)', expand=False)

# rename columns
df.columns = ['real', 'user', 'sys', 'round', 'matrix size', 'threads']

Output:

       real      user       sys round  matrix size threads
0  0m0.020s  0m0.000s  0m0.000s     1  1200 x 1200       8
1  0m0.022s  0m0.000s  0m0.001s     2  1200 x 1200       8

Note that this will be slower than gmds' example:

1000 loops, best of 3: 4.42 ms per loop (str.split) vs 1000 loops, best of 3: 1.84 ms per loop (regex)
