How to split a row by specific string length in a dataframe in Python?

Question

I have a file like this:

system
1000
    1VEA      C    1   9.294  11.244  11.083
    1VEA     C1    2   9.324  11.375  11.161
    1VEA      H    3   9.243  11.396  11.232
...
 1203VEA    H2092601  20.738  16.293   7.837
 1203VEA    H2192602  20.900  16.225   7.869
 1203VEA    H2292603  20.822  16.330   7.989

I want to generate a dataframe with 6 columns. I used the following command to generate it:

    df = pd.read_csv('system.gro', skiprows=[0,1], delim_whitespace=True, header=None)

However, in the rows starting with 1203 there is no whitespace between the atom name (e.g. H20) and the index (e.g. 92601), so the command above cannot split them. I used to split each line at fixed positions, like this:

    with open(fileName, 'r') as f1:
        for line in f1:
            # Slice each field by its fixed character positions
            atomName = line[8:15].strip()
            globalIdx = int(line[15:20])

But this takes a really long time on the file. Does anyone have an idea how to handle this with a dataframe?

This looks more like a data quality issue or a problem with the settings used when exporting the file. Can't you ask for a file with an actual delimiter, for example | ?
– Erfan, Commented May 8, 2019 at 0:07
Instead of pd.read_csv, use pd.read_fwf. I am not sure how the .strip() would work, though.
– SRT HellKitty, Commented May 8, 2019 at 0:47

Asmus · Accepted Answer · 2019-05-08 06:42:35Z

As suggested by SRT HellKitty in the comments, use pd.read_fwf (see docs) like this:

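import io

import pandas as pd

data="""
   1VEA      C    1   9.294  11.244  11.083
   1VEA     C1    2   9.324  11.375  11.161
   1VEA      H    3   9.243  11.396  11.232
1203VEA    H2092601  20.738  16.293   7.837
1203VEA    H2192602  20.900  16.225   7.869
1203VEA    H2292603  20.822  16.330   7.989
"""

### make sure that the widths are correct!
# Each (start, end) pair is a half-open character range for one column.
# io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide.
# header=None keeps the first data row from being used as column names.
df=pd.read_fwf(io.StringIO(data),colspecs=[(0,8),(8,14),(14,20),(20,28),(28,36),(36,44)],header=None)
print(df)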
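
The same call can be pointed at the file from the question. The following is only a sketch under stated assumptions: the boundaries (8, 15) and (15, 20) are taken from the slices in the question (`line[8:15]`, `line[15:20]`), the remaining boundaries are read off the sample rows, and the column names and the `read_atoms` helper are hypothetical. A small in-memory sample stands in for `system.gro`.

```python
import io

import pandas as pd

# Stand-in for 'system.gro': two header lines, then fixed-width rows
# copied from the question (the leading spaces are significant).
sample = (
    "system\n"
    "1000\n"
    "    1VEA      C    1   9.294  11.244  11.083\n"
    "    1VEA     C1    2   9.324  11.375  11.161\n"
    " 1203VEA    H2092601  20.738  16.293   7.837\n"
    " 1203VEA    H2192602  20.900  16.225   7.869\n"
)

# Assumed boundaries: (8, 15) and (15, 20) come from the question's slices;
# the first field and the 8-character coordinate fields are read off the sample.
COLSPECS = [(0, 8), (8, 15), (15, 20), (20, 28), (28, 36), (36, 44)]

# Hypothetical column names, chosen only for illustration.
NAMES = ["residue", "atomName", "globalIdx", "x", "y", "z"]


def read_atoms(source):
    """Read fixed-width atom rows from a path or a file-like object."""
    return pd.read_fwf(
        source,
        colspecs=COLSPECS,
        skiprows=2,      # skip the two header lines, as in the question
        header=None,     # the data rows carry no column names
        names=NAMES,
    )


df = read_atoms(io.StringIO(sample))
print(df)
```

For the real file, `read_atoms('system.gro')` would be the equivalent call. Because the fields here are contiguous, passing `widths=[8, 7, 5, 8, 8, 8]` instead of `colspecs` should describe the same layout.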

edited May 8, 2019 at 6:42

answered May 8, 2019 at 6:35

Asmus


2 Comments

Asmus Over a year ago

@jezrael Thanks, I wasn't aware of that, as I usually seldom read from string :) and I've updated my answer accordingly.

William Huang Over a year ago

This is exactly what I want! I really appreciate the answers!
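
If the lines are already in memory, the position-based slicing from the question can also be done on a whole column at once with pandas string methods instead of a Python loop. This is a sketch of that alternative, not part of the accepted answer; the slice positions are the ones from the question, and the variable names are only illustrative.

```python
import pandas as pd

# Two raw rows copied from the question (leading spaces are significant).
raw_rows = [
    "    1VEA      C    1   9.294  11.244  11.083",
    " 1203VEA    H2092601  20.738  16.293   7.837",
]

lines = pd.Series(raw_rows)

# Slice every row at the same character positions in one vectorized step.
atoms = pd.DataFrame({
    "atomName": lines.str[8:15].str.strip(),
    "globalIdx": pd.to_numeric(lines.str[15:20].str.strip()),
})
print(atoms)
```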
