Pandas Read CSV with string delimiters via regex

Question

I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:

LOADED LANE       1   MAT. TYPE=    2    LEFFECT=    1    SPAN=  200.    SPACE=   10.    BETA=   3.474 LOADEFFECT 5075.    LMAX= 3643.    COV=  .13
LOADED LANE       1   MAT. TYPE=    3    LEFFECT=    1    SPAN=  200.    SPACE=   10.    BETA=   3.515 LOADEFFECT10009.    LMAX= 9732.    COV=  .08

First I tried the following:

df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])

This seemed to work fine, however got messed up when it hit the above example line, where there is no whitespace after the LOADEFFECT string (you may need to scroll a bit right to see it in the example). I got a result like:

632   1   2   1  200  10  3.474  5075.  3643.  0.13
633   1   3   1  200  10  3.515  LMAX=   COV=   NaN

Then I decided to use a regular expression to define my delimiters. After many trial and error runs (I am no expert in regex), I managed to get close with the following line:

df = pd.read_csv('beta.txt', header=None, sep='/s +|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')

This almost works, but creates a NaN column for some reason at the very beginning:

632 NaN  1  2  1  200  10  3.474   5075  3643  0.13
633 NaN  1  3  1  200  10  3.515  10009  9732  0.08

At this point I think I can just delete that first column, and get away with it. However I wonder what would be the correct way to set up the regex to correctly parse this text file in one shot. Any ideas? Other than that, I am sure there is a smarter way to parse this text file. I would be glad to hear your recommendations.

Thanks!

That looks more like a fixed-width format to me. If you draw vertical lines in your file, does the data stay in the right columns? — DSM
– DSM, Commented Apr 22, 2015 at 16:49
read_fwf like DSM has mentioned can read this if the format is fixed-width, it accepts a width param where you can pass a list of each column width — EdChum
– EdChum, Commented Apr 22, 2015 at 16:50
@EdChum I didn't know about read_fwf I think that will fix my problem! — marillion
– marillion, Commented Apr 22, 2015 at 16:59
Also not sure but your initial issue may be related to line endings? "\r\n" vs "\n" — dartdog
– dartdog, Commented Apr 22, 2015 at 17:09

BongoClue · Accepted Answer · 2016-02-20 06:48:08Z

1

import re
import pandas as pd
import csv
csvfile = open("parsing.txt") #open text file
reader = csv.reader(csvfile)
new_list=[]
for line in reader:
    for i in line:
        new_list.append(re.findall(r'(\d*\.\d+|\d+)', i))

table = pd.DataFrame(new_list)
table # output will be pandas DataFrame with values

answered Feb 20, 2016 at 6:48

BongoClue

234 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Pandas Read CSV with string delimiters via regex

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related