Parse output to dataframe

Question

I need to parse the output of a BOT and convert it to a table format. The raw file is linked below, followed by a sample of its contents:

https://www.dropbox.com/s/ab7sdl74krwltrd/raw_file.txt?dl=0

Invoice: 13613289,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 5: NM1~40~2~GEHC~~~~~46~GEHC,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 32: CLM~03X13613289-7~1968~~~24>B>1~Y~A~Y~Y,
HEAD Segment 40: REF~BDE~229~BROWARD COUNTY   AABC~219:BRD,
HEAD Segment 51: HI~ABK>N471,
HEAD Segment 54: HI~BP>54150,
"HEAD Segment 58: NM1~77~2~OUTPATIENT SURGICAL SERVICES, LTD~~~~~XX~1427012202",
HEAD Segment 60: N4~PLANTATION~FL~333241811,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00920>QZ>P2>X4>>~1968~MJ~65~~~1,
BODY Segment 3: DTP~472~D8~20210406,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828170,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 8: NM1~85~2~JLR MEDICAL GROUP~~~~~XX~1518910520,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 57: PRV~PE~PXC~207L00000X,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QK>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828171,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 6: HL~1~~20~1,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 41: REF~DDE~6~JLR MEDICAL GROUP ANESTHESIA~200:JLRAA,
HEAD Segment 57: PRV~PE~PXC~367H00000X,
HEAD Segment 70: OI~~~Y~B~~Y,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QX>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,

Each block has an Invoice followed by a variable set of HEAD Segments (numbered 1 to 100) and BODY Segments (numbered 1 to 10). My idea is to first create an empty pandas DataFrame with the columns Invoice, HEAD Segment 1, HEAD Segment 2, ..., HEAD Segment 100, BODY Segment 1, BODY Segment 2, ..., BODY Segment 10, then build a list of the parsed elements for each invoice and map the list values to the DataFrame by column name. Finally, I would remove any columns that contain only null values. However, I am having a hard time mapping the list values to the DataFrame columns and inserting them. This is what I have so far:

import pandas as pd

filename = 'Z:\\Projects\\Compliance BOT\\raw_file.txt'

a_list = ['Invoice']  
a_list.extend(['HEAD Segment '+ str(x) for x in range(1,100 + 1)])    
a_list.extend(['BODY Segment '+ str(x) for x in range(1,10 + 1)])   
df = pd.DataFrame(columns = a_list)

l = []
with open(filename, 'r', encoding='utf-8-sig') as input:
    for line in input:
        print(line.strip())
        line = line.strip()
        if line.startswith('"') and line.endswith('",'):
            line = line[1:-1]
        l.append(line.strip())        
        if '------------------------,' in l: l.remove('------------------------,')
        if ',' in l: l.remove(',')
        # l = [x[:-1] for x in l]
        l1 = [x[:-1] for x in l]
        if line.strip() == ',':
            break  
print(l)

The expected output should be like this

https://www.dropbox.com/scl/fi/j7mpwy84ne37mtyaprwp3/output.xlsx?dl=0&rlkey=lluo9lsc2rgcd4lxepwk0hb1u

Also, please let me know if I am overcomplicating this and whether there is a simpler way to do it.

Upload "Completed_07202021.csv" and paste the link here. Censor sensitive data if necessary before uploading. Also, show your expected output in the body of your SO post.
– Joshua, Commented Jul 31, 2021 at 17:04
I uploaded the raw input file and the expected output. Thanks!!
– Phani, Commented Jul 31, 2021 at 17:43

Tejas · Accepted Answer · 2021-07-31 18:53:09Z

Try this code

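import pandas as pd

filename = 'Completed_07202021.csv'

# Full column list: Invoice, HEAD Segment 1..100, BODY Segment 1..10
a_list = ['Invoice']
a_list.extend(['HEAD Segment ' + str(x) for x in range(1, 100 + 1)])
a_list.extend(['BODY Segment ' + str(x) for x in range(1, 10 + 1)])
df = pd.DataFrame(columns=a_list)

l = []
l.append(df)  # the empty frame fixes the column order of the result
out = {}      # key -> [value] for the invoice currently being read
with open(filename, 'r', encoding='utf-8-sig') as f:
    for line in f:
        line = line.strip()
        if line == ',':
            continue
        # A line whose value contains a comma is quoted: "key: value",
        # Drop the opening quote and the closing '",' together.
        if line.startswith('"') and line.endswith('",'):
            line = line[1:-2]
        if line.rstrip(',') == 'END':
            # End of an invoice block: store it as a one-row DataFrame
            l.append(pd.DataFrame.from_dict(out))
            out = {}
            continue
        # maxsplit=1 keeps values that contain ':' intact (e.g. HEAD Segment 40)
        data = line.split(':', 1)
        if len(data) != 2:
            continue  # separator lines such as '-----' have no key
        value = data[1].strip()
        out[data[0].strip()] = [value[:-1].strip() if value.endswith(',') else value]

df_out = pd.concat(l, ignore_index=True)
df_out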

Let me know if you have any issues with this code.
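As an alternative, here is a minimal sketch that collects one plain dict per invoice and builds the DataFrame once, instead of concatenating one-row frames. The function name `parse_invoices` and the inline `sample` text are illustrative assumptions, not part of the original answer; the sample lines are copied from the question.

```python
import io
import pandas as pd

def parse_invoices(lines):
    """Turn 'key: value,' lines into one dict per invoice (blocks end at END)."""
    records, current = [], {}
    for raw in lines:
        line = raw.strip()
        if line.startswith('"') and line.endswith('",'):
            line = line[1:-2]      # quoted pair: drop the quotes and the comma
        elif line.endswith(','):
            line = line[:-1]       # plain line: drop the trailing comma
        if line == 'END':
            if current:
                records.append(current)
            current = {}
            continue
        # partition splits at the first ':' only, so values may contain ':'
        key, sep, value = line.partition(':')
        if not sep:
            continue               # dashed separators and empty lines
        current[key.strip()] = value.strip()
    return records

# Hypothetical in-memory stand-in for the raw file.
sample = io.StringIO(
    'Invoice: 13613289,\n'
    '------------------------,\n'
    'HEAD Segment 1: ST~ST~~005010X222A1,\n'
    'HEAD Segment 40: REF~BDE~229~BROWARD COUNTY   AABC~219:BRD,\n'
    '"HEAD Segment 58: NM1~77~2~OUTPATIENT SURGICAL SERVICES, LTD~~~~~XX~1427012202",\n'
    'BODY Segment 1: LX~1,\n'
    ',\n'
    'END,\n'
    'Invoice: 13828170,\n'
    '------------------------,\n'
    'HEAD Segment 1: ST~ST~~005010X222A1,\n'
    'BODY Segment 1: LX~1,\n'
    ',\n'
    'END,\n'
)

records = parse_invoices(sample)
# Only keys that were actually seen become columns; gaps are filled with NaN.
df_records = pd.DataFrame(records)
print(df_records)
```

For a real file, the same function should accept the open file object directly, for example `parse_invoices(open(filename, encoding='utf-8-sig'))`.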

edited Jul 31, 2021 at 18:53

answered Jul 31, 2021 at 17:37

Tejas


5 Comments

Phani Over a year ago

Thank you so much Tejas. You made it so simple.

Tejas Over a year ago

Just a suggestion: since you mentioned that you want to remove the columns that contain only NaN values, one simple way is to remove the line df = pd.DataFrame(columns = a_list), along with the l.append(df) line that uses it. Because we are concatenating the per-invoice DataFrames, a column is added to the master DataFrame only if some invoice has a value for it, so you don't need a separate step to remove the NaN columns.

Tejas Over a year ago

Let me know if you want me to update the code for that.
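A small sketch of that suggestion, using two hand-written one-row frames as stand-ins for parsed invoices (the variable names and the shortened `template` list are assumptions for illustration). It shows that `pd.concat` only produces columns that occur in some frame, and that `dropna(axis=1, how='all')` removes empty columns if a full template was used anyway.

```python
import pandas as pd

# Two one-row frames with different columns, like two parsed invoices.
first = pd.DataFrame({'Invoice': ['13613289'],
                      'HEAD Segment 5': ['NM1~40~2~GEHC~~~~~46~GEHC']})
second = pd.DataFrame({'Invoice': ['13828170'],
                       'HEAD Segment 8': ['NM1~85~2~JLR MEDICAL GROUP~~~~~XX~1518910520']})

# concat aligns on column names; only columns present in some frame appear.
combined = pd.concat([first, second], ignore_index=True)
print(list(combined.columns))

# If a template with every possible column was used, drop the empty ones afterwards.
template = ['Invoice', 'HEAD Segment 5', 'HEAD Segment 6', 'HEAD Segment 8']
wide = combined.reindex(columns=template)       # 'HEAD Segment 6' is all NaN
trimmed = wide.dropna(axis=1, how='all')
print(list(trimmed.columns))
```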

Phani Over a year ago

One issue I have is with lines that contain more than one ":", for example HEAD Segment 40 in the first invoice. How can I fix that?

Phani Over a year ago

data = line.strip().split(':', 1) did the trick; it sets maxsplit to 1, so the line is split only at the first ":".
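To illustrate the difference on the HEAD Segment 40 line from the question, here is a short standalone sketch (the variable names are only for illustration):

```python
line = 'HEAD Segment 40: REF~BDE~229~BROWARD COUNTY   AABC~219:BRD,'

# A plain split breaks on every ':' and yields three parts for this line,
# which is why a check for exactly two parts skips it.
parts_all = line.split(':')

# maxsplit=1 splits only at the first ':', so the value stays in one piece.
key, value = line.split(':', 1)
value = value.strip().rstrip(',')

print(len(parts_all))
print(key)
print(value)
```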
