0

I need to parse a BOT output and convert to a table format. Below is the link to the raw file and also it looks like this

https://www.dropbox.com/s/ab7sdl74krwltrd/raw_file.txt?dl=0

Invoice: 13613289,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 5: NM1~40~2~GEHC~~~~~46~GEHC,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 32: CLM~03X13613289-7~1968~~~24>B>1~Y~A~Y~Y,
HEAD Segment 40: REF~BDE~229~BROWARD COUNTY   AABC~219:BRD,
HEAD Segment 51: HI~ABK>N471,
HEAD Segment 54: HI~BP>54150,
"HEAD Segment 58: NM1~77~2~OUTPATIENT SURGICAL SERVICES, LTD~~~~~XX~1427012202",
HEAD Segment 60: N4~PLANTATION~FL~333241811,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00920>QZ>P2>X4>>~1968~MJ~65~~~1,
BODY Segment 3: DTP~472~D8~20210406,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828170,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 8: NM1~85~2~JLR MEDICAL GROUP~~~~~XX~1518910520,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 57: PRV~PE~PXC~207L00000X,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QK>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828171,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 6: HL~1~~20~1,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 41: REF~DDE~6~JLR MEDICAL GROUP ANESTHESIA~200:JLRAA,
HEAD Segment 57: PRV~PE~PXC~367H00000X,
HEAD Segment 70: OI~~~Y~B~~Y,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QX>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,

It has an Invoice and then variable HEAD Segments from 1 to 100 and variable BODY Segments from 1 to 10. My thought is first create an empty pandas dataframe with columns Invoice, HEAD Segment 1, HEAD Segment 2,..... HEAD Segment 100, BODY Segment 1, BODY Segment 2,.....BODY Segment 10 and create a list for each parsed element and then map the list values based on column name of df. Lastly remove any df columns that have all null values. But I am having hard time mapping the content of list value based on df column name and inserting to df. I got it so far.

import pandas as pd

filename = 'Z:\\Projects\\Compliance BOT\\raw_file.txt'

a_list = ['Invoice']  
a_list.extend(['HEAD Segment '+ str(x) for x in range(1,100 + 1)])    
a_list.extend(['BODY Segment '+ str(x) for x in range(1,10 + 1)])   
df = pd.DataFrame(columns = a_list)

l = []
with open(filename, 'r', encoding='utf-8-sig') as input:
    for line in input:
        print(line.strip())
        line = line.strip()
        if line.startswith('"') and line.endswith('",'):
            line = line[1:-1]
        l.append(line.strip())        
        if '------------------------,' in l: l.remove('------------------------,')
        if ',' in l: l.remove(',')
        # l = [x[:-1] for x in l]
        l1 = [x[:-1] for x in l]
        if line.strip() == ',':
            break  
print(l)

The expected output should be like this

https://www.dropbox.com/scl/fi/j7mpwy84ne37mtyaprwp3/output.xlsx?dl=0&rlkey=lluo9lsc2rgcd4lxepwk0hb1u

Also please let me know if I am complicating this and there is any other simple way to do it?

2
  • upload "Completed_07202021.csv" and paste link here. Censor sensitive data if necessary before uploading. also, show your expected output here in your SO post body. Commented Jul 31, 2021 at 17:04
  • I uploaded the input raw file and also expected output. Thanks!! Commented Jul 31, 2021 at 17:43

1 Answer 1

1

Try this code

import pandas as pd

filename = 'Completed_07202021.csv'

a_list = ['Invoice']  
a_list.extend(['HEAD Segment '+ str(x) for x in range(1,100 + 1)])    
a_list.extend(['BODY Segment '+ str(x) for x in range(1,10 + 1)])   
df = pd.DataFrame(columns = a_list)

l = []
l.append(df)
out={}
with open(filename, 'r', encoding='utf-8-sig') as input:
    for line in input:
        print(line)
        if line.strip() == ',':
            continue  
        line = line.strip()
        if line.startswith('"') and line.endswith('",'):
            line = line[1:-1]
        if 'END,' in line:
            l.append(pd.DataFrame.from_dict(out))
            out={}
        data = line.strip().split(':')
        if len(data)!=2:
            continue
        out_str = ':'.join(data[1:])
        out[data[0].strip()]=[out_str[:-1].strip() if out_str.endswith(',') else out_str.strip()]
        
df_out = pd.concat(l)
df_out

let me know if you have any issue with this code

Sign up to request clarification or add additional context in comments.

5 Comments

Thank you so much Tejas. You made it so simple.
Just a suggestion, as you mentioned in problem that you want to remove the columns which have all NAN values, one simple way is, remove this line df = pd.DataFrame(columns = a_list), we are concating the dataframe so if the column value is available then only that column is added in master dataframe, so you dont need to do remove NAN stage again
let me know if you want to update the code for that
One issue I have is when we have multiple ":" for ex HEAD Segment 40 for the first invoice. How can I fix that?
data = line.strip().split(':',1) did the trick for setting maxsplit = 1

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.