Parse complex text file for data analysis in Python

Question

I am a complete novice to Python, and to programming in general.

I have a text file to parse into a CSV. I am not able to provide an example of the text file at this time.

The text is several (thousand) lines with no carriage returns.
There are 4 types of records in the file (A, B, C, or I).
Each record type has a specific format based on the size of the data element.
There are no delimiters.
Immediately after the last data element in the record type, the next record type appears.
I have been trying to translate from a different language what this might look like in Python.

Here is an example of what I've written (not correct format)

file=open('TestPython.txt'), 'r' # from current working directory
dataString=file.read()
data=()
i=0
while i < len(dataString):
i = i+2
    curChar = dataString(i)
    # Need some help on the next line var curChar = dataString[i]

    if curChar = "A"
        NPI = dataString(i+1, 16) # Need to verify that is how it is done in python inside ()
            NPI.strip()
        PCN = datastring(i+17, 40)
            PCN.strip()
        seqNo = dataString(i+41, 42)
            seqNo.strip()
        MRN = dataString(i+43, 66)
            MRN.strip()
    if curChar = "B"
        NPI = dataString(i+1, 16) # Need to verify that is how it is done in python inside ()
            NPI.strip()
        PCN = datastring(i+17, 40)
            PCN.strip()
        seqNo = dataString(i+41, 42)
            seqNo.strip()
        RC1 = (i+43, 46)
            RC1.strip()
        RC2 = (i+47, 50)
            RC2.strip() 
        RC3 = (i+51, 54)
            RC3.strip()
    if curChar = "C"
        NPI = dataString(i+1, 16) # Need to verify that is how it is done in python inside ()
            NPI.strip()
        PCN = datastring(i+17, 40)
            PCN.strip()
        seqNo = dataString(i+41, 42)
            seqNo.strip()
        DXVer = (i=43, 43)
            DXVer.strip()
        AdmitDX = (i+44, 50)
            AdmitDX.strip()
        RVisit1 = (i+51, 57)
            RVisit1.strip()

Here's a dummied-up version of a piece of the text file.

A 63489564696474677 9845687 777 67834717467764674 TUANU TINBUNIU 47 ERTYNU TDFGH UU748897764 66762589668777486U6764467467774767 7123609989 9 O
B 79466945684634677 676756787344786474634890 7746.66 7 96 4 7 7 9 7 774666 44969 494 7994 99666 77478 767766
B 098765477 64697666966667 9 99 87966 47798 797499
C 63489564696474677 6747494 7494 7497 4964 4976 N7469 4769 N9784 9677
I 79466944696474677 677769U6 8888 67764674
A 79466945684634677 6767994 777 696789989 6464467464764674 UIIUN UITTI 7747 NUU 9 ATU 4 UANU OSASDF NU67479 66567896667697487U6464467476777967 7699969978 7699969978 9 O

As you can see, there can be several of each type in the file. The way this example pastes, it looks like the type is the first character on a line. This is not the case in the actual file (I made this sample in Word).

You need to provide at least some kind of abstraction of the format or the question becomes unanswerable.
– root, Commented Jan 24, 2013 at 16:18
If I read this correctly, you start by pumping the entire file into a string. This is a bit wild; you should only read little bits into memory and process them.
– flup, Commented Jan 24, 2013 at 16:22
You could possibly try Python's csv module: docs.python.org/2/library/csv.html, which may allow you to read in the data in one line...
– Alex, Commented Jan 24, 2013 at 16:22
@flup: It's not a CSV file yet. It seems to be a stream of fixed-width datasets that he wants to convert into a new CSV file.
– Tim Pietzcker, Commented Jan 24, 2013 at 16:29
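
On the syntax the question asks about ("how it is done in python inside ()"): indexing and substrings use square brackets, not parentheses, and a slice is written [start:stop] with the stop position excluded. A minimal sketch, assuming a made-up sample string and made-up field widths (neither comes from the real file):

```python
# Hypothetical sample: a type letter, a 16-character field, an 8-character field.
data_string = "A" + "1234567890123456" + "PCN-01  "

i = 0
cur_char = data_string[i]                  # indexing uses square brackets
npi = data_string[i + 1:i + 17]            # slice [start:stop]; stop is exclusive
pcn = data_string[i + 17:i + 25].strip()   # strip() returns a new string, so assign it

print(cur_char, npi, pcn)
```

Note that a bare NPI.strip() on its own line does nothing useful, because strings are immutable; the stripped value has to be assigned to a name.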

seandavi · Accepted Answer · 2013-01-24 16:35:47Z

2

You might take a look at pyparsing.

1 Comment

flup Over a year ago

True, but it is not exactly beginner stuff.

flup · Accepted Answer · 2013-01-24 16:35:41Z

0

You'd better process the file as you read it.

First, do a file.read(1) to determine which type of record is up next.

Then, depending on the type, read the fields, which if I understand you correctly are fixed width. So for type 'A' this would look like this:

def processA (file):
    NPI = file.read(16).strip()  #assuming the NPI is 16 bytes long 
    PCN = file.read(23).strip()  #assuming the PCN is 23 bytes long
    seqNo = file.read(1).strip() #assuming seqNo is 1 byte long
    MRN = file.read(23).strip()  #assuming MRN is 23 bytes long
    return {"NPI":NPI,"PCN":PCN, "seqNo":seqNo, "MRN":MRN}

If the file is not ASCII, there's a bit more work to get the encoding right and read characters instead of bytes.
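
To show how the steps above might fit together, here is a rough sketch that reads one type character, looks up a field layout for that type, reads each fixed-width field, and writes the result as CSV. The LAYOUTS table, the read_records and write_csv helpers, and all widths are assumptions for illustration: the "A" widths repeat the guesses in processA, and the "B" layout is invented. The real widths have to come from the file specification.

```python
import csv
import io

# Hypothetical layouts: (field name, width) pairs per record type.
# "A" repeats the widths assumed in processA; "B" is invented.
LAYOUTS = {
    "A": [("NPI", 16), ("PCN", 23), ("seqNo", 1), ("MRN", 23)],
    "B": [("NPI", 16), ("PCN", 23), ("seqNo", 1), ("RC1", 4)],
}


def read_records(stream):
    """Yield one dict per record, reading only what each record needs."""
    while True:
        record_type = stream.read(1)
        if record_type == "":            # empty string means end of file
            break
        if record_type not in LAYOUTS:
            raise ValueError("unknown record type: %r" % record_type)
        record = {"type": record_type}
        for name, width in LAYOUTS[record_type]:
            record[name] = stream.read(width).strip()
        yield record


def write_csv(records, out):
    """Write all records to one CSV; fields a type lacks are left empty."""
    fieldnames = ["type", "NPI", "PCN", "seqNo", "MRN", "RC1"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Made-up input: one "A" record followed directly by one "B" record.
sample = (
    "A" + "111".ljust(16) + "P1".ljust(23) + "1" + "M1".ljust(23)
    + "B" + "222".ljust(16) + "P2".ljust(23) + "2" + "RC01"
)

records = list(read_records(io.StringIO(sample)))
out = io.StringIO()
write_csv(records, out)
csv_lines = out.getvalue().splitlines()
print(csv_lines)
```

With a real file, io.StringIO(sample) would be replaced by an opened file object. Keeping the layouts in a table means adding the "C" and "I" types is a matter of adding entries rather than writing a new function for each.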

3 Comments

JSamp Over a year ago

Thanks. Yes, I was trying to turn the entire file into one string... Wild wasn't what I was going for. The def processA (file): looks like something to try.

JSamp Over a year ago

When I do the file.read(1) I get an attribute error: 'tuple' object has no attribute 'read'. Is this because the text file is not ASCII? Or do I need to hit the tutorials again (I will anyway).

flup Over a year ago

No, there's a little typo when you open the file. It ought to read file=open('TestPython.txt', 'r'). Note the closing bracket. Your statement generates a tuple containing the open file and the 'r'.
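
The difference between the two spellings can be checked directly. A small sketch, using a temporary file as a stand-in for TestPython.txt (the temporary file and the variable names are only for illustration):

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("A123")

# Misplaced parenthesis: this builds a tuple (file object, 'r').
wrong = open(path), 'r'
is_tuple = isinstance(wrong, tuple)
wrong[0].close()                 # the file object is the tuple's first element

# Correct call: the mode is the second argument to open().
with open(path, 'r') as f:
    first_char = f.read(1)

os.remove(path)
print(is_tuple, first_char)
```

Calling wrong.read(1) would raise the AttributeError quoted above, because the tuple itself has no read method.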
