I am parsing an ISI file with a few hundred records that each begin with a 'PT J' tag and end with an 'ER' tag. I am trying to pull the tagged info from each record within a nested loop but keep getting an IndexError. I know why I am getting it, but does anyone have a better way of identifying the start of a new record than checking the first few characters?

    while file:
        while line[1] + line[2] + line[3] + line[4] != 'PT J':
            # ...
            # search through and record data from tags
            # ...

I use this same method to identify the tags and therefore occasionally hit the same problem there, so if you have any suggestions for that as well I would greatly appreciate it!

Sample data, which you'll notice does not always include every tag for each record, is:

    PT J
    AF Bob Smith
    TI Python For Dummies
    DT July 4, 2012
    ER

    PT J
    TI Django for Dummies
    DT 4/14/2012
    ER

    PT J
    AF Jim Brown
    TI StackOverflow
    ER
1
  • I would like to point out that I am converting this to a .txt as well before reading it. Commented Jul 6, 2012 at 2:47

3 Answers

3
with open('data1.txt') as f:
    for line in f:
        if line.strip() == 'PT J':
            # start of a record: keep reading from the same file iterator
            for line in f:
                if line.strip() != 'ER' and line.strip():
                    # do something with the data
                    pass
                elif line.strip() == 'ER':
                    # this record ends here; move on to the next record
                    break

1 Comment

I think I see what's going on here. However, how would I access different lines to manipulate or test them? Since line is acting as an iterator, we can't write something like line = file.readline() inside the nested 'if' statement. What would replace line = file.readline() so that I can get to specific lines? I ask because in some instances there are multiple entities per tag.
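On the question in this comment: a file object is its own iterator, so calling `next(f, None)` inside the loop plays the same role as `file.readline()` and returns `None` at the end of the file. Below is a minimal sketch of that idea, not a definitive implementation. The `read_record` helper is hypothetical, and `io.StringIO` stands in for the real file so the example runs on its own.

```python
import io

def read_record(f):
    """Collect the lines of one record, up to (not including) 'ER'.

    Hypothetical helper: assumes the 'PT J' line was just consumed from f.
    """
    lines = []
    while True:
        line = next(f, None)  # same role as f.readline(); None at end of file
        if line is None or line.strip() == 'ER':
            return lines
        if line.strip():      # skip blank lines
            lines.append(line.rstrip('\n'))

# Stand-in for open('data1.txt'); the text mirrors the sample data.
sample = io.StringIO(
    "PT J\nAF Bob Smith\nTI Python For Dummies\nER\n"
    "\n"
    "PT J\nTI Django for Dummies\nER\n"
)

records = []
for line in sample:
    if line.strip() == 'PT J':
        records.append(read_record(sample))

print(records)
```

Because each record ends up as a plain list, individual lines can then be reached by index or tested one at a time.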
2

Do the 'ER' lines only contain 'ER'? That would be why you're getting IndexErrors, because line[4] doesn't exist.

The first thing to try would be:

while not line.startswith('PT J'):

instead of your existing while loop.

Also, slices:

line[1] + line[2] + line[3] + line[4] == line[1:5] 

(The ends of slices are noninclusive)
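A small sketch of why these two suggestions avoid the IndexError: indexing past the end of a string raises, while slicing and `startswith` just return a shorter string or `False`. Note also that strings are indexed from 0, so the first four characters of a line are `line[0:4]`, not `line[1:5]`. The variable names here are only for illustration.

```python
line = 'ER'  # a short line, like the end-of-record tag

# Indexing past the end of the string raises IndexError ...
try:
    line[4]
    indexing_failed = False
except IndexError:
    indexing_failed = True

# ... but a slice that runs past the end is simply shorter,
# and startswith() returns False instead of raising.
short_slice = line[1:5]
is_start = line.startswith('PT J')

# Indexing starts at 0, so the first four characters are [0:4].
first_four = 'PT J\n'[0:4]

print(indexing_failed, repr(short_slice), is_start, repr(first_four))
```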

2 Comments

Yes, 'ER' (End of Record) lines typically do not contain anything else, not even trailing spaces.
I like your suggestion...I will have to play more with it.
0

You could try an approach like this to read through your file.

with open('data.txt') as f:
    for line in f:
        line = line.split()  # splits your line into a list of character sequences
                             # separated based on whitespace (blanks, tabs)
        llen = len(line)
        if llen == 2 and line[0] == 'PT' and line[1] == 'J':  # found start of record
            # process
            # examine line[0] for 'tags', such as "AF", "TI", "DT" and proceed
            # as dictated by your needs.
            pass

        if llen > 1 and line[0] == "AF":  # the name tokens are in line[1:]
            # The data will be on the same line and
            # accessible via the correct index values.
            pass

        if llen == 1 and line[0] == 'ER':  # found end of record
            pass

This definitely needs more "programming logic" (most likely nested loops or, better yet, calls to functions) to put everything in the right order, but the basic constructs are there, and I hope they get you started and give you some ideas.
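One way that extra logic could look, as a sketch under stated assumptions rather than a finished parser: a generator function that yields one dict per record, mapping each tag to a list of values. The `parse_records` name is hypothetical. The code assumes that an indented line carries another value for the previous tag, which may not match your file, and the second author line in the sample is invented to illustrate that case.

```python
import io

def parse_records(lines):
    """Yield one dict per record, mapping each tag to a list of its values."""
    record = None  # None means we are currently outside any record
    tag = None     # most recent tag seen in the current record
    for raw in lines:
        line = raw.rstrip()  # drop the newline and trailing spaces
        if not line:
            continue                       # blank line between records
        if line.startswith('PT J'):
            record, tag = {}, None         # start of a new record
        elif record is None:
            continue                       # ignore text outside a record
        elif line == 'ER':
            yield record                   # end of the current record
            record, tag = None, None
        elif line[0].isspace():
            # Assumption: an indented line is another value for the last tag.
            if tag is not None:
                record[tag].append(line.strip())
        else:
            tag, _, value = line.partition(' ')
            record.setdefault(tag, []).append(value.strip())

# Stand-in for open('data.txt'); "   Jim Brown" is an invented continuation.
sample = io.StringIO(
    "PT J\n"
    "AF Bob Smith\n"
    "   Jim Brown\n"
    "TI Python For Dummies\n"
    "ER\n"
    "\n"
    "PT J\n"
    "TI Django for Dummies\n"
    "DT 4/14/2012\n"
    "ER\n"
)

records = list(parse_records(sample))
for rec in records:
    print(rec.get('TI'), rec.get('AF', []))
```

Using `rec.get('AF', [])` keeps the lookup safe for records that lack a tag, which the sample data shows can happen.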
