0

I am getting stuck trying to figure out an efficient way to parse some plaintext that is structured with indents (from a word doc). Example (note: indentation below not rendering on mobile version of SO):

Attendance records 8 F 1921-2010 Box 2 1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010 Number of meetings attended each year 1 F 1991-1994 Box 2 Papers re: Safaris 10 F 1951-2011 Box 2 Incomplete; Includes correspondence about beginning “Safaris” may also include announcements, invitations, reports, attendance, and charges; some photographs. See also: Correspondence and Minutes

So the unindented text is the parent record data and each set of indented text below each parent data line are some notes for that data (which are also split into multiple lines themselves).

So far I have a crude script to parse out the unindented parent lines so that I get a list of dictionary items:

import re

f = open('example_text.txt', 'r')

lines = f.readlines()

records = []

for line in lines:

if line[0].isalpha():
        processed = re.split('\s{2,}', line)


        for i in processed:
        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]

    records.append({
        "title": title,
        "id": rec_id,
        "years": years,
        "location": location
    })


elif not line[0].isalpha():

    print "These are the notes, but attaching them to the above records is not clear"


print records`

and this produces:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010'}, {'id': '1 F', 'location': 'Box 2', 'title': 'Number of meetings attended each year', 'years': '1991-1994'}, {'id': '10 F', 'location': 'Box 2', 'title': 'Papers re: Safaris', 'years': '1951-2011'}]

But now I want to add to each record the notes to the effect of:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010', 'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010' }, ...]

What's confusing me is that I am assuming this procedural approach, line by line, and I'm not sure if there is a more Pythonic way to do this. I am more used to working with scraping webpages and with those at least you have selectors, here it's hard to double back going one by one down the line and I was hoping someone might be able to shake my thinking loose and provide a fresh view on a better way to attack this.

Update Just adding the condition suggested by answer below over the indented lines worked fine:

import re
import repr as _repr
from pprint import pprint


f = open('example_text.txt', 'r')

lines = f.readlines()

records = []

for line in lines:

    if line[0].isalpha():
        processed = re.split('\s{2,}', line)

        #print processed

        for i in processed:
            title = processed[0]
            rec_id = processed[1]
            years = processed[2]
            location = processed[3]

    if not line[0].isalpha():


        record['notes'].append(line)
        continue

    record = { "title": title,
               "id": rec_id,
               "years": years,
               "location": location,
               "notes": []}

    records.append(record)





pprint(records)
2
  • How many indentation levels do you have? Commented Apr 24, 2014 at 16:39
  • @Hyperboreus pretty much just the one level of indentation for the notes. Commented Apr 24, 2014 at 16:41

1 Answer 1

1

As you have already solved the parsing of the records, I will only focus on how to read the notes of each one:

records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith ('\t'):
            record ['notes'].append (line [1:])
            continue
        record = {'title': line, 'notes': [] }
        records.append (record)

for record in records:
    print ('Record is', record ['title'] )
    print ('Notes are', record ['notes'] )
    print ()
Sign up to request clarification or add additional context in comments.

1 Comment

hmm great, that was easier than i thought. i thought i was having trouble with the nesting of the loops, but making the if checking for tabs condition on the same indent level as the check for parent line worked fine. thanks.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.