Python: parsing structured text to CSV format

Question

I want to convert plain structured text files to the CSV format using Python.

The input looks like this:

[-------- 1 -------]
Version: 2
 Stream: 5
 Account: A
[...]
[------- 2 --------]
 Version: 3
 Stream: 6
 Account: B
[...]

The output is supposed to look like this:

Version; Stream; Account; [...]
2; 5; A; [...]
3; 6; B; [...]

That is, the input consists of structured text records delimited by [----<sequence number>----], each containing <key>: <value> pairs, and the output should be CSV with one record per line.

I am able to extract the <key>: <value> pairs via

import re

# raw strings keep backslash sequences such as \d intact for the regex engine
colonseperated = re.compile(r' *(.+) *: *(.+) *')
fixedfields = re.compile(r'(\d{3} \w{7}) +(.*)')

but I have trouble recognizing the beginning and end of the structured text records and rewriting them as CSV line records. Furthermore, I would like to be able to separate different types of records, i.e. distinguish between, say, Version: 2 and Version: 3 records.

Your input file is not in CSV format; it is structured, but not delimiter-separated. Your output is.
– Martijn Pieters, Commented Oct 17, 2013 at 21:04
And what do you expect to do with the different versions of records?
– Martijn Pieters, Commented Oct 17, 2013 at 21:06
The different types of records have a different number of elements.
– felix.krull, Commented Oct 17, 2013 at 21:08
Ah, that makes a difference; your output then is not strictly CSV either. My answer below assumed the records were the same size each.
– Martijn Pieters, Commented Oct 17, 2013 at 21:13
Do you know what fields are used beforehand? Or do you need to collect those first from the input file?
– Martijn Pieters, Commented Oct 17, 2013 at 21:16

Martijn Pieters · Accepted Answer · 2013-10-17 21:35:36Z

Reading the records is not that hard:

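def read_records(iterable):
    """Yield one dict per record from lines of '<key>: <value>' text."""
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            # blank or malformed line; skip it rather than fail on unpacking
            continue
        key, value = line.strip().split(':', 1)
        record[key.strip()] = value.strip()

    # file done, yield last record
    if record:
        yield record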

This produces dictionaries from your input file.
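
As a quick check, here is a minimal sketch that runs a reader like the one above on the sample input from the question. The function is repeated so the snippet runs on its own; the guard for lines without a colon is an assumption about how blank or elided lines should be treated, and io.StringIO merely stands in for the real input file.

```python
import io


def read_records(lines):
    """Yield one dict per '[---- n ----]'-delimited record."""
    record = {}
    for line in lines:
        if line.startswith('[------'):
            # delimiter line: emit the record collected so far
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            # assumption: lines without a colon carry no field and are skipped
            continue
        key, value = line.split(':', 1)
        record[key.strip()] = value.strip()
    if record:
        yield record


# Sample taken from the question (the elided "[...]" lines are left out)
SAMPLE = """\
[-------- 1 -------]
Version: 2
 Stream: 5
 Account: A
[------- 2 --------]
 Version: 3
 Stream: 6
 Account: B
"""

records = list(read_records(io.StringIO(SAMPLE)))
for record in records:
    print(record)
```

Each record comes out as a plain dict keyed by field name, with leading and trailing whitespace stripped from both keys and values.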

From this you can produce CSV output using the csv module, specifically the csv.DictWriter() class:

import csv

# List *all* possible keys, in the order the output file should list them
headers = ('Version', 'Stream', 'Account', ...)

# Python 3: open the output in text mode with newline='' (on Python 2, use 'wb')
with open(inputfile) as infile, open(outputfile, 'w', newline='') as outfile:
    records = read_records(infile)

    writer = csv.DictWriter(outfile, headers, delimiter=';')
    writer.writeheader()

    # and write
    writer.writerows(records)

Any header keys missing from a record will leave that column empty for that record. Any record keys that are not listed in headers will raise a ValueError; either add those keys to the headers tuple, or pass extrasaction='ignore' to the DictWriter() constructor.
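
To illustrate both cases, here is a small sketch with made-up records (the 'Extra' key and the write_csv helper are hypothetical, for demonstration only). It writes to an in-memory buffer instead of a file so the result is easy to inspect.

```python
import csv
import io

headers = ('Version', 'Stream', 'Account')

# Made-up records: the first lacks 'Account', the second has an unlisted key
records = [
    {'Version': '2', 'Stream': '5'},
    {'Version': '3', 'Stream': '6', 'Account': 'B', 'Extra': 'x'},
]


def write_csv(rows, **kwargs):
    """Write rows to an in-memory buffer and return the CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, headers, delimiter=';',
                            lineterminator='\n', **kwargs)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Default behaviour (extrasaction='raise'): the unlisted key is an error
try:
    write_csv(records)
    strict_failed = False
except ValueError:
    strict_failed = True

# With extrasaction='ignore' the unlisted key is silently dropped
lenient = write_csv(records, extrasaction='ignore')
print(strict_failed)
print(lenient)
```

Ignoring extra keys is convenient, but it also hides typos in field names, so listing every expected key in headers is usually the safer default.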

edited Oct 17, 2013 at 21:35

answered Oct 17, 2013 at 21:12

Martijn Pieters

2 Comments

felix.krull Over a year ago

Thanks for the valuable and well-explained hints. I have a working prototype now. One problem remains: with the full set of headers (approx. 100), no proper output is generated, just one line of wrongly mapped fields. Is there a limit on the number of headers the csv module accepts?

Martijn Pieters Over a year ago

Not that I know of; sounds like something else might be wrong instead.
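
On the point raised in the question's comments, that different record types carry different numbers of elements: one possible approach, sketched here under assumptions, is to group the parsed records by their Version value, collect each group's field names in first-seen order, and write one CSV per group. The helper names group_by_version and to_csv_text and the 'Region' field are hypothetical. The sketch holds all records in memory, which may not suit very large inputs.

```python
import csv
import io
from collections import defaultdict


def group_by_version(records):
    """Map each Version value to (headers, rows), headers in first-seen order."""
    groups = defaultdict(lambda: ([], []))
    for record in records:
        headers, rows = groups[record.get('Version')]
        for key in record:
            if key not in headers:
                headers.append(key)
        rows.append(record)
    return dict(groups)


def to_csv_text(headers, rows):
    """Render rows as semicolon-delimited CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, headers, delimiter=';', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up records; 'Region' is a hypothetical field only Version 3 carries
records = [
    {'Version': '2', 'Stream': '5', 'Account': 'A'},
    {'Version': '3', 'Stream': '6', 'Account': 'B', 'Region': 'EU'},
    {'Version': '2', 'Stream': '7', 'Account': 'C'},
]

csv_by_version = {
    version: to_csv_text(headers, rows)
    for version, (headers, rows) in group_by_version(records).items()
}
for version, text in csv_by_version.items():
    print('Version', version)
    print(text)
```

Because the headers are derived from the data, no record can contain a key that is missing from its group's header list, so the ValueError described in the answer does not arise here.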
