I want to convert plain structured text files to the CSV format using Python.

The input looks like this:

[-------- 1 -------]
Version: 2
 Stream: 5
 Account: A
[...]
[------- 2 --------]
 Version: 3
 Stream: 6
 Account: B
[...]

The output is supposed to look like this:

Version; Stream; Account; [...]
2; 5; A; [...]
3; 6; B; [...]

I.e. the input consists of structured text records delimited by [----<sequence number>----] and containing <key>: <value> pairs, and the output should be CSV containing one record per line.

I am able to retrieve the <key>: <value> pairs into CSV format via

import re

# raw strings keep backslash sequences such as \d and \w intact
colonseperated = re.compile(r' *(.+) *: *(.+) *')
fixedfields = re.compile(r'(\d{3} \w{7}) +(.*)')

However, I have trouble recognizing the beginning and end of the structured text records and rewriting them as CSV line records. Furthermore, I would like to be able to separate different types of records, i.e. distinguish between, say, Version: 2 and Version: 3 records.

  • Your input file is not a CSV format; it is structured, but not delimiter-separated. Your output is. Commented Oct 17, 2013 at 21:04
  • And what do you expect to do with the different versions of records? Commented Oct 17, 2013 at 21:06
  • The different type of records have a different number of elements. Commented Oct 17, 2013 at 21:08
  • ah, that makes a difference; your output then is not strictly CSV either. My answer below assumed the records were the same size each. Commented Oct 17, 2013 at 21:13
  • Do you know what fields are used beforehand? Or do you need to collect those first from the input file? Commented Oct 17, 2013 at 21:16

1 Answer

Reading the records is not that hard:

def read_records(iterable):
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            # blank or malformed line; skip it instead of failing on the split
            continue
        key, value = line.strip().split(':', 1)
        record[key.strip()] = value.strip()

    # file done, yield last record
    if record:
        yield record

This produces dictionaries from your input file.
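
As a quick check, the generator can be run against the sample from the question. This is a minimal sketch, assuming Python 3.7 or later; the sample string is copied from the question with the elided [...] lines left out, and the ':' guard is an addition so that blank lines do not break the split.

```python
import io

def read_records(iterable):
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            # assumption: lines without a colon carry no key/value pair
            continue
        key, value = line.strip().split(':', 1)
        record[key.strip()] = value.strip()
    # input done, yield last record
    if record:
        yield record

# Sample input from the question, without the elided lines
sample = """\
[-------- 1 -------]
Version: 2
 Stream: 5
 Account: A
[------- 2 --------]
 Version: 3
 Stream: 6
 Account: B
"""

# io.StringIO stands in for an open file object
records = list(read_records(io.StringIO(sample)))
for record in records:
    print(record)
# {'Version': '2', 'Stream': '5', 'Account': 'A'}
# {'Version': '3', 'Stream': '6', 'Account': 'B'}
```

All values come back as strings; convert them afterwards if numeric types are needed.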

From this you can produce CSV output using the csv module, specifically the csv.DictWriter() class:

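import csv

# List *all* possible keys, in the order the output file should list them
headers = ('Version', 'Stream', 'Account', ...)

# Python 3: open the output in text mode with newline=''; on Python 2 use 'wb'
with open(inputfile) as infile, open(outputfile, 'w', newline='') as outfile: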
    records = read_records(infile)

    writer = csv.DictWriter(outfile, headers, delimiter=';')
    writer.writeheader()

    # and write
    writer.writerows(records)

Any header keys missing from a record will leave that column empty for that record. Any keys in a record that are not listed in headers will raise a ValueError; either add those keys to the headers tuple, or set the extrasaction keyword argument of the DictWriter() constructor to 'ignore'.
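
Since the comments mention that the record types differ in their number of fields, one option is to group the parsed dictionaries by their Version value and write one CSV per group, collecting each group's headers from the records themselves. The following is a sketch under assumptions rather than part of the approach above: group_by_version, collect_headers and to_csv are hypothetical helper names, the hard-coded records stand in for the output of read_records(), the 'Region' field is invented for illustration, and Python 3 is assumed.

```python
import csv
import io
from collections import defaultdict

def group_by_version(records):
    """Map each Version value to the list of records carrying it."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get('Version')].append(record)
    return groups

def collect_headers(records):
    """Collect every key seen in the records, in first-seen order."""
    headers = []
    for record in records:
        for key in record:
            if key not in headers:
                headers.append(key)
    return headers

def to_csv(records):
    """Render records as semicolon-delimited CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, collect_headers(records), delimiter=';')
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

# Stand-in data; 'Region' is a made-up field to show differing record shapes
records = [
    {'Version': '2', 'Stream': '5', 'Account': 'A'},
    {'Version': '3', 'Stream': '6', 'Account': 'B', 'Region': 'X'},
    {'Version': '2', 'Stream': '7'},
]

outputs = {
    version: to_csv(group)
    for version, group in group_by_version(records).items()
}
for version, text in outputs.items():
    print('Version', version)
    print(text)
```

The third record has no Account key, so its Account column is left empty. Grouping needs all records in memory, so the generator output has to be turned into a list first; for very large inputs a two-pass approach (one pass to collect headers, one to write) may be preferable.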

2 Comments

Thanks for the valuable and well-explained hints. I do have a working prototype now. There is still one problem: with the full number of headers (approx. 100), no proper output is generated, just one line of wrongly mapped fields. Is there a limitation to csv(headers)?
Not that I know of; sounds like something else might be wrong instead.