How to read a file with variable multi-row data in Python

Question

I have a file of about 100 MB that looks like this:

#meta data 1    
skadjflaskdjfasljdfalskdjfl
sdkfjhasdlkgjhsdlkjghlaskdj
asdhfk
#meta data 2
jflaksdjflaksjdflkjasdlfjas
ldaksjflkdsajlkdfj
#meta data 3
alsdkjflasdjkfglalaskdjf

Each row of metadata is followed by a variable number of data lines that contain only alphanumeric characters. What is the best way to read this data into a simple list like this:

data = [['#meta data 1', 'skadjflaskdjfasljdfalskdjflsdkfjhasdlkgjhsdlkjghlaskdjasdhfk'],
        ['#meta data 2', 'jflaksdjflaksjdflkjasdlfjasldaksjflkdsajlkdfj'],
        ['#meta data 3', 'alsdkjflasdjkfglalaskdjf']]

My initial idea was to use the read() method to read the whole file into memory and then use regular expressions to parse the data into the desired format. Is there a better, more Pythonic way? All metadata lines start with an octothorpe (#) and all data lines are entirely alphanumeric. Thanks!

unutbu · Accepted Answer · 2011-11-13 17:44:06Z

4

itertools.groupby provides an easy way to collect lines into groups:

import itertools

data=[]
with open('data.txt','r') as f:
    for key,group in itertools.groupby(f,lambda line: line.startswith('#meta')):
        if key:
            meta=next(group).strip()
        else:
            lines=''.join(group).strip()
            data.append((meta,lines))
print(data)

yields

[('#meta data 1', 'skadjflaskdjfasljdfalskdjfl\nsdkfjhasdlkgjhsdlkjghlaskdj\nasdhfk'), ('#meta data 2', 'jflaksdjflaksjdflkjasdlfjas\nldaksjflkdsajlkdfj'), ('#meta data 3', 'alsdkjflasdjkfglalaskdjf')]
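Note that the output above keeps the newline characters between the data lines of each group. If a single concatenated string per group is wanted, as in the question, one possible variation is to strip each line before joining. The following is only a sketch: the read_groups name is made up for illustration, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io
import itertools


def read_groups(lines):
    """Collect (meta, data) pairs; data lines are joined without newlines."""
    data = []
    meta = None  # stays None if data lines appear before any metadata line
    for is_meta, group in itertools.groupby(lines, lambda line: line.startswith('#meta')):
        if is_meta:
            meta = next(group).strip()
        else:
            # Strip each line before joining so no '\n' ends up in the result.
            joined = ''.join(line.strip() for line in group)
            data.append((meta, joined))
    return data


# io.StringIO is a stand-in for the opened file (an assumption for this sketch).
sample = io.StringIO(
    "#meta data 1\n"
    "skadjflaskdjfasljdfalskdjfl\n"
    "sdkfjhasdlkgjhsdlkjghlaskdj\n"
    "asdhfk\n"
    "#meta data 2\n"
    "jflaksdjflaksjdflkjasdlfjas\n"
    "ldaksjflkdsajlkdfj\n"
    "#meta data 3\n"
    "alsdkjflasdjkfglalaskdjf\n"
)
result = read_groups(sample)
print(result)
```

With a real file, the same function could be called as read_groups(f) inside a with open(...) block, since a file object also iterates line by line.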

The expression

itertools.groupby(f,lambda line: line.startswith('#meta'))

returns an iterator. It loops through the lines in f, and calls the lambda function on each line. When it encounters a line that begins with #meta, that function returns True, otherwise False.

itertools.groupby collects all the contiguous lines that return the same value.

So the line that begins with #meta is placed in its own group, then all the subsequent lines not beginning with #meta are placed in the next group, and so on.

The key is the return value from the lambda function. In this case, it will be either True or False.
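To make the grouping behaviour concrete, here is a small sketch that runs groupby on a short in-memory list instead of a file. The sample lines are invented for illustration.

```python
import itertools

# Made-up sample lines standing in for the file contents.
lines = ['#meta a', 'x1', 'x2', '#meta b', 'y1']

# Materialise each group with list() while iterating, because a group
# is consumed as soon as groupby advances to the next one.
groups = [
    (key, list(group))
    for key, group in itertools.groupby(lines, lambda line: line.startswith('#meta'))
]

for key, members in groups:
    print(key, members)
# True ['#meta a']
# False ['x1', 'x2']
# True ['#meta b']
# False ['y1']
```

Each True group holds one metadata line and each False group holds the data lines that follow it, which is why the loop in the answer alternates between setting meta and appending to data.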

2 Comments

drbunsen Over a year ago

Wow this is great! The only thing I'm having difficulty with is my output gives me [(False, 'skadjflaskdjfasljdfalskdjfl\nsdkfjhasdlkgjhsdlkjghlaskdj\nasdhfk')... I can't seem to see why I get a boolean and why it is false?!

unutbu Over a year ago

It looks like perhaps you are printing the key rather than meta? Are you using data.append((key,lines))? If so, change key --> meta.

Cat Plus Plus · Accepted Answer · 2011-11-13 17:45:30Z

1

I don't know whether this will be the fastest way, but off the top of my head:

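data = []
with open('input.file', 'r') as fp:
    for line in fp:
        line = line.strip()
        if not line:
            continue  # skip blank lines; line[0] would raise IndexError
        if line.startswith('#'):
            # Metadata line: start a new (meta, data_lines) entry
            data.append((line, []))
        else:
            # Data line: add it to the most recent entry
            data[-1][1].append(line)
# Concatenate each entry's data lines into a single string
data = [(meta, ''.join(lines)) for meta, lines in data]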

1 Comment

drbunsen Over a year ago

Thanks, this was a cool answer. I've never thought to do it this way.

Mathieu Mahé · Accepted Answer · 2011-11-13 17:42:04Z

0

I guess something like this:

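result = []
with open('data.txt') as f:  # 'data.txt' is a placeholder file name
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('#'):
            result.append([line])      # new group: [meta]
        elif len(result[-1]) == 1:
            result[-1].append(line)    # first data line: [meta, data]
        else:
            result[-1][-1] += line     # later data lines extend the data string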

Not tested.

John Zwinck · Accepted Answer · 2011-11-13 17:44:25Z

0

I'd keep it simple, something like:

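data = []        # result
lastmeta = None  # the last metadata line seen
chunks = []      # data lines since the last metadata line
with open('data.txt') as f:  # 'data.txt' is a placeholder file name
    for line in f:
        line = line.strip()
        if line.startswith('#'):  # metadata
            if lastmeta is not None:  # flush the data collected so far
                data.append((lastmeta, ''.join(chunks)))
            lastmeta = line
            chunks = []  # start a new group
        elif line:
            chunks.append(line)
if lastmeta is not None:  # flush the final group
    data.append((lastmeta, ''.join(chunks)))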
