2

I have a file of about 100 MB that looks like this:

#meta data 1    
skadjflaskdjfasljdfalskdjfl
sdkfjhasdlkgjhsdlkjghlaskdj
asdhfk
#meta data 2
jflaksdjflaksjdflkjasdlfjas
ldaksjflkdsajlkdfj
#meta data 3
alsdkjflasdjkfglalaskdjf

Each metadata line is followed by a variable number of data lines of varying length that contain only alphanumeric characters. What is the best way to read this data into a simple list like this:

data = [[#meta data 1, skadjflaskdjfasljdfalskdjflsdkfjhasdlkgjhsdlkjghlaskdjasdhfk],
       [#meta data 2, jflaksdjflaksjdflkjasdlfjasldaksjflkdsajlkdfj],
       [#meta data 3, alsdkjflasdjkfglalaskdjf]]

My initial idea was to use the read() method to read the whole file into memory and then use regular expressions to parse the data into the desired format. Is there a better, more Pythonic way? All metadata lines start with an octothorpe (#), and all data lines are purely alphanumeric. Thanks!

4 Answers

4

itertools.groupby provides an easy way to collect lines into groups:

import itertools

data = []
with open('data.txt', 'r') as f:
    # key is True for a run of '#meta' lines, False for a run of data lines
    for key, group in itertools.groupby(f, lambda line: line.startswith('#meta')):
        if key:
            meta = next(group).strip()
        else:
            lines = ''.join(group).strip()
            data.append((meta, lines))
print(data)

yields

[('#meta data 1', 'skadjflaskdjfasljdfalskdjfl\nsdkfjhasdlkgjhsdlkjghlaskdj\nasdhfk'), ('#meta data 2', 'jflaksdjflaksjdflkjasdlfjas\nldaksjflkdsajlkdfj'), ('#meta data 3', 'alsdkjflasdjkfglalaskdjf')]
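Note that the joined data above still contains the newlines between the original lines, whereas the question asks for each record's data as one continuous string. A minimal variant, assuming the same file layout, strips each line before joining. The function name `read_records` and the in-memory sample (standing in for the real file) are illustrative and not part of the original answer.

```python
import io
import itertools

def read_records(lines):
    """Group lines into (meta, data) tuples, dropping newlines inside data."""
    records = []
    meta = None
    for is_meta, group in itertools.groupby(lines, lambda line: line.startswith('#meta')):
        if is_meta:
            meta = next(group).strip()
        else:
            # Strip every line before joining so no '\n' survives in the data.
            records.append((meta, ''.join(line.strip() for line in group)))
    return records

# In-memory stand-in for the file from the question.
sample = io.StringIO(
    "#meta data 1\n"
    "skadjflaskdjfasljdfalskdjfl\n"
    "sdkfjhasdlkgjhsdlkjghlaskdj\n"
    "asdhfk\n"
    "#meta data 2\n"
    "jflaksdjflaksjdflkjasdlfjas\n"
    "ldaksjflkdsajlkdfj\n"
    "#meta data 3\n"
    "alsdkjflasdjkfglalaskdjf\n"
)
records = read_records(sample)
print(records)
```

With a real file, the same function could be called as `read_records(f)` inside a `with open(...) as f:` block, since a file object is also an iterable of lines.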

The expression

itertools.groupby(f, lambda line: line.startswith('#meta'))

returns an iterator. It loops through the lines in f and calls the lambda function on each line. The function returns True for a line that begins with #meta and False otherwise.

itertools.groupby collects all the contiguous lines for which the function returns the same value.

So the line that begins with #meta is placed in its own group, then all the subsequent lines not beginning with #meta are placed in the next group, and so on.

The key is the return value from the lambda function. In this case, it will be either True or False.
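To make the grouping visible, here is a small sketch that runs the same key function over a short, made-up list of lines and prints each key next to the lines in its group. The sample lines are invented for illustration only.

```python
import itertools

# Made-up sample lines; a real file object would yield lines the same way.
lines = ['#meta data 1\n', 'aaa\n', 'bbb\n', '#meta data 2\n', 'ccc\n']

# Materialize each group as a list so it can be inspected after the loop.
groups = [
    (key, [line.strip() for line in group])
    for key, group in itertools.groupby(lines, lambda line: line.startswith('#meta'))
]

for key, members in groups:
    print(key, members)
# True ['#meta data 1']
# False ['aaa', 'bbb']
# True ['#meta data 2']
# False ['ccc']
```

Each True group holds a metadata line and each False group holds the data lines that follow it, which is why the loop in the answer stores the metadata from the True branch and appends to the result in the False branch.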

2 Comments

Wow this is great! The only thing I'm having difficulty with is my output gives me [(False, 'skadjflaskdjfasljdfalskdjfl\nsdkfjhasdlkgjhsdlkjghlaskdj\nasdhfk')... I can't seem to see why I get a boolean and why it is false?!
It looks like perhaps you are printing the key rather than meta? Are you using data.append((key,lines))? If so, change key --> meta.
1

I don't know whether this is the fastest way, but off the top of my head:

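data = []
with open('input.file', 'r') as fp:
    for line in fp:
        line = line.strip()
        if not line:
            continue  # skip blank lines; line[0] would raise IndexError on them
        if line.startswith('#'):
            data.append((line, []))  # metadata line: start a new, empty group
        else:
            data[-1][1].append(line)  # data line: add to the current group
# join each group's lines into a single string
data = [(X, ''.join(Y)) for X, Y in data]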
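The loop above can be tried without the real file by wrapping it in a function that accepts any iterable of lines. This is only a sketch: `parse_records` and the sample text are hypothetical names and data, and ignoring data lines that appear before the first metadata line is an assumption about the desired behavior.

```python
import io

def parse_records(lines):
    """Collect (meta, data) tuples from an iterable of text lines."""
    data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('#'):
            data.append((line, []))   # new record with an empty list of chunks
        elif data:
            data[-1][1].append(line)  # add to the most recent record
        # Assumption: data lines before the first '#' line are ignored.
    return [(meta, ''.join(chunks)) for meta, chunks in data]

sample = io.StringIO("#meta data 1\nabc\ndef\n\n#meta data 2\nghi\n")
parsed = parse_records(sample)
print(parsed)
# [('#meta data 1', 'abcdef'), ('#meta data 2', 'ghi')]
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation, which can matter for a file of this size.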

1 Comment

Thanks, this was a cool answer. I've never thought to do it this way.
0

I guess something like this:

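result = []
with open('data.txt') as f:  # assumes the file starts with a metadata line
    for line in f:
        line = line.strip()  # drop the trailing newline
        if not line:
            continue  # skip blank lines
        if line.startswith('#'):
            result.append([line])  # start a new [meta] entry
        elif len(result[-1]) == 1:
            result[-1].append(line)  # first data line for this entry
        else:
            result[-1][-1] += line  # extend the existing data string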

Not tested.

0

I'd keep it simple, something like:

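data = []        # result
lastmeta = None  # the last metadata line seen
chunks = []      # data lines since the last metadata line
with open('data.txt') as f:
    for line in f:
        line = line.strip()
        if line.startswith('#'):  # metadata
            if lastmeta is not None:  # flush the data collected so far
                data.append((lastmeta, ''.join(chunks)))
            lastmeta = line
            chunks = []  # start collecting for the new metadata line
        elif line:
            chunks.append(line)
if lastmeta is not None:  # flush the final group
    data.append((lastmeta, ''.join(chunks)))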
