How to read a large file block by block and judge by block header?

Question

I have a large file which I want to read block by block by matching the headers. For example, the file is like this:

@header1
a b c 1 2 3
c d e 2 3 4
q w e 3 4 5


@header2
e 89 78 56
s 68 77 26
...

I wrote a script like this:

with open("filename") as f:
  line=f.readline()
  if line.split()[0]=="@header1":
     list1.append(f.readline().split()[0])
     list2.append(f.readline().split()[1])
     ...
  elif line.split()[0]=="@header2":
     list6.append(f.readline().split()[0])
     list7.append(f.readline().split()[1])
     ...

But it seems to read only the first header and does not read in the second block. Also, there are some empty lines between the blocks. How can I read a block when its header line matches a certain string, and skip the empty lines?

I know that in C this would be a switch statement. How can I do something similar in Python?

You need to add more details. Are these multiple space-separated file segments inside one file? Are the @header... lines guaranteed to be numbered sequentially and contiguously? If @header1 occurs on a line of its own, why do you test line.split()[0]=="@header2" rather than simply line == "@header2"? Or just line.startswith('@header'), which should capture them all and doesn't even need a regex?
– smci, Commented Nov 28, 2018 at 0:30
Ultimately I expect you want to read the contents of the space-separated rows (within each section, according to its header), so you'll want to wrap a reader object. Or write a generator to yield each chunk of rows separately, so you can then pass it into a reader object.
– smci, Commented Nov 28, 2018 at 1:17
"Also, there are some empty lines in between those blocks." So, you're guaranteed that empty lines can only occur outside sections, not inside them?
– smci, Commented Nov 28, 2018 at 1:18

SpghttCd · Accepted Answer · 2018-11-28 07:48:00Z

1

IMO, your misconception is about how CSV files can be read. At least, I doubt that `switch` from C would help here more than what can be done with if-clauses.

However, please understand that you have to iterate through your file line by line. That is, nothing can deal with whole blocks if you do not know their length beforehand.

So your algorithm is something like:

for every line in the file:
    is header?
        then prepare for this specific header
    is empty line?
        then skip
    is data?
        then append according to preparation above

In code this could be something like:

block_ctr = -1
block_data = []
with open(filename) as f:
    for line in f:
        if line.strip():                 # skip empty or whitespace-only lines (a bare `if line:` is always true, since each line still ends with '\n')
            if line.startswith('@header'):
                block_ctr += 1
                block_data.append([])
            else:
                block_data[block_ctr].append(line.split())
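If you would rather look blocks up by their header than by position, the same loop can fill a dict instead of a list. This is only a sketch under a few assumptions that are not in the question: every header line starts with '@', the whole header line is used as the key, and the function name read_blocks is made up for illustration.

```python
import io

def read_blocks(lines):
    """Group whitespace-split data rows under the header line that precedes them."""
    blocks = {}
    current = None                       # header of the block currently being filled
    for line in lines:
        line = line.strip()
        if not line:                     # skip empty or whitespace-only lines
            continue
        if line.startswith('@'):         # assumption: every header starts with '@'
            current = line
            blocks.setdefault(current, [])
        elif current is None:
            raise ValueError('data line found before the first header')
        else:
            blocks[current].append(line.split())
    return blocks

# Small in-memory sample shaped like the file in the question
sample = io.StringIO(
    "@header1\n"
    "a b c 1 2 3\n"
    "c d e 2 3 4\n"
    "\n"
    "\n"
    "@header2\n"
    "e 89 78 56\n"
)
blocks = read_blocks(sample)
print(blocks['@header2'])
```

Since read_blocks only iterates over its argument, you can pass it an open file object just as well as the StringIO used here.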

edited Nov 28, 2018 at 7:48

answered Nov 28, 2018 at 0:34

SpghttCd


1 Comment

smci Over a year ago

It lends itself to a generator approach, see my answer

Akarius · Accepted Answer · 2018-11-28 00:33:42Z

0

I don't know what you want to achieve exactly but maybe something like this:

with open(filename) as f:
    for line in f:
        if line.startswith('@'):
            print('header')
            # do something with header here
        else:
            print('regular line')
            # do something with the line here
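As for the switch part of the question: one common substitute in Python is a dict that maps each header to a handler function. The sketch below is only a guess at what you might want; the handler names and the two target lists are hypothetical stand-ins for list1, list6 and so on.

```python
import io

first_fields = []     # hypothetical stand-in for list1
second_fields = []    # hypothetical stand-in for list7

def handle_header1(fields):
    first_fields.append(fields[0])

def handle_header2(fields):
    second_fields.append(fields[1])

# Dispatch table: plays the role of the C switch on the header name
HANDLERS = {'@header1': handle_header1, '@header2': handle_header2}

def process(lines):
    handler = None
    for line in lines:
        fields = line.split()
        if not fields:                          # empty line: skip
            continue
        if fields[0].startswith('@'):           # header line: pick the handler
            handler = HANDLERS.get(fields[0])   # None for unknown headers
        elif handler is not None:               # data line of a known block
            handler(fields)

process(io.StringIO(
    "@header1\na b c 1 2 3\nc d e 2 3 4\n\n\n@header2\ne 89 78 56\ns 68 77 26\n"
))
print(first_fields, second_fields)
```

Data lines under an unknown header are silently ignored here; raising an error instead would be an equally reasonable choice.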

answered Nov 28, 2018 at 0:33

Akarius


smci · Accepted Answer · 2018-11-28 12:46:18Z

Attached at the bottom is a solution using a Python generator, split_into_chunks(f), which extracts each section (as a list of strings), squelches empty lines, and detects missing @headers and EOF. The generator approach is really neat because it allows you to further wrap each chunk in e.g. a CSV reader object that handles space-separated values (e.g. pandas read_csv):

with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        # Do stuff on chunk. Presumably, wrap a reader e.g. pandas read_csv
        print(chunk)

Code is below. I also parameterized the value demarcator='@header' for you. Note that we have to iterate with line = inputstream.readline() inside a while loop, instead of the usual for line in f, because when we see the @header of the next section we need to push it back using tell()/seek(). In Python 3, calling tell() on a text file while iterating over it with for raises an OSError, since the iterator reads ahead. And if you want to modify the generator to yield the chunk header and body separately (e.g. as a list of two items), that's trivial.

def split_into_chunks(inputstream, demarcator='@header'):
    """Utility generator to get sections from file, demarcated by '@header'"""

    while True:
        chunk = []

        line = inputstream.readline()
        # At EOF?
        if not line: break
        # Expect that each chunk starts with one header line
        if not line.startswith(demarcator):
            raise RuntimeError(f"Bad chunk, missing {demarcator}")

        chunk.append(line.rstrip('\n'))

        # Can't use `for line in inputstream:` since we may need to pushback
        while line:
            # Remember our file-pointer position in case we need to pushback a header row
            last_pos = inputstream.tell()
            line = inputstream.readline()

            # Saw next chunk's header line? Pushback the header line, then yield the current chunk
            if line.startswith(demarcator):
                inputstream.seek(last_pos)
                break

            # Ignore blank or whitespace-only lines
            if line.strip():
                chunk.append(line.rstrip('\n'))

        yield chunk


with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        # Do stuff on chunk. Presumably, wrap it with a reader which handles space-separated values, e.g. pandas read_csv
        print(chunk)
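For the header/body variant mentioned above, here is one possible sketch. It is not the generator above but a different technique: it keeps the pending header in a local variable instead of pushing it back with seek/tell, so a plain for loop works and the input does not need to be seekable. The name iter_sections and the choice to split each row on whitespace are my assumptions.

```python
import io

def iter_sections(lines, demarcator='@header'):
    """Yield (header, rows) pairs; rows are the whitespace-split data lines."""
    header, rows = None, []
    for line in lines:
        line = line.strip()
        if not line:                         # ignore blank or whitespace-only lines
            continue
        if line.startswith(demarcator):
            if header is not None:           # the previous section is complete
                yield header, rows
            header, rows = line, []
        elif header is None:
            raise RuntimeError(f"Bad chunk, missing {demarcator}")
        else:
            rows.append(line.split())
    if header is not None:                   # flush the last section at EOF
        yield header, rows

sample = io.StringIO(
    "@header1\na b c 1 2 3\nc d e 2 3 4\n\n\n@header2\ne 89 78 56\ns 68 77 26\n"
)
sections = list(iter_sections(sample))
for header, rows in sections:
    print(header, len(rows))
```

The trade-off is that the header is consumed by the generator itself, which is fine as long as nothing else reads from the same stream in between.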

fish_bu · Accepted Answer · 2018-11-29 19:53:21Z

0

I saw another post similar to this question and copied the idea here. I agree that SpghttCd is right although I have not tried that.

with open(filename) as f:
    # First pass: find the line number of each header
    for i, line in enumerate(f, 1):
        if 'some_header' in line:
            index1 = i
        elif 'another_header' in line:
            index2 = i
        ...
with open(filename) as f:
    # Read the first block: skip everything up to and including its header
    for i in range(index1):
        line = f.readline()
    for i in range(block_size):   # block_size is a placeholder for the number of data lines in the block
        ...                       # read, split and store
    f.seek(0)
    # Read the second block, third and ...
    ...
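A runnable sketch of the same two-pass idea, with some assumptions of mine: headers start with '@', a block ends at the next header or at EOF, and the function names index_headers and read_block are made up. Instead of counting line numbers, it records the position returned by tell() right after each header, so the second pass can seek() straight to a block.

```python
import os
import tempfile

def index_headers(path, marker='@'):
    """First pass: map each header line to the file position just after it."""
    offsets = {}
    with open(path) as f:
        while True:
            line = f.readline()              # readline() keeps tell() usable
            if not line:
                break
            if line.startswith(marker):
                offsets[line.strip()] = f.tell()
    return offsets

def read_block(path, offset, marker='@'):
    """Second pass: seek to a block and read rows until the next header or EOF."""
    rows = []
    with open(path) as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line or line.startswith(marker):
                break
            if line.strip():                 # skip empty lines
                rows.append(line.split())
    return rows

# Small demonstration on a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("@header1\na b c 1 2 3\n\n\n@header2\ne 89 78 56\ns 68 77 26\n")
offsets = index_headers(tmp.name)
second = read_block(tmp.name, offsets['@header2'])
os.remove(tmp.name)
print(second)
```

This only pays off if you need some of the blocks, or need them out of order; to read everything once, a single pass is simpler.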

answered Nov 29, 2018 at 19:53

fish_bu
