I have data files that contain data for many timesteps, with each timestep formatted in a block like this:

TIMESTEP  PARTICLES
0.00500103 1262
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....

Each block consists of the 3 header lines followed by a number of data lines for that timestep (the count is the integer on line 2 of the block). The number of data lines in a block can vary from 0 to 10 million. Blocks may be separated by a blank line, but sometimes it is missing.

I want to be able to read the file block by block, processing the data after reading the block - the files are large (often over 200GB) and one timestep is about all that can be comfortably loaded into memory.

Because of the file format I thought it would be quite easy to write a function that reads the 3 header lines, reads the actual data and then returns a nice numpy array for data processing. I'm used to MATLAB, where you can simply read in blocks while not at the end of the file. I'm not quite sure how to do this with Python.

I created the following function to read the block of data:

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0

    line = f.readline().strip()
    if line.startswith('TIMESTEP'): 

        timestepHeaders = line.strip()
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split(' ')
        parts = varData.strip().split(' ')
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split(' ')))
            linesProcessed += 1

        mydt = np.dtype([ ('ID',int), 
                     ('GROUP', int),
                     ('Vol', float),
                     ('Mass', float),
                     ('Px', float),
                     ('Py', float),
                     ('Pz', float),
                     ('Vx', float),
                     ('Vy', float),
                     ('Vz', float),
                     ] )

        particleData = np.array(particleData, dtype=mydt)

    return Timestep, numParticles, particleData

I try to run the function like this:

with open(fileOpenPath, 'r') as file:
    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it
    startWallTime = time.perf_counter()

    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)

    ## Do processing stuff here 
    print("Timestep Processed")

    endWallTime = time.perf_counter()

The problem is this only reads the first block of data from the file and stops there - I don't know how to make it loop through the file until it hits the end and stops.

Any suggestions on how to make this work would be great. I think I could do it with single-line processing and lots of if checks to see whether I'm at the end of the timestep, but the simple function seemed easier and clearer.

  • I still don't get what your problem looks like. In this code you only read one block. What happens when you try to read the next one? Also, I think that pandas.read_csv(f, nrows=X, sep=' ') would make this function much better. Commented Dec 11, 2016 at 22:20
  • That's the problem - it only reads one block - there are hundreds of timesteps in the file and I want it to keep returning one block until it reaches the end of the file. Commented Dec 11, 2016 at 22:36
  • what happens if you call readBlock on the same file again? Commented Dec 11, 2016 at 22:41
  • Looks like you are using Python 3. Is that correct? Commented Dec 11, 2016 at 22:43
  • @marat as long as I account for the possible empty line, it reads the next timestep - as long as I didn't close the file. Commented Dec 11, 2016 at 22:47

3 Answers


You can use the max_rows argument of numpy.genfromtxt:

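import numpy as np

with open("timesteps.dat", "rb") as f:
    while True:
        line = f.readline()
        # Skip blank lines between blocks. readline() returns b"" at EOF,
        # so this loop also stops there instead of spinning forever.
        while line and not line.strip():
            line = f.readline()
        if not line:
            # End of file
            break
        # line is the "TIMESTEP  PARTICLES" header; the values follow on the next line
        line2_fields = f.readline().split()
        timestep = float(line2_fields[0])
        particles = int(line2_fields[1])
        # names=True takes the column names from the third header line
        data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)

        print("Timestep:", timestep)
        print("Particles:", particles)
        print("Data:")
        print(data)
        print()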

Here's a sample file:

TIMESTEP  PARTICLES
0.00500103    4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103    5
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

TIMESTEP  PARTICLES
0.00500103    3
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

And here is the output:

Timestep: 0.00500103
Particles: 4
Data:
[ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 5
Data:
[ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
 (652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 3
Data:
[ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

1 Comment

Thanks. This is working almost perfectly now. It doesn't handle timesteps where the number of particles, and hence rows, is zero. genfromtxt doesn't like max_rows=0, so I wrapped it in an "if greater than zero" check, but that doesn't seem to help; it may have broken the empty-line check.
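
A minimal sketch of one way to cope with zero-particle blocks. It assumes the "greater than zero" check simply skipped the genfromtxt call; in that case the column-name line is never consumed, so the next pass through the loop would treat it as a block header. Here that line is read by hand when the count is zero and an empty structured array is returned. The function name read_blocks, the in-memory sample, and the choice of float for every field of the empty array are assumptions made for illustration, not part of the answer above.

```python
import io
import numpy as np

def read_blocks(f):
    """Yield (timestep, particles, data) for each block of an open file."""
    while True:
        line = f.readline()
        while line and not line.strip():   # skip optional blank lines
            line = f.readline()
        if not line:                       # empty string: end of file
            return
        fields = f.readline().split()      # e.g. "0.00500103 1262"
        timestep, particles = float(fields[0]), int(fields[1])
        if particles > 0:
            # genfromtxt consumes the column-name line plus `particles` rows.
            data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)
            data = np.atleast_1d(data)     # a single row would otherwise be 0-d
        else:
            # The comment above reports that max_rows=0 is not accepted,
            # so the column-name line is consumed by hand instead.
            names = f.readline().split()
            # Assumption: float for every field; genfromtxt may infer int for some.
            data = np.empty(0, dtype=[(name, float) for name in names])
        yield timestep, particles, data

# Sample built from rows shown earlier: 2 rows, then 0 rows, a blank line, then 1 row.
SAMPLE = (
    "TIMESTEP  PARTICLES\n"
    "0.00500103 2\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "TIMESTEP  PARTICLES\n"
    "0.00500103 0\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "\n"
    "TIMESTEP  PARTICLES\n"
    "0.00500103 1\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903\n"
)

for timestep, particles, data in read_blocks(io.StringIO(SAMPLE)):
    print(timestep, particles, data.shape)
```

With a real file, io.StringIO(SAMPLE) would be replaced by the handle returned from open().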

The with statement does not loop; it only makes sure the file is properly closed afterwards.

To loop you'll need to add a while just after the with statement (see the code below). But before you can do that, readBlock(f) has to check for end of file (EOF). Replace line = f.readline().strip() with this code:

line = f.readline()
if not line:
    # EOF: return None for all three values.
    return None, None, None
# Strip only after the check; otherwise a blank line ("\n")
# would be mistaken for EOF.
line = line.strip()

Then add the while loop inside the with block and check whether None comes back, which indicates EOF, so that we can break out of the loop:

import time

with open('file1') as file_handle:
    while True:
        # time.clock() was removed in Python 3.8; perf_counter() replaces it
        startWallTime = time.perf_counter()

        Timestep, numParticles, particleData = readBlock(file_handle)
        if Timestep is None:
            break
        print(Timestep)

        ## Do processing stuff here 
        print("Timestep Processed")

        endWallTime = time.perf_counter()

1 Comment

This seems to work reasonably well except for two things: the return None, None, None makes the output quite confusing when there is a blank line between every block, but my main problem is that the values returned from the function disappear once the 'while True' loop ends.
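
Both points in this comment can be sidestepped with a generator, sketched below under some assumptions. The values presumably "disappear" because the final call overwrites all three names with None just before the break; a for loop over a generator never performs that last assignment, and blank lines are skipped inside the generator instead of being returned to the caller. The name iter_blocks is hypothetical, and the rows are left as lists of strings here rather than converted to a numpy array.

```python
import io

def iter_blocks(f):
    """Yield (timestep, num_particles, rows) for each block of an open text file."""
    while True:
        line = f.readline()
        if not line:            # '' (not '\n') signals end of file
            return
        if not line.strip():    # optional blank line between blocks
            continue
        # `line` is the 'TIMESTEP  PARTICLES' header; the values are on the next line.
        fields = f.readline().split()
        timestep, num_particles = float(fields[0]), int(fields[1])
        f.readline()            # column names: ID GROUP VOLUME ...
        rows = [f.readline().split() for _ in range(num_particles)]
        yield timestep, num_particles, rows

# Two blocks built from rows shown in the question, separated by a blank line.
sample = io.StringIO(
    "TIMESTEP  PARTICLES\n"
    "0.00500103 2\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "\n"
    "TIMESTEP  PARTICLES\n"
    "0.00500103 1\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903\n"
)

counts = []
for timestep, num_particles, rows in iter_blocks(sample):
    counts.append(num_particles)   # process one block at a time here

# After the loop, the names still refer to the last block that was read.
print(counts, timestep, num_particles, len(rows))
```

Because the generator holds only one block at a time, memory use stays bounded by the largest timestep.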

Here's a quick-n-dirty test (it worked on the 2nd try!)

import numpy as np

with open('stack41091659.txt','rb') as f:
    while f.readline():    # read the 'TIMESTEP  PARTICLES' line
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        ablock = [f.readline()]  # block header line
        for i in range(n):
            ablock.append(f.readline())
        print(len(ablock))
        data = np.genfromtxt(ablock, dtype=None, names=True)
        print(data.shape, data.dtype)

test run:

1458:~/mypy$ python3 stack41091659.py 
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 3
4
(3,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 2
3
(2,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]

Sample file:

TIMESTEP  PARTICLES
0.00500103 4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 3
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 2
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

I'm using the fact that genfromtxt is happy with anything that feeds it a block of lines. Here I collect the next block in a list, and pass it to genfromtxt.

And using the max_rows parameter of genfromtxt, I can tell it to read the next n rows directly:

with open('stack41091659.txt','rb') as f:
    while f.readline():
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        data = np.genfromtxt(f, dtype=None, names=True, max_rows=n)
        print(data.shape, len(data.dtype.names))

I'm not taking that optional blank line into account. It could probably be squeezed in at the start of the block read, i.e. read lines until I get one with a valid float/int pair of strings.
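
The "read lines until a valid float/int pair turns up" idea could look roughly like the sketch below. The helper name read_counts is hypothetical, and the approach assumes that no data row ever consists of exactly two fields that parse as a float and an int. Blank lines and the 'TIMESTEP  PARTICLES' header are both skipped because neither parses as such a pair.

```python
import io

def read_counts(f):
    """Advance f to the next 'timestep count' line.

    Returns (timestep, n) or None at end of file.
    """
    while True:
        line = f.readline()
        if not line:                 # end of file
            return None
        fields = line.split()
        if len(fields) != 2:         # blank line or something unexpected
            continue
        try:
            return float(fields[0]), int(fields[1])
        except ValueError:
            continue                 # e.g. the 'TIMESTEP  PARTICLES' header

# Two blocks taken from the sample data, the second preceded by a blank line.
SAMPLE = (
    "TIMESTEP  PARTICLES\n"
    "0.00500103 2\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "\n"
    "TIMESTEP  PARTICLES\n"
    "0.00500103 1\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903\n"
)

f = io.StringIO(SAMPLE)
found = []
while True:
    pair = read_counts(f)
    if pair is None:
        break
    timestep, n = pair
    f.readline()                     # column names
    for _ in range(n):
        f.readline()                 # data rows
    found.append(pair)
print(found)
```

In the loops above, the two placeholder reads (column names and data rows) are where np.genfromtxt(f, dtype=None, names=True, max_rows=n) would go.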
