I am new to Python programming.

My task is the following:

I have a HUGE txt file (20+GB) with a lot of data. The structure is this:

Crap
Crap
Crap
...
Crap
Crap
Useful Data = x y z
Useful Data 2 = x2 y2 z2
Crap
Crap
...
Crap
Crap
Useful Data = x' y' z'
Useful Data 2 = x2' y2' z2'
Crap
Crap...

And so on like this for 5000 objects

I have to take every x, y and z and put them in a file which should look like the following:

x y z x2 y2 z2
x' y' z' x2' y2' z2'
x'' y'' z'' x2'' y2'' z2''

......and so on (I should have 5000 rows).

I thought regular expressions would be a good fit for this task. I've written the following, but I'm a real beginner and can't get any further:

f_in_name="starout.txt"  #input file
f_out_name="cmposvel"    #output file
f_in = open(f_in_name)
for l in f_in:
    if "system_time" in l:
        time=re.compile('^  system_time  =\s+(\S+)')
    elif "com_pos" in l:
        poscm=re.compile('^  com_pos =\s+(\S+)\s+(\S+)\s+(\S+)')
    elif "com_vel" in l:    
        velcm=re.compile('^  com_vel =\s+(\S+)\s+(\S+)\s+(\S+)')
        #how do I write t,x,y,z,vx,vy,vz in the output?

How do I write the (\S+) groups to the output? Also, does re.compile search only the current line or the whole document? I'm confused. Can someone help me? I really need this to make a plot and have no idea how to do it.

1 Answer

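re.compile only prepares a regular expression for use, so you would call it once, outside your loop. It does not apply the pattern to anything. To apply it, call the search or match method of the compiled pattern (match only matches at the start of the string, while search scans the whole string). Either way, only the string you pass in is examined, so inside your loop that is the current line. Each call returns a match object, or None if the string does not match, and the match object holds the captured groups of your data.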

You can extract the groups to get to the useful stuff. For example:

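import re

# Compile once; a raw string keeps the backslashes intact.
my_re = re.compile(r"stuff=\s+(\S+)\s+(\S+)")
line = "stuff= foo bar"
matches = my_re.search(line)
if matches:
    print(matches.groups())  # tuple of the captured strings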
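
To relate this to the patterns in the question, here is a minimal sketch using a made-up sample line; the exact spacing in the real file is an assumption, so the pattern is loosened to tolerate any leading whitespace. It shows that groups() returns a tuple of strings, which can be joined for output or converted to floats.

```python
import re

# Pattern adapted from the question; compiled once and reused.
POS_RE = re.compile(r"^\s*com_pos\s*=\s+(\S+)\s+(\S+)\s+(\S+)")

# Hypothetical sample line; the real file's spacing may differ.
sample_line = "  com_pos =  1.5 -2.0 3.25"

m = POS_RE.search(sample_line)
if m:
    fields = m.groups()                  # tuple of three strings
    joined = " ".join(fields)            # text ready to write to a file
    values = [float(v) for v in fields]  # numbers, if you need them
    print(joined)
```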

2 Comments

Thanks a lot, man. How can I deal with the fact that there are 5000 matches in my file? For example, there are 5000 lines which have the format '^ system_time =\s+(\S+)', so how can I write them in the right order in my output?
Loop through them, and do the re.search in the loop (as you are in your sample code).
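
One way to sketch that loop is shown below. It assumes each record lists system_time, then com_pos, then com_vel, in that order, which is a guess based on the question's code. The names extract_rows and convert are made up for this illustration. Iterating over the file object reads one line at a time, so the 20+GB file is never loaded into memory, and rows come out in the same order as they appear in the input.

```python
import re

# Patterns adapted from the question; the loosened whitespace is an assumption.
TIME_RE = re.compile(r"^\s*system_time\s*=\s+(\S+)")
POS_RE = re.compile(r"^\s*com_pos\s*=\s+(\S+)\s+(\S+)\s+(\S+)")
VEL_RE = re.compile(r"^\s*com_vel\s*=\s+(\S+)\s+(\S+)\s+(\S+)")


def extract_rows(lines):
    """Yield one 't x y z vx vy vz' string per record, in input order."""
    row = []
    for line in lines:
        m = TIME_RE.search(line)
        if m:
            row = [m.group(1)]       # a new record starts at system_time
            continue
        m = POS_RE.search(line)
        if m and len(row) == 1:
            row.extend(m.groups())   # x, y, z
            continue
        m = VEL_RE.search(line)
        if m and len(row) == 4:
            row.extend(m.groups())   # vx, vy, vz
            yield " ".join(row)      # record complete
            row = []


def convert(in_name, out_name):
    """Stream the input file line by line and write one row per record."""
    with open(in_name) as f_in, open(out_name, "w") as f_out:
        for row in extract_rows(f_in):
            f_out.write(row + "\n")
```

With the file names from the question, this would be called as convert("starout.txt", "cmposvel"). If the three lines can appear in a different order in the real file, the length checks on row would need adjusting.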
