Extracting data from text file with regex

Question

I'm trying to extract three lists of data from a text file using regex.

The file structure is a metadata block followed by values, repeated:

#
#text
#text
#
9.2318434E-5 -1.3870514E-9 1.0E-4 7.0E-5 9.2318434E-5 9.225606E-5 9.225606E-5 2.5E-4 2.5E-4
9.230842E-5 -1.3756367E-9 1.0E-4 7.0E-5 9.230842E-5 9.225539E-5 9.225539E-5 0.00225 0.00225
9.230592E-5 -1.3935526E-9 1.0E-4 7.0E-5 9.230592E-5 9.2255046E-5 9.2255046E-5 0.00275 0.00275

#
#text
#text
#
9.2318434E-5 -1.3870514E-9 1.0E-4 7.0E-5 9.2318434E-5 9.225606E-5 9.225606E-5 2.5E-4 2.5E-4
9.231593E-5 -1.3816212E-9 1.0E-4 7.0E-5 9.231593E-5 9.225253E-5 9.225253E-5 7.5E-4 7.5E-4
9.230592E-5 -1.3935526E-9 1.0E-4 7.0E-5 9.230592E-5 9.2255046E-5 9.2255046E-5 0.00275 0.00275

#
#text
#text
#
9.2318434E-5 -1.3870514E-9 1.0E-4 7.0E-5 9.2318434E-5 9.225606E-5 9.225606E-5 2.5E-4 2.5E-4
9.231593E-5 -1.3816212E-9 1.0E-4 7.0E-5 9.231593E-5 9.225253E-5 9.225253E-5 7.5E-4 7.5E-4
9.231343E-5 -1.3962527E-9 1.0E-4 7.0E-5 9.231343E-5 9.225581E-5 9.225581E-5 0.00125 0.00125

I've been trying the following:

with open(file) as newfile:
    data = re.findall(r'^([#][\n][0-9])[\s\S]*([\n][\n])$', newfile.read())

Each block of data starts with #\n[0-9] and ends with \n\n, and I need to take every character between the start and the end, hence [\s\S]*. It doesn't seem to be working; any help would be great.

I don't understand whether you want the full three lines or every number in those three lines.
– Frenchy, Commented Apr 3, 2019 at 16:09
I expect the output to be a list with three elements, each a string containing all the numbers in that section, i.e. mylist = ["numbers\n numbers\n numbers\n", "numbers\n numbers\n numbers\n", "numbers\n numbers\n numbers\n"]
– Ryan, Commented Apr 3, 2019 at 16:13

ctwheels · Accepted Answer · 2019-04-03 16:16:31Z

2

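As a side note, you don't need to wrap every token in [] (for example, [#] can simply be written as #).

The following pattern matches each block of numbers; multiline mode must be enabled so that ^ and $ anchor at line boundaries: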

^(?<=#\n)\d[^#]*$

^ assert position at the start of the line
(?<=#\n) positive lookbehind ensuring what precedes matches # followed by a newline character \n
\d match a digit
[^#]* match any character except # (newlines included) any number of times (greedy, so it matches as many characters as possible, until it reaches another #)
$ assert position at the end of the line
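
As a minimal sketch of how this pattern might be applied in Python: it assumes the re.MULTILINE flag is passed, and the sample string is shortened, made-up data in the same layout as the question's file, not the real values. Each match can carry a trailing newline, so the sketch strips it.

```python
import re

# Shortened stand-in for the file contents (hypothetical values, same layout)
sample = (
    "#\n#text\n#text\n#\n"
    "1.0E-4 2.5E-4\n"
    "7.0E-5 0.00225\n"
    "\n"
    "#\n#text\n#text\n#\n"
    "9.0E-5 7.5E-4\n"
    "3.0E-5 0.00125\n"
)

# re.MULTILINE lets ^ and $ match at the start and end of every line
pattern = re.compile(r"^(?<=#\n)\d[^#]*$", re.MULTILINE)

# A match may end with a newline, so strip surrounding whitespace
blocks = [match.strip() for match in pattern.findall(sample)]
print(blocks)
```

Note that \d requires the first value of a block to start with a digit, so a block whose first number is negative would be skipped.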

Alternatively, and very simply, you could probably use ^\d.* (again in multiline mode), which matches each data line individually rather than a whole block.

^ assert position at the start of the line
\d match a digit
.* match any character (except for line terminators) any number of times
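
A short sketch of what the simpler pattern returns, again assuming re.MULTILINE and using made-up sample data: every data line comes back as its own string, so the lines would still need to be regrouped into blocks afterwards.

```python
import re

# Hypothetical, shortened sample in the question's layout
sample = "#\n#text\n#\n1.0E-4 2.5E-4\n7.0E-5 0.00225\n\n#\n#text\n#\n9.0E-5 7.5E-4\n"

# . does not match newlines, so each match stops at the end of its line
lines = re.findall(r"^\d.*", sample, re.MULTILINE)
print(lines)
```

Lines that begin with a minus sign are not matched by \d, which is worth keeping in mind if the first column can be negative.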

edited Apr 3, 2019 at 16:16

answered Apr 3, 2019 at 16:11


Pedro Lobito · Accepted Answer · 2019-04-03 16:25:38Z

You can use:

import re
with open("file.txt") as f:
    t = f.read().strip()

lists = []
blocks = re.findall(r"^[\d.E\s-]+$", t, re.MULTILINE)  # one match per block of numeric lines
for block in blocks:
    # split() with no arguments splits on any whitespace, including newlines
    lists.append([float(value) for value in block.split()])

print(lists)

Output:

[[9.2318434e-05, -1.3870514e-09, 0.0001, 7e-05, 9.2318434e-05, 9.225606e-05, 9.225606e-05, 0.00025, 0.00025, 9.230842e-05, -1.3756367e-09, 0.0001, 7e-05, 9.230842e-05, 9.225539e-05, 9.225539e-05, 0.00225, 0.00225, 9.230592e-05, -1.3935526e-09, 0.0001, 7e-05, 9.230592e-05, 9.2255046e-05, 9.2255046e-05, 0.00275, 0.00275], [9.2318434e-05, -1.3870514e-09, 0.0001, 7e-05, 9.2318434e-05, 9.225606e-05, 9.225606e-05, 0.00025, 0.00025, 9.231593e-05, -1.3816212e-09, 0.0001, 7e-05, 9.231593e-05, 9.225253e-05, 9.225253e-05, 0.00075, 0.00075, 9.230592e-05, -1.3935526e-09, 0.0001, 7e-05, 9.230592e-05, 9.2255046e-05, 9.2255046e-05, 0.00275, 0.00275], [9.2318434e-05, -1.3870514e-09, 0.0001, 7e-05, 9.2318434e-05, 9.225606e-05, 9.225606e-05, 0.00025, 0.00025, 9.231593e-05, -1.3816212e-09, 0.0001, 7e-05, 9.231593e-05, 9.225253e-05, 9.225253e-05, 0.00075, 0.00075, 9.231343e-05, -1.3962527e-09, 0.0001, 7e-05, 9.231343e-05, 9.225581e-05, 9.225581e-05, 0.00125, 0.00125]]


Andry · Accepted Answer · 2019-04-03 16:31:01Z

0

You can also solve this problem without using regex at all. Since you only want the lines that do not start with #, you can read the file line by line and check whether each line starts with # or not. Then split each remaining line on whitespace to get all the numbers as strings.

Here is an example using list comprehension:

with open(file) as newfile:  # the with block closes the file automatically
    numbers = [number for line in newfile if not line.startswith('#') for number in line.split()]
print(numbers)  # flat list of all the numbers as strings
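
The comprehension above yields one flat list. If the blocks need to stay separate, as the question asks, one possible sketch groups consecutive data lines with itertools.groupby. The function name read_blocks and the sample lines are hypothetical, introduced only for illustration, and the values are converted to floats here as an assumption about what is wanted.

```python
from itertools import groupby


def read_blocks(lines):
    """Group consecutive non-comment, non-blank lines into lists of floats."""
    def is_data(line):
        stripped = line.strip()
        return bool(stripped) and not stripped.startswith("#")

    blocks = []
    # groupby yields runs of adjacent lines that share the same is_data value
    for keep, group in groupby(lines, key=is_data):
        if keep:
            blocks.append([float(tok) for line in group for tok in line.split()])
    return blocks


# Hypothetical, shortened sample; with a real file, pass the open file object
sample = [
    "#\n", "#text\n", "#\n",
    "1.0E-4 2.5E-4\n", "7.0E-5 0.00225\n",
    "\n",
    "#\n", "#text\n", "#\n",
    "9.0E-5 7.5E-4\n",
]
print(read_blocks(sample))
```

Because a file object iterates line by line, read_blocks(newfile) inside the with block would work the same way on the actual file.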

answered Apr 3, 2019 at 16:31
