Extracting repetitions through regex in Python

Question

I am trying to use a regex to extract some useful data from a large chunk of text.
Sample string:

test 1:
hello op1 yviphf
hello op2 vipqwe
test 2:
hello op3
hello op4 vipgt
hello op5 zcv

The sample above contains 2 test numbers, but the real data has many more. I want to extract op1, op2, op3, op4, op5 and also the corresponding test numbers. The number of ops in each test can vary.
Below is the pattern I tried, but it does not work:

test\s(\d+).*?(?:hello\s+(\S+).*?\n)+

The output could be a list of lists. Each inner list would have the test number as its first element and a list of all the ops for that test as its second element.
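
A likely reason the pattern above does not help, sketched under two assumptions: the sample is held in a hypothetical variable `sample`, and a trailing newline has been added to it. Without a dot-all flag, `.` cannot cross the line break after `test 1:`, and a repeated capturing group keeps only the value from its last repetition.

```python
import re

# Assumption: the sample from the question, with a trailing newline added.
sample = (
    "test 1:\nhello op1 yviphf\nhello op2 vipqwe\n"
    "test 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv\n"
)
pattern = r'test\s(\d+).*?(?:hello\s+(\S+).*?\n)+'

# Without re.DOTALL, '.' stops at the newline after "test 1:",
# so "hello" is never reached and nothing matches.
no_flag = re.findall(pattern, sample)

# With re.DOTALL the pattern matches, but group 2 is overwritten on
# every repetition, so only the last op of each test survives.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(no_flag)
print(with_flag)
```

This prints `[]` and then `[('1', 'op2'), ('2', 'op5')]`, which is why a single pattern with a repeated group cannot return every op on its own.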

Do it in two steps: first match the complete sections, then match the op values within each section.
– HamZa, Commented Jan 11, 2016 at 12:35
Are you looking for the /s flag? See regex101.com/r/nU8aA5/1
– Jan, Commented Jan 11, 2016 at 12:38
You should give a more realistic example string, because it is hard to answer as it stands. (What do the ops really look like? Does the word "hello" start each line?) If you have a lot of data, working line by line is better, and you may be able to avoid the regex altogether and get a faster result.
– Casimir et Hippolyte, Commented Jan 11, 2016 at 13:14
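
The two-step idea from the first comment could look roughly like the sketch below. The names `section_re`, `op_re` and `sample` are illustrative, and the section pattern assumes every header has the form `test N:` as in the sample.

```python
import re

sample = (
    "test 1:\nhello op1 yviphf\nhello op2 vipqwe\n"
    "test 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"
)

# Step 1: one match per section. The lazy body stops at the lookahead,
# i.e. just before the next "test N:" header or at the end of the input.
section_re = re.compile(r'test\s+(\d+):(.*?)(?=test\s+\d+:|\Z)', re.DOTALL)

# Step 2: inside a section body, capture the word after each leading "hello".
op_re = re.compile(r'^hello\s+(\S+)', re.MULTILINE)

result = [
    [m.group(1), op_re.findall(m.group(2))]
    for m in section_re.finditer(sample)
]
print(result)
```

For the sample this gives `[['1', ['op1', 'op2']], ['2', ['op3', 'op4', 'op5']]]`, the list-of-lists shape asked for in the question.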

Wiktor Stribiżew · Accepted Answer · 2016-01-11 13:23:25Z

2

I suggest a 3-step approach based on regexes.

First, get all the test numbers with r'test\s*(\d+)' and re.findall (it returns only the numbers, because the \d+ subpattern is inside a capturing group).
Second, split the input string with the test\s*\d+ regex to obtain the subsections containing the hello codes, and run the hello\s+(\S+) regex (or (?m)^hello\s*(\S+) if hello is at the start of a line) on each non-empty chunk (again, re.findall returns only the \S+ submatches, because \S+ is enclosed in a capturing group).
Third, merge the two lists into a list of tuples.

Python demo:

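import re
test_str = "test 1:\nhello op1 yviphf\nhello op2 vipqwe\ntest 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"
# Step 1: collect the test numbers
res1 = [y for y in re.findall(r'test\s*(\d+)', test_str) if y]
# Step 2: split on the test headers and collect the ops from each non-empty chunk
res2 = [re.findall(r'(?m)^hello\s*(\S+)', b) for b in re.split(r'test\s*\d+', test_str) if b]
# Step 3: pair numbers with ops (in Python 3, zip is lazy, so wrap it in list to print it)
print(list(zip(res1, res2)))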

Result: [('1', ['op1', 'op2']), ('2', ['op3', 'op4', 'op5'])]
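
One caveat with the demo above: dropping empty chunks keeps the two lists aligned only when nothing precedes the first test header. If the input starts with other text, that leading chunk is non-empty and the pairs shift by one. A possible variant, sketched here with a hypothetical helper named `extract_ops`, puts the capturing group in the split pattern so that each number stays next to its own chunk.

```python
import re

def extract_ops(text):
    """Return [(test_number, [op, ...]), ...], one tuple per 'test N' section."""
    # With a capturing group, re.split keeps the captured numbers:
    # [text_before_first_header, num1, body1, num2, body2, ...]
    parts = re.split(r'test\s*(\d+)', text)
    numbers, bodies = parts[1::2], parts[2::2]
    return [
        (num, re.findall(r'(?m)^hello\s+(\S+)', body))
        for num, body in zip(numbers, bodies)
    ]

test_str = (
    "test 1:\nhello op1 yviphf\nhello op2 vipqwe\n"
    "test 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"
)
print(extract_ops(test_str))
# Leading text before the first header does not disturb the pairing.
print(extract_ops("some header\n" + test_str))
```

Both calls print `[('1', ['op1', 'op2']), ('2', ['op3', 'op4', 'op5'])]`.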

edited Jan 11, 2016 at 13:23

answered Jan 11, 2016 at 13:13

tglaria · Accepted Answer · 2016-01-11 13:30:03Z

1

Do you NEED to use REGEX?

If not, you could get away with loops, string comparisons and splits:

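# text holds the input; here, the sample string from the question
text = "test 1:\nhello op1 yviphf\nhello op2 vipqwe\ntest 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"
data = {}
key = None
for linea in text.split('\n'):
    parts = linea.split()
    if not parts:
        continue  # skip blank lines
    if parts[0] == "test":
        # header line such as "test 1:"; drop the trailing colon
        key = parts[1].rstrip(':')
        data[key] = []
    elif key is not None and len(parts) > 1:
        # data line such as "hello op1 yviphf"; the op is the second word
        data[key].append(parts[1])

print(data)
> {'1': ['op1', 'op2'], '2': ['op3', 'op4', 'op5']}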

answered Jan 11, 2016 at 13:30
