
I am trying to extract some useful data from a large chunk of text using regex.
Sample string:

test 1:
hello op1 yviphf
hello op2 vipqwe
test 2:
hello op3
hello op4 vipgt
hello op5 zcv

The sample above contains 2 tests, but the real data has many more. I want to extract op1, op2, op3, op4 and op5 together with the corresponding test numbers. The number of ops in each test can vary.
Below is the pattern I tried, but it does not work:

test\s(\d+).*?(?:hello\s+(\S+).*?\n)+

The output could be a list of lists: each inner list would have the test number as its first element and a list of all the ops for that test as its second element.

  • Do it in two steps: first match complete sections, then for each section match the op values. Commented Jan 11, 2016 at 12:35
  • Do you NEED to use regex? Commented Jan 11, 2016 at 12:36
  • Are you looking for the /s flag? See regex101.com/r/nU8aA5/1 Commented Jan 11, 2016 at 12:38
  • Would this do? Commented Jan 11, 2016 at 12:51
  • You should give a better, more realistic example string, because it's hard to answer as it stands (what do the ops really look like? does the word "hello" start each line?). If you have a lot of data, working line by line is better, and perhaps you can avoid the regex and get a faster result. Commented Jan 11, 2016 at 13:14

2 Answers


I suggest a 3-step approach based on regexes.

  • First, get all the test numbers with r'test\s*(\d+)' and re.findall (this returns only the numbers, because the \d+ subpattern is inside a capturing group).
  • Second, split the input string with the test\s*\d+ regex to obtain the subsections containing the hello lines, and run hello\s+(\S+) (or (?m)^hello\s*(\S+) if hello always starts a line) on each non-empty chunk (again, re.findall returns only the \S+ submatches because they are enclosed in a capturing group).
  • Third, merge the two lists into a list of tuples.

Python demo:

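import re
test_str = "test 1:\nhello op1 yviphf\nhello op2 vipqwe\ntest 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"
# Test numbers, in order of appearance
res1 = re.findall(r'test\s*(\d+)', test_str)
# One list of op names per non-empty chunk between the "test N" headers
res2 = [re.findall(r'(?m)^hello\s*(\S+)', b) for b in re.split(r'test\s*\d+', test_str) if b]
# list() is needed on Python 3, where zip returns an iterator
print(list(zip(res1, res2)))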

Result: [('1', ['op1', 'op2']), ('2', ['op3', 'op4', 'op5'])]
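A possible variation on the same idea, offered as a sketch rather than a definitive solution: when the pattern passed to re.split contains a capturing group, the captured text is kept in the returned list, so a single split yields both the test numbers and their chunks, already aligned. The names parts, numbers, chunks and result are only illustrative, and the sketch assumes every op line starts with hello, as in the sample string.

```python
import re

test_str = (
    "test 1:\nhello op1 yviphf\nhello op2 vipqwe\n"
    "test 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"
)

# Because the pattern has a capturing group, re.split keeps the captured
# test numbers: [text before first header, num1, chunk1, num2, chunk2, ...]
parts = re.split(r'test\s*(\d+)', test_str)
numbers = parts[1::2]   # captured test numbers
chunks = parts[2::2]    # text following each "test N" header

# Build the list of lists asked for in the question:
# [test number, [op names found in that test's chunk]]
result = [
    [num, re.findall(r'(?m)^hello\s+(\S+)', chunk)]
    for num, chunk in zip(numbers, chunks)
]
print(result)
```

Taking the numbers and chunks from the same split means they stay paired even if some text precedes the first test header, since that text always lands in parts[0].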


Do you NEED to use REGEX?

If not, you could get away with loops, string comparisons and splits:

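# Sample input from the question
text = "test 1:\nhello op1 yviphf\nhello op2 vipqwe\ntest 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"

data = {}
key = None
for line in text.splitlines():
    parts = line.split()
    if not parts:
        continue  # skip blank lines
    if parts[0] == "test":
        key = parts[1].rstrip(":")  # "1:" -> "1"
        data[key] = []
    elif key is not None and len(parts) > 1:
        data[key].append(parts[1])  # the second word is the op name

print(data)
> {'1': ['op1', 'op2'], '2': ['op3', 'op4', 'op5']}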
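If the list-of-lists shape from the question is preferred over a dict, the same loop can be adapted along these lines. This is only a sketch: parse_tests is a hypothetical helper name, and it assumes every header looks like "test N:" and every op line starts with "hello", as in the sample. Since it accepts any iterable of lines, an open file object could be passed instead of a split string, which lets large inputs be processed line by line.

```python
def parse_tests(lines):
    """Return [[test_number, [op, ...]], ...] from an iterable of lines.

    Hypothetical helper; assumes "test N:" headers and "hello <op> ..." lines.
    """
    result = []
    for line in lines:
        words = line.split()
        if len(words) < 2:
            continue  # blank or one-word line: nothing to extract
        if words[0] == "test":
            # Start a new entry: "1:" -> "1", with an empty op list
            result.append([words[1].rstrip(":"), []])
        elif words[0] == "hello" and result:
            # Attach the op name to the most recent test
            result[-1][1].append(words[1])
    return result


sample = (
    "test 1:\nhello op1 yviphf\nhello op2 vipqwe\n"
    "test 2:\nhello op3\nhello op4 vipgt\nhello op5 zcv"
)
parsed = parse_tests(sample.splitlines())
print(parsed)
```

A list keeps the tests in input order and tolerates repeated test numbers, which a dict keyed by number would silently merge or overwrite.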
