Unable to find the correct regex in python

Question

I have searched for hours but I can't find the correct regular expression to match a simple pattern. Given this text (the stdout of a listing of logical volumes by volume group):

rootvg:
hd5                 boot       1     1     1    closed/syncd  N/A
hd4                 jfs        38    38    1    open/syncd    /
datavg:
data01lv            jfs        7     7     1    open/syncd    /data1
data02lv            jfs        7     7     1    open/syncd    /data2

I would like to get this kind of result from my regular expression (with regex.findall(text), for example):

    [(u'rootvg', u'hd5 boot 1 1 1 closed/syncd N/A\nhd4 jfs 38 38 1 open/syncd /\n'),(u'datavg', u'data01lv jfs 7 7 1 open/syncd /data1\ndata02lv jfs 7 7 1 open/syncd /data2')]

But the best I can get is with this pattern: ^(?P<vgname>\w+):\s(?P<lv>[\w\s\.\_\/-]+)+ which gives this result with findall:

[(u'rootvg', u'hd5 boot 1 1 1 closed/syncd N/A\nhd4 jfs 38 38 1 open/syncd /\ndatavg')]

Why don't you use split() here instead of tinkering with fragile regular expressions?
– user2665694, Commented Dec 19, 2012 at 18:23

Andrew Clark · Accepted Answer · 2012-12-19 18:30:04Z

Try the following:

re.findall(r'^(\w+):(.*?)(?=^\w+:|\Z)', text, flags=re.DOTALL | re.MULTILINE)

Example:

>>> text = '''rootvg:
... hd5                 boot       1     1     1    closed/syncd  N/A
... hd4                 jfs        38    38    1    open/syncd    /
... datavg:
... data01lv            jfs        7     7     1    open/syncd    /data1
... data02lv            jfs        7     7     1    open/syncd    /data2'''
>>> re.findall(r'^(\w+):(.*?)(?=^\w+:|\Z)', text, flags=re.DOTALL | re.MULTILINE)
[('rootvg', '\nhd5                 boot       1     1     1    closed/syncd  N/A\nhd4                 jfs        38    38    1    open/syncd    /\n'), ('datavg', '\ndata01lv            jfs        7     7     1    open/syncd    /data1\ndata02lv            jfs        7     7     1    open/syncd    /data2')]

The re.DOTALL flag makes it so . can match line break characters, and the re.MULTILINE flag makes it so ^ and $ can match at the beginning and end of lines, respectively, instead of just the beginning and end of the string.
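
To see the effect of each flag in isolation, here is a small sketch on a made-up two-line string (the `sample` value and the variable names are illustrative, not taken from the question):

```python
import re

# Hypothetical two-line sample, not the question's data.
sample = "vg1:\nlv1"

# Without re.MULTILINE, ^ only matches at the very start of the string.
start_default = re.findall(r'^\w+', sample)
# With re.MULTILINE, ^ also matches right after each newline.
start_multiline = re.findall(r'^\w+', sample, flags=re.MULTILINE)

# Without re.DOTALL, . stops at the newline.
dot_default = re.findall(r':.*', sample)
# With re.DOTALL, . also matches the newline, so .* runs to the end.
dot_dotall = re.findall(r':.*', sample, flags=re.DOTALL)

print(start_default)    # ['vg1']
print(start_multiline)  # ['vg1', 'lv1']
print(dot_default)      # [':']
print(dot_dotall)       # [':\nlv1']
```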

Explanation:

^            # match at the start of a line
(\w+)        # match one or more word characters (letters, digits, underscore) and capture in group 1
:            # match a literal ':'
(.*?)        # match zero or more characters, as few as possible, and capture in group 2
(?=          # start lookahead (only match if following regex can match)
   ^\w+:       # start of line followed by word characters then ':'
   |           # OR
   \Z          # end of the string
)            # end lookahead
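
The commented breakdown above can also live in the code itself by adding the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. The following is only a sketch: the names VG_BLOCK and parse_volume_groups are made up for illustration, and collapsing runs of spaces (and dropping leading and trailing newlines) is an assumption about what the asker wants, based on the desired output shown in the question.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments stay
# next to the pieces they describe. VG_BLOCK is a hypothetical name.
VG_BLOCK = re.compile(r'''
    ^(\w+):          # group name at the start of a line, then a colon
    (.*?)            # group body, as few characters as possible
    (?=^\w+:|\Z)     # stop before the next group header or at the end
''', flags=re.DOTALL | re.MULTILINE | re.VERBOSE)


def parse_volume_groups(text):
    """Return (group name, body) pairs with runs of whitespace collapsed."""
    result = []
    for name, body in VG_BLOCK.findall(text):
        # strip() drops the newline left over after the colon; split/join
        # collapses the column padding inside each line to single spaces.
        lines = [' '.join(line.split()) for line in body.strip().splitlines()]
        result.append((name, '\n'.join(lines)))
    return result


text = '''rootvg:
hd5                 boot       1     1     1    closed/syncd  N/A
hd4                 jfs        38    38    1    open/syncd    /
datavg:
data01lv            jfs        7     7     1    open/syncd    /data1
data02lv            jfs        7     7     1    open/syncd    /data2'''

for name, body in parse_volume_groups(text):
    print(name)
    print(body)
```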

Alternatively, you could use re.split() with a much simpler regex to get similar output. It shouldn't be too difficult to transform this into the format you need:

>>> re.split(r'^(\w+):', text, flags=re.MULTILINE)
['', 'rootvg', '\nhd5                 boot       1     1     1    closed/syncd  N/A\nhd4                 jfs        38    38    1    open/syncd    /\n', 'datavg', '\ndata01lv            jfs        7     7     1    open/syncd    /data1\ndata02lv            jfs        7     7     1    open/syncd    /data2']

Here is how you might turn this into your desired format:

>>> matches = re.split(r'^(\w+):', text, flags=re.MULTILINE)
>>> [(v, matches[i+1]) for i, v in enumerate(matches) if i % 2]
[('rootvg', '\nhd5                 boot       1     1     1    closed/syncd  N/A\nhd4                 jfs        38    38    1    open/syncd    /\n'), ('datavg', '\ndata01lv            jfs        7     7     1    open/syncd    /data1\ndata02lv            jfs        7     7     1    open/syncd    /data2')]
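
As an alternative to the index arithmetic, the names and bodies can be paired by zipping two slices of the split result: odd indices hold the captured names and the even indices after them hold the bodies. This is a sketch on a shortened stand-in for the question's text; the names parts, pairs and by_group are illustrative.

```python
import re

# Shortened stand-in for the question's data.
text = "rootvg:\nhd5 boot\nhd4 jfs\ndatavg:\ndata01lv jfs\n"

parts = re.split(r'^(\w+):', text, flags=re.MULTILINE)
# parts[0] is whatever precedes the first header (an empty string here);
# after that, names and bodies alternate.
pairs = list(zip(parts[1::2], parts[2::2]))
print(pairs)

# If lookup by group name is more convenient, the same pairs make a dict.
by_group = dict(pairs)
print(by_group['datavg'].split())
```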

Kenan Banks · Accepted Answer · 2012-12-19 18:48:26Z

#!/usr/bin/env python

"""
    Demo code for Stackoverflow question:
    http://stackoverflow.com/questions/13958548/unable-to-find-the-correct-regex-in-python#13958634
"""

import io

text = """
rootvg:
hd5                 boot       1     1     1    closed/syncd  N/A
hd4                 jfs        38    38    1    open/syncd    /
datavg:
data01lv            jfs        7     7     1    open/syncd    /data1
data02lv            jfs        7     7     1    open/syncd    /data2
"""


def gen_lines(text):
    """ yield non-blank lines in input """
    for line in text:
        if line.strip():
            yield line

def gen_groups(text):
    """ yield (group name, list of split data lines) pairs """
    group = None
    data = []
    for line in gen_lines(text):

        # We found a new group label
        if len(line.split()) == 1 and line.strip().endswith(':'):
            if group:
                yield group, data
            group = line.strip()[:-1]
            data = []

        # We found a data line
        elif group:
            data.append(line.split())

    # We're done with input; yield final group
    if group:
        yield group, data

def main():

    # Mimics behavior of mock_file = open('input.txt')
    mock_file = io.StringIO(text)

    for group, data in gen_groups(mock_file):
        print(group)
        for d in data:
            print(d)

main()

And the output:

rootvg
['hd5', 'boot', '1', '1', '1', 'closed/syncd', 'N/A']
['hd4', 'jfs', '38', '38', '1', 'open/syncd', '/']
datavg
['data01lv', 'jfs', '7', '7', '1', 'open/syncd', '/data1']
['data02lv', 'jfs', '7', '7', '1', 'open/syncd', '/data2']
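
If the list of (name, string) tuples from the question is still wanted, the field lists produced by a line-by-line parser can be joined back together. The sketch below uses a compact stand-in parser built only on str methods (parse_groups is a hypothetical name, not part of the answer above). It assumes Python 3.7+, where dicts keep insertion order, and that a header is any line consisting of a single token ending in ':'.

```python
def parse_groups(text):
    """Map each group name to a list of its data lines, split into fields."""
    groups = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 1 and fields[0].endswith(':'):
            # A lone token ending in ':' starts a new group.
            current = fields[0][:-1]
            groups[current] = []
        elif current is not None:
            groups[current].append(fields)
    return groups


text = """
rootvg:
hd5                 boot       1     1     1    closed/syncd  N/A
hd4                 jfs        38    38    1    open/syncd    /
datavg:
data01lv            jfs        7     7     1    open/syncd    /data1
data02lv            jfs        7     7     1    open/syncd    /data2
"""

groups = parse_groups(text)
# Join the fields of each row with single spaces and the rows with newlines.
pairs = [(name, '\n'.join(' '.join(row) for row in rows))
         for name, rows in groups.items()]
print(pairs)
```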
