Python multiline regex substitution

Question

I feel bad for asking yet another regex question, but this has been driving me crazy for the past week.

I am trying to use regular expressions in python to replace some text that looks like this:

text = """some stuff
line with text
other stuff
[code language='cpp']
#include <cstdio>

int main() {
    printf("Hello");
}
[/code]
Maybe some
other text"""

What I want to do is capture the text inside the [code] tags, add a tab (\t) in front of each line and then replace all the [code]...[/code] by this new lines with the tabs prepended. That is, I want the result to look like:

"""some stuff
line with text
other stuff

    #include <cstdio>

    int main() {
        printf("Hello");
    }

Maybe some
other text"""

I am using the following snippet.

class CodeParser(object):
    """Parse a blog post and turn it into markdown."""

    def __init__(self):
        self.regex = re.compile('.*\[code.*?\](?P<code>.*)\[/code\].*',
                                re.DOTALL)

    def parse_code(self, text):
        """Parses code section from a wp post into markdown."""
        code = self.regex.match(text).group('code')
        code = ['\t%s' % s for s in code.split('\n')]
        code = '\n'.join(code)
        return self.regex.sub('\n%s\n' % code, text)

The problem with this is that it matches all the characters before and after the code tags because of the initial and final .* and when I perform the replacement, these are removed. If I remove the .*s, the re does not match anything anymore.

I thought this could be a problem with newlines, so I tried replacing all the '\n' with, say, '¬', performing the matching, and then changing the '¬' back to '\n', but I didn't have any luck with this approach.

If anyone has a better method of accomplishing what I want to accomplish, I am open to suggestions.

Thank you.

gymbrall · Accepted Answer · 2015-07-11 20:36:50Z

1

You're on the right track. Instead of regex.match, use regex.search. That way you can get rid of the leading and trailing .*s.

Try this:
    def __init__(self):
        self.regex = re.compile('\[code.*?\](?P<code>.*)\[/code\]',
                                re.DOTALL)


    def parse_code(self, text):
        """Parses code section from a wp post into markdown."""
        # Here we are using search which finds the pattern anywhere in the 
        # string rather than just at the beginning
        code = self.regex.search(text).group('code')
        code = ['\t%s' % s for s in code.split('\n')]
        code = '\n'.join(code)

        return self.regex.sub('\n%s\n' % code, text)

answered Jul 11, 2015 at 20:36

gymbrall

2,09319 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Andrés Over a year ago

Thank you! I should have kept reading on the docs a little bit further down... It's right there

Collectives™ on Stack Overflow

Python multiline regex substitution

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related