2

I feel bad for asking yet another regex question, but this has been driving me crazy for the past week.

I am trying to use regular expressions in python to replace some text that looks like this:

text = """some stuff
line with text
other stuff
[code language='cpp']
#include <cstdio>

int main() {
    printf("Hello");
}
[/code]
Maybe some
other text"""

What I want to do is capture the text inside the [code] tags, add a tab (\t) in front of each line and then replace all the [code]...[/code] by this new lines with the tabs prepended. That is, I want the result to look like:

"""some stuff
line with text
other stuff

    #include <cstdio>

    int main() {
        printf("Hello");
    }

Maybe some
other text"""

I am using the following snippet.

class CodeParser(object):
    """Parse a blog post and turn it into markdown."""

    def __init__(self):
        self.regex = re.compile('.*\[code.*?\](?P<code>.*)\[/code\].*',
                                re.DOTALL)

    def parse_code(self, text):
        """Parses code section from a wp post into markdown."""
        code = self.regex.match(text).group('code')
        code = ['\t%s' % s for s in code.split('\n')]
        code = '\n'.join(code)
        return self.regex.sub('\n%s\n' % code, text)

The problem with this is that it matches all the characters before and after the code tags because of the initial and final .* and when I perform the replacement, these are removed. If I remove the .*s, the re does not match anything anymore.

I thought this could be a problem with newlines, so I tried replacing all the '\n' with, say, '¬', performing the matching, and then changing the '¬' back to '\n', but I didn't have any luck with this approach.

If anyone has a better method of accomplishing what I want to accomplish, I am open to suggestions.

Thank you.

1 Answer 1

1

You're on the right track. Instead of regex.match, use regex.search. That way you can get rid of the leading and trailing .*s.

Try this:
    def __init__(self):
        self.regex = re.compile('\[code.*?\](?P<code>.*)\[/code\]',
                                re.DOTALL)


    def parse_code(self, text):
        """Parses code section from a wp post into markdown."""
        # Here we are using search which finds the pattern anywhere in the 
        # string rather than just at the beginning
        code = self.regex.search(text).group('code')
        code = ['\t%s' % s for s in code.split('\n')]
        code = '\n'.join(code)

        return self.regex.sub('\n%s\n' % code, text)
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you! I should have kept reading on the docs a little bit further down... It's right there

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.