I'm attempting to port Python code to PHP, and I can't seem to convert some of the regexes to their PHP equivalents.

RE_ON_DATE_SMB_WROTE = re.compile(
    u'(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format(
        # Beginning of the line
        u'|'.join((
            # English
            'On',
            # French
            'Le',
            # Polish
            'W dniu',
            # Dutch
            'Op',
            # German
            'Am',
            # Portuguese
            'Em',
            # Norwegian
            u'På',
            # Swedish, Danish
            'Den',
            # Vietnamese
            u'Vào',
        )),
        # Date and sender separator
        u'|'.join((
            ',',
            u'użytkownik'
        )),
        # Ending of the line
        u'|'.join((
            # English
            'wrote', 'sent',
            # French
            u'a écrit',
            # Polish
            u'napisał',
            # Dutch
            'schreef','verzond','geschreven',
            # German
            'schrieb',
            # Portuguese
            'escreveu',
            # Norwegian, Swedish
            'skrev',
            # Vietnamese
            u'đã viết',
        ))
    ))
RE_QUOTATION = re.compile(
    r"""
    (
        (?:
            s
            |
            (?:me*){2,}
        )

        .*

        me*
    )

    [te]*$
    """, re.VERBOSE)
RE_EMPTY_QUOTATION = re.compile(
    r"""
    (
        (?:
            (?:se*)+
            |
            (?:me*){2,}
        )
    )
    e*
    """, re.VERBOSE)

Below is my attempt at the first regex in PHP (it fails on the same test string that the Python version matches):

$RE_ON_DATE_SMB_WROTE = sprintf("#(-*[>]?[ ]?(%s)[ ].*(%s)(.*\n){{0,2}}.*(%s):?-*)#u",
                            join('|', array(
                                // English
                                'On',
                                // French
                                'Le',
                                // Polish
                                'W dniu',
                                // Dutch
                                'Op',
                                // German
                                'Am',
                                // Portuguese
                                'Em',
                                // Norwegian
                                "\p{P}\p{å}",
                                // Swedish, Danish
                                'Den',
                                // Vietnamese
                                "Vào",
                            )),
                            join('|',array(
                                ',',
                                "użytkownik"
                            )),
                            join('|',array(
                                //# English
                                'wrote', 
                                'sent',
                                //# French
                                "a écrit",
                                //# Polish
                                "napisał",
                                //# Dutch
                                'schreef','verzond','geschreven',
                                //# German
                                'schrieb',
                                //# Portuguese
                                'escreveu',
                                //# Norwegian, Swedish
                                'skrev',
                                //# Vietnamese
                                "đã viết",
                            ))
                        );  

I'm running the PHP regex as $result = preg_match($RE_ON_DATE_SMB_WROTE, $test_value, $found_match);

As for the last two regexes, I can't even wrap my head around them.

Hopefully, someone more versed than me in both Python and PHP can give me a hand here. :)

1 Answer

You build the regex with format strings in both Python and PHP, but the two use different placeholder syntax. In Python's str.format, {} (or {0}, {1}, ...) inserts a value, while PHP's sprintf uses %s. Hence, { and } are special in Python format strings and must be doubled when you want a literal brace char, as in {{0,2}}. No such doubling is required in PHP, so the quantifier must be written as {0,2} there.
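To make the brace difference concrete, here is a small Python sketch. The shortened word lists and the sample line are made up for illustration. It builds the same pattern once with str.format, where the quantifier braces are doubled, and once with printf-style %s placeholders (the style PHP's sprintf uses), where they are written once.

```python
import re

# Shortened alternation lists, for illustration only.
starts = '|'.join(('On', 'Le'))
seps = '|'.join((',',))
ends = '|'.join(('wrote', 'sent'))

# str.format: the literal braces of the {0,2} quantifier must be doubled.
with_format = '(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format(
    starts, seps, ends)

# printf-style placeholders: braces are not special, so they are written once.
with_percent = '(-*[>]?[ ]?(%s)[ ].*(%s)(.*\n){0,2}.*(%s):?-*)' % (
    starts, seps, ends)

# Both approaches should produce the identical pattern string.
same_pattern = with_format == with_percent

# Hypothetical test line, not taken from the question.
sample = 'On Mon, Jan 3, 2022, John Doe wrote:'
match = re.search(with_format, sample)
matched_text = match.group(1) if match else None
```

If the doubled braces are carried over into the sprintf version unchanged, the pattern ends up containing a literal {{0,2}} instead of the intended quantifier.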

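Also, you have "\p{P}\p{å}" in the PHP regex declaration, while in Python you just have u'På'. \p{...} is Unicode property syntax (\p{P} matches any punctuation char), not a way to write the letters P and å, so I guess you want to keep the Python pattern as is.

So, here is the PHP pattern that will work the same as the Python one: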

$RE_ON_DATE_SMB_WROTE = sprintf("#(-*>? ?(%s) .*(%s)(.*\\n){0,2}.*(%s):?-*)#u",
                            implode('|', array(
                                // English
                                'On',
                                // French
                                'Le',
                                // Polish
                                'W dniu',
                                // Dutch
                                'Op',
                                // German
                                'Am',
                                // Portuguese
                                'Em',
                                // Norwegian
                                "På",
                                // Swedish, Danish
                                'Den',
                                // Vietnamese
                                "Vào",
                            )),
                            implode('|',array(
                                ',',
                                "użytkownik"
                            )),
                            implode('|',array(
                                //# English
                                'wrote', 
                                'sent',
                                //# French
                                "a écrit",
                                //# Polish
                                "napisał",
                                //# Dutch
                                'schreef','verzond','geschreven',
                                //# German
                                'schrieb',
                                //# Portuguese
                                'escreveu',
                                //# Norwegian, Swedish
                                'skrev',
                                //# Vietnamese
                                "đã viết",
                            ))
                        );

join is an alias of implode; I prefer implode in this context.

Note that [ ] is the same as a plain space here, [>] is the same as >, and "\n" (a string escape sequence that puts an actual LF char into the pattern) matches the same thing as "\\n" (a regex escape sequence matching an LF char).
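These equivalences are easy to check. The sketch below uses Python, whose non-raw string literals treat "\n" and "\\n" the same way PHP double-quoted strings do; the sample text is made up.

```python
import re

text = "first line\nsecond line"  # made-up sample containing one LF

# The pattern string contains an actual LF character.
with_lf_char = re.search("line\nsecond", text)

# The pattern string contains a backslash followed by n; the regex engine
# interprets that escape as LF.
with_regex_escape = re.search("line\\nsecond", text)

newline_forms_agree = (
    with_lf_char is not None
    and with_regex_escape is not None
    and with_lf_char.group() == with_regex_escape.group()
)

# Single-character classes match exactly what the bare characters match.
classes_agree = (
    re.fullmatch("[>][ ]", "> ") is not None
    and re.fullmatch("> ", "> ") is not None
)
```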

Note that if you want to port the re.VERBOSE flag to PHP, you will need the x flag. Unescaped literal whitespace in the pattern is then ignored, so any space that must match has to be escaped or put into a character class (yes, [ ] will make sense then).
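Here is a minimal Python sketch of that whitespace rule, using 'W dniu' from the pattern above and a made-up sample text. The PCRE x flag documents the same behavior for unescaped whitespace outside character classes.

```python
import re

# In verbose mode the unescaped space is dropped, so this is really "Wdniu".
ignored = re.compile(r"W dniu", re.VERBOSE)

# Escaping the space, or wrapping it in a character class, keeps it significant.
escaped = re.compile(r"W\ dniu", re.VERBOSE)
bracketed = re.compile(r"W[ ]dniu", re.VERBOSE)

text = "W dniu 3 stycznia"  # made-up sample
results = (
    ignored.match(text) is not None,
    escaped.match(text) is not None,
    bracketed.match(text) is not None,
)
```

This is why the alternation lists with multi-word entries such as 'W dniu', 'a écrit' or 'đã viết' would need adjusting before the x flag could be added to the first pattern.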

The last two regexes do not need any special conversion. Once the whitespace that re.VERBOSE ignores is removed, they can be written as

$RE_QUOTATION = '~((?:s|(?:me*){2,}).*me*)[te]*$~';
$RE_EMPTY_QUOTATION = '~((?:(?:se*)+|(?:me*){2,}))e*~';
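As a sanity check on that collapsing step, the following Python sketch compares the original verbose patterns with the one-line forms (without the ~ delimiters, which are PHP-only). The marker strings in samples are made up for the test.

```python
import re

verbose_quotation = re.compile(r"""
    (
        (?:
            s
            |
            (?:me*){2,}
        )

        .*

        me*
    )

    [te]*$
    """, re.VERBOSE)
compact_quotation = re.compile(r"((?:s|(?:me*){2,}).*me*)[te]*$")

verbose_empty = re.compile(r"""
    (
        (?:
            (?:se*)+
            |
            (?:me*){2,}
        )
    )
    e*
    """, re.VERBOSE)
compact_empty = re.compile(r"((?:(?:se*)+|(?:me*){2,}))e*")

# Made-up marker strings built from the letters the patterns use.
samples = ["tsemmtetm", "smmte", "mmm", "ttt", "se", "memee"]


def group_spans(regex):
    """Return the span of group 1 for each sample, or None where nothing matches."""
    spans = []
    for sample in samples:
        match = regex.search(sample)
        spans.append(match.span(1) if match else None)
    return spans


quotation_agrees = group_spans(verbose_quotation) == group_spans(compact_quotation)
empty_agrees = group_spans(verbose_empty) == group_spans(compact_empty)
```

Agreement on a handful of strings is not a proof, but since the only change is deleting whitespace that verbose mode ignores anyway, the patterns should be equivalent.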
1 Comment

Thanks a lot for the regex, Wiktor. It definitely moved things forward. If you can create one for the second regex, I'll mark this as the answer. :) Oh, and I opened a new question related to this, in case you want to take a look: stackoverflow.com/questions/71373944/…
