
I'd like to parse define statements in a PHP file using a Python regex. (Or in other words: I want to use Python to parse a PHP file.)

What I'd like to parse are define statements like this:

define("My_KEY", "My_Value l");
define('My_KEY', 'My_Value');
define(   'My_KEY'  ,    "My_Value"   );

So I came up with the following Python regex:

define\(\s*["']{1}(.[^'"]*)["']{1}\s*,\s*["']{1}(.[^'"]*)["']{1}\s*\)

This works great, as long as there is no use of a " or ' inside the define statement. For example something like this will not work:

define(   'My_KEY'  ,    'My\'_\'Value'   );
define(   'My_KEY'  ,    "My'_'Value"   );
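The failure can be reproduced with a short Python sketch that runs the regex above over both sets of sample lines. The names PATTERN, good and bad are illustrative only.

```python
import re

# The regex from the question, unchanged.
PATTERN = re.compile(
    r"""define\(\s*["']{1}(.[^'"]*)["']{1}\s*,\s*["']{1}(.[^'"]*)["']{1}\s*\)"""
)

# Lines without quotes inside the strings.
good = r"""
define("My_KEY", "My_Value l");
define('My_KEY', 'My_Value');
define(   'My_KEY'  ,    "My_Value"   );
"""

# Lines with an escaped or a different-style quote inside the value.
bad = r"""
define(   'My_KEY'  ,    'My\'_\'Value'   );
define(   'My_KEY'  ,    "My'_'Value"   );
"""

good_matches = PATTERN.findall(good)
bad_matches = PATTERN.findall(bad)
print(len(good_matches), len(bad_matches))
```

The character class [^'"] stops at the first quote of either kind, so the value is cut off at the inner quote and the closing parenthesis is never reached.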

Any ideas how to approach this problem?

  • Is regex necessary for the whole task? You could use regex to find define(..), then split the string between the parens, trim it, and so on to get the values you need. Commented May 14, 2016 at 13:10
  • See stackoverflow.com/questions/1352693/… Commented May 14, 2016 at 13:27
  • @AndyG Yes, I could, but I want to learn more about how to use regex, so that's why I came up with the question. Commented May 14, 2016 at 13:52
  • @Barmar Thanks for the heads-up. Commented May 14, 2016 at 13:52
  • @manuel Fair enough (from the answers you can see why I suggested taking two stages to do this ;)) Commented May 14, 2016 at 14:09

4 Answers

1

You can use something like:

import re
# `subject` is the PHP source text to scan (a str)
result = re.findall(r"""^define\(\s*['"]*(.*?)['"]*[\s,]+['"]*(.*?)['"]*\s*\)""", subject, re.IGNORECASE | re.DOTALL | re.MULTILINE)

Regex101 Demo and Explanation


Matches:

MATCH 1
1.  [8-14]  `My_KEY`
2.  [18-28] `My_Value l`
MATCH 2
1.  [40-46] `My_KEY`
2.  [50-58] `My_Value`
MATCH 3
1.  [73-79] `My_KEY`
2.  [88-96] `My_Value`
MATCH 4
1.  [114-120]   `My_KEY`
2.  [129-141]   `My\'_\'Value`
MATCH 5
1.  [159-165]   `My_KEY`
2.  [174-184]   `My'_'Value`
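To reproduce these matches locally, the pattern can be run over the five define lines from the question. This is only a sketch: subject is a hypothetical variable holding the sample text, and the printed format is arbitrary.

```python
import re

# Hypothetical input: the five define lines from the question.
subject = r"""
define("My_KEY", "My_Value l");
define('My_KEY', 'My_Value');
define(   'My_KEY'  ,    "My_Value"   );
define(   'My_KEY'  ,    'My\'_\'Value'   );
define(   'My_KEY'  ,    "My'_'Value"   );
"""

pairs = re.findall(
    r"""^define\(\s*['"]*(.*?)['"]*[\s,]+['"]*(.*?)['"]*\s*\)""",
    subject,
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)

# Each element is a (key, value) tuple; escapes are kept as written in the PHP source.
for key, value in pairs:
    print(key, "=>", value)
```

The quotes are matched by the optional ['"]* pieces, and each lazy (.*?) grows one character at a time until the rest of the pattern fits, so an inner quote does not end the value unless a closing parenthesis follows it. The same laziness means a key containing whitespace or a comma would be cut at that point, so this approach seems best suited to simple keys.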
2 Comments

This looks great, but how does it work? Why doesn't it stop when it reaches a " or '?
I don't know, where do you mention or in your question?
1

Use lookarounds with this monster regex (split over two lines here for readability; join the lines or compile it with re.VERBOSE):

define\(\s*(["'])(?P<key>.+?(?=\1))\1\s*,
\s*(["'])(?P<value>.+?)(?=\3)(?<!\\)\3

See a demo on regex101.com.
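A minimal sketch of how this could be used from Python, assuming the two lines are meant as a single pattern (compiled here with re.VERBOSE so the line break is ignored). The sample strings and variable names are illustrative.

```python
import re

PATTERN = re.compile(
    r"""
    define\(\s*(["'])(?P<key>.+?(?=\1))\1\s*,
    \s*(["'])(?P<value>.+?)(?=\3)(?<!\\)\3
    """,
    re.VERBOSE,
)

samples = [
    r"""define(   'My_KEY'  ,    'My\'_\'Value'   );""",
    r"""define(   'My_KEY'  ,    "My'_'Value"   );""",
]

results = []
for line in samples:
    m = PATTERN.search(line)
    results.append((m.group("key"), m.group("value")))
    print(m.group("key"), "=>", m.group("value"))

# A value that ends in an escaped backslash is not matched: the closing
# quote is preceded by a backslash, so the lookbehind rejects it.
trailing = PATTERN.search(r"""define('PATH', 'C:\\temp\\');""")
print(trailing)
```

The lookbehind (?<!\\) only checks the single character before the closing quote, which is why the last sample prints None.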

0

In Python:

import re
# In a regular (non-raw) string literal, \' is just ', so the text holds no backslashes
text = "define(   'My_KEY'  ,    'My\'_\'Value'   )"
re.sub(r"""^define\(\s*['"]*(.*?)['"]*[\s,]+['"]*(.*?)['"]*\s*\)""", r'\2 ; \1', text)

Output:

"My'_'Value ; My_KEY"

0

Description

This regex will do the following:

  • match all lines that start with define and have a key and value set inside parentheses
  • capture the key and value strings, without including the wrapping quotes
  • allow the key and value to be wrapped in either single or double quotes
  • correctly handle escaped quotes
  • avoid difficult edge cases like:
    • define( 'file path', "C:\\windows\\temp\\" ); where an escaped backslash may precede the closing quote

The Regex

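Note: use the following flags: case-insensitive, global, multiline. In Python these correspond to re.IGNORECASE and re.MULTILINE; there is no global flag, since re.findall and re.finditer already return every match.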

^define\(\s*(['"])((?:\\\1|(?:(?!\1).))*)\1\s*,\s*(['"])((?:\\\3|(?:(?!\3).))*)\3\s*\);

Regular expression visualization

Capture groups

  • capture group 0 gets the entire match
  • capture group 1 gets the quote type surrounding the key
  • capture group 2 gets the key string inside the quotes
  • capture group 3 gets the quote type surrounding the value
  • capture group 4 gets the value string inside the quotes
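As a rough Python sketch, the capture groups listed above can be read out with re.finditer. The variable names and the dictionary are illustrative, and the captured strings are left exactly as written in the PHP source, escapes included.

```python
import re

DEFINE_RE = re.compile(
    r"""^define\(\s*(['"])((?:\\\1|(?:(?!\1).))*)\1\s*,\s*(['"])((?:\\\3|(?:(?!\3).))*)\3\s*\);""",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical input: the sample define lines used in this answer.
php_source = r"""
define("0 My_KEY", "0 My_Value l");
define('1 My_KEY', '1 My_Value');
define(   '2 My_KEY'  ,    "2 My_Value"   );
define(   '3 My_KEY\\'  ,    '3 My\'_\'Value'   );
define(   '4 My_KEY'  ,    "4 My'_'Value\\"   );
"""

# Group 2 is the key and group 4 is the value; groups 1 and 3 hold the quote characters.
constants = {m.group(2): m.group(4) for m in DEFINE_RE.finditer(php_source)}

for key, value in constants.items():
    print(repr(key), "=>", repr(value))
```

Collecting into a dict keeps only the last value if a key is defined twice; a list of tuples would preserve duplicates if that matters.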

Examples

Live Demo

https://regex101.com/r/oP4sV0/1

Sample Text

define("0 My_KEY", "0 My_Value l");
define('1 My_KEY', '1 My_Value');
define(   '2 My_KEY'  ,    "2 My_Value"   );
define(   '3 My_KEY\\'  ,    '3 My\'_\'Value'   );
define(   '4 My_KEY'  ,    "4 My'_'Value\\"   );

Sample Matches

[0][0] = define("0 My_KEY", "0 My_Value l");
[0][1] = "
[0][2] = 0 My_KEY
[0][3] = "
[0][4] = 0 My_Value l

[1][0] = define('1 My_KEY', '1 My_Value');
[1][1] = '
[1][2] = 1 My_KEY
[1][3] = '
[1][4] = 1 My_Value

[2][0] = define(   '2 My_KEY'  ,    "2 My_Value"   );
[2][1] = '
[2][2] = 2 My_KEY
[2][3] = "
[2][4] = 2 My_Value

[3][0] = define(   '3 My_KEY\\'  ,    '3 My\'_\'Value'   );
[3][1] = '
[3][2] = 3 My_KEY\\
[3][3] = '
[3][4] = 3 My\'_\'Value

[4][0] = define(   '4 My_KEY'  ,    "4 My'_'Value\\"   );
[4][1] = '
[4][2] = 4 My_KEY
[4][3] = "
[4][4] = 4 My'_'Value\\

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of a "line"
----------------------------------------------------------------------
  define                   'define'
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      \1                       what was matched by capture \1
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          \1                       what was matched by capture \1
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character except \n
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \1                       what was matched by capture \1
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  ,                        ','
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      \3                       what was matched by capture \3
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          \3                       what was matched by capture \3
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character except \n
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  \3                       what was matched by capture \3
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
  ;                        ';'
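
The node-by-node breakdown above maps naturally onto Python's re.VERBOSE mode, where the same pattern can carry its explanation as inline comments. This is only a sketch of that idea; the name VERBOSE_DEFINE_RE and the sample line are illustrative.

```python
import re

VERBOSE_DEFINE_RE = re.compile(
    r"""
    ^define\(\s*
    (['"])                     # group 1: quote character that opens the key
    ((?:\\\1|(?:(?!\1).))*)    # group 2: key; an escaped quote, or any char that is not the quote
    \1\s*,\s*                  # closing quote of the key, then the comma
    (['"])                     # group 3: quote character that opens the value
    ((?:\\\3|(?:(?!\3).))*)    # group 4: value; same idea as group 2
    \3\s*\);                   # closing quote, closing parenthesis, semicolon
    """,
    re.VERBOSE | re.IGNORECASE | re.MULTILINE,
)

line = r"""define(   '4 My_KEY'  ,    "4 My'_'Value\\"   );"""
m = VERBOSE_DEFINE_RE.search(line)
print(m.group(2), "=>", m.group(4))
```

In verbose mode whitespace outside character classes is ignored and # starts a comment, so the pattern itself is unchanged.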
