Convert Ruby regular expression definition to Python regex

Question

I have the following regexes defined for capturing gem names in a Gemfile.

GEM_NAME = /[a-zA-Z0-9\-_\.]+/

QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/

I want to convert these into a regex that can be used in Python and other languages.

I tried (?:(["'])([a-zA-Z0-9\-_\.]+)\k["']|%q<([a-zA-Z0-9\-_\.]+)>), which I built by substituting GEM_NAME into the pattern, along with several similar combinations, but none of them worked. Here is the regexr link: http://regexr.com/3g527

Can someone please explain the correct process for converting these Ruby regular expression definitions into a form that Python can use?

Python re can't handle identically named groups in one regex. Also, to define a named group, you need to use (?P<name>...).
– Wiktor Stribiżew, Commented Jun 11, 2017 at 14:18

Wiktor Stribiżew · Accepted Answer · 2017-06-11 14:56:42Z

To define a named group in Python, use (?P<name>...), and refer back to it inside the pattern with the (?P=name) named backreference (in a replacement string, the named backreference is \g<name>).

If you can afford a third-party library, you may use the PyPI regex module and keep the approach you had in Ruby, since regex supports multiple identically named capturing groups:

import regex  # third-party module: pip install regex

s = """%q<Some-name1> "some-name2" 'some-name3'"""

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# Both alternatives use the same group name, which regex allows
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => (?:(?P<gq>["\'])(?P<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)

res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']

See this Python demo.

If you decide to go with Python re, it can't handle identically named groups in one regex pattern.
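To see that limitation concretely, here is a minimal sketch that compiles a direct translation of the Ruby pattern with the standard re module. The variable names DUPLICATE_NAMES and compiled are made up for this illustration; the point is only that re raises an error when a group name is defined twice.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# Same shape as the Ruby pattern: the group name "name" appears in both alternatives.
DUPLICATE_NAMES = r'(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>'.format(GEM_NAME)

try:
    re.compile(DUPLICATE_NAMES)
    compiled = True
except re.error as exc:
    # The standard library refuses to redefine an existing group name.
    compiled = False
    print("re rejected the pattern:", exc)
```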

You can discard the named groups altogether, use numbered ones instead, and iterate over all the matches with re.finditer, using a list comprehension to grab the right capture.

Example Python code:

import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']

So, ([\"'])({0})\1|%q<({0})> has three capturing groups. If Group 1 matched, the first alternative matched and the comprehension takes Group 2; otherwise the second alternative matched and it takes the Group 3 value.

Pattern details

([\"']) - Group 1: a " or '
({0}) - Group 2: GEM_NAME pattern
\1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
| - or
%q< - a literal substring
({0}) - Group 3: GEM_NAME pattern
> - a literal >.
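If you would rather keep names than count group numbers, one possible variant (a sketch, not part of the approach above) gives each alternative its own group name so that re accepts the pattern. The group names quoted and percent and the helper extract_gem_names are hypothetical names chosen for this example.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# "quoted" and "percent" are hypothetical, distinct group names (re forbids duplicates).
QUOTED_GEM_NAME = r'(?P<gq>["\'])(?P<quoted>{0})(?P=gq)|%q<(?P<percent>{0})>'.format(GEM_NAME)

def extract_gem_names(text):
    """Return the gem name from whichever alternative matched, for every match."""
    # An unmatched group yields None, so "or" falls through to the other group.
    return [m.group('quoted') or m.group('percent')
            for m in re.finditer(QUOTED_GEM_NAME, text)]

s = """%q<Some-name1> "some-name2" 'some-name3'"""
names = extract_gem_names(s)
print(names)
# => ['Some-name1', 'some-name2', 'some-name3']
```

The `or` fallback is safe here because GEM_NAME requires at least one character, so a matched name is never an empty string.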

Community · Accepted Answer · 2020-06-20 09:12:55Z

You can rewrite your pattern like this:

GEM_NAME = r'[a-zA-Z0-9_.-]+'

QUOTED_GEM_NAME = r'''(?x) # verbose mode for the whole pattern; must come first (Python 3.11+ rejects it elsewhere)
    ["'%] # first possible character
    (?:(?<=%)q<)? # if preceded by a % match "q<"
    (?P<name> # the three possibilities excluding the delimiters
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>] # closing delimiter
'''.format(GEM_NAME)

demo

Advantages:

The pattern doesn't start with an alternation, which makes a search slow. Here the alternation is only tested at interesting positions, after a quote or a %, whereas your version tests each branch of the alternation at every position in the string. This optimisation technique is called "first character discrimination" and consists of quickly discarding useless positions in the string.
You need only one occurrence of the capture group (quotes and angle brackets are excluded from it and only tested with lookarounds). This way you can use re.findall to get a list of gems without further manipulation.
The gq group wasn't useful and was removed (shortening a pattern at the cost of creating a useless capture group isn't a good idea).

Note that you don't need to escape the dot inside a character class.
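As a rough usage sketch of the re.findall point, the pattern can be compiled and applied as below. This version passes re.VERBOSE to re.compile instead of using the inline (?x) flag, which is a substitution made for this example; the sample string s and the name PATTERN are also assumptions.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'

# re.VERBOSE replaces the inline (?x) flag, so whitespace and comments are ignored.
PATTERN = re.compile(r'''
    ["'%]                 # first possible character
    (?:(?<=%)q<)?         # after a %, also consume "q<"
    (?P<name>             # the name, checked against its delimiters by lookarounds
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>]                 # closing delimiter
'''.format(GEM_NAME), re.VERBOSE)

s = """%q<Some-name1> "some-name2" 'some-name3'"""
# With a single capture group, findall returns that group's text directly.
gems = PATTERN.findall(s)
print(gems)
# => ['Some-name1', 'some-name2', 'some-name3']
```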

user557597 · Accepted Answer · 2017-06-11 18:11:10Z

A simple way is to use a conditional and consolidate the name.

(?:(?:(["'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))

Expanded

 (?:
      (?:                           # Delimiters
           ( ["'] )                      # (1), ' or "
        |                              # or,
           %q<                           # %q<
      )
      (?P<name> [a-zA-Z0-9\-_\.]+ ) # (2), Name
      (?(1) \1 | > )                # Did group 1 match ? match it here, else >
 )

Python

import re

s = ' "asdf"  %q<asdfasdf>  '

print(re.findall(r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))', s))

Output

[('"', 'asdf'), ('', 'asdfasdf')]
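Because the pattern has two capturing groups, re.findall returns tuples. If only the names are wanted, one option (a small sketch, with PATTERN and names as illustrative variable names) is to iterate with re.finditer and read the name group from each match:

```python
import re

# Same conditional pattern as above: (?(1)\1|>) closes with the quote if group 1 matched, else with ">".
PATTERN = re.compile(r'(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>)')

s = ' "asdf"  %q<asdfasdf>  '

# Pull just the named group out of every match.
names = [m.group('name') for m in PATTERN.finditer(s)]
print(names)
# => ['asdf', 'asdfasdf']
```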
