
I have the following regexes defined for capturing the gem names in a Gemfile.

GEM_NAME = /[a-zA-Z0-9\-_\.]+/

QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/

I want to convert these into a regex that can be used in Python and other languages.

I tried (?:(["'])([a-zA-Z0-9\-_\.]+)\k["']|%q<([a-zA-Z0-9\-_\.]+)>) based on substitution, along with several similar combinations, but none of them worked. Here's the regexr link: http://regexr.com/3g527

Can someone please explain the correct process for converting these Ruby regular expression definitions into a form that can be used by Python?

  • Python re can't handle identically named groups in 1 regex. Also, to define a named group, you need to use (?P<name>). Commented Jun 11, 2017 at 14:18

3 Answers

To define a named group in Python, you need to use (?P<name>...), and then (?P=name) for a named backreference inside the pattern. If you can afford a 3rd party library, you may use the PyPI regex module and keep the approach you had in Ruby (as regex supports multiple identically named capturing groups):

import regex  # third-party module: pip install regex

s = """%q<Some-name1> "some-name2" 'some-name3'"""

GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => (?:(?P<gq>["\'])(?P<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)

res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']


If you decide to go with Python re, it can't handle identically named groups in one regex pattern.
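To see that limitation for yourself, here is a minimal sketch. The pattern below is a made-up two-branch example, not the Gemfile regex, and it only checks that compilation fails:

```python
import re

# Hypothetical pattern that reuses the group name "name" in both branches.
duplicate_names = r'(?P<name>a)|(?P<name>b)'

try:
    re.compile(duplicate_names)
    compiled = True
except re.error as exc:
    # The standard library rejects the second definition of "name".
    compiled = False
    print("re refused the pattern:", exc)
```

The same pattern is accepted by the third-party regex module, which is why the Ruby-style approach above works there.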

You can discard the named groups altogether and use numbered ones, then use re.finditer to iterate over all the matches with a list comprehension that grabs the right capture.

Example Python code:

import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']

So, ([\"'])({0})\1|%q<({0})> has 3 capturing groups: if Group 1 matched, the first alternative matched, so Group 2 is taken; otherwise, the second alternative matched, and the comprehension grabs the Group 3 value.

Pattern details

  • ([\"']) - Group 1: a " or '
  • ({0}) - Group 2: GEM_NAME pattern
  • \1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
  • | - or
  • %q< - a literal substring
  • ({0}) - Group 3: GEM_NAME pattern
  • > - a literal >.
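If you would rather keep names while staying with the standard re module, one option is to give each alternative its own group name. This is only a sketch: the group names name_quoted and name_pq and the helper gem_names are made up for this illustration.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'

# Each alternative gets a distinct name, so re accepts the pattern.
# (?P=gq) is the named backreference to the opening quote.
QUOTED_GEM_NAME = re.compile(
    r'(?P<gq>["\'])(?P<name_quoted>{0})(?P=gq)|%q<(?P<name_pq>{0})>'.format(GEM_NAME)
)

def gem_names(text):
    """Return the gem name from whichever alternative matched."""
    return [m.group('name_quoted') or m.group('name_pq')
            for m in QUOTED_GEM_NAME.finditer(text)]

s = """%q<Some-name1> "some-name2" 'some-name3'"""
print(gem_names(s))
# => ['Some-name1', 'some-name2', 'some-name3']
```

The group that did not take part in a match is None, so the `or` picks the one that did.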

You can rewrite your pattern like this:

GEM_NAME = r'[a-zA-Z0-9_.-]+'

QUOTED_GEM_NAME = r'''(?x) # switch verbose mode on for the whole pattern (Python 3.11+ requires global inline flags at the start)
    ["'%] # first possible character
    (?:(?<=%)q<)? # if preceded by a % match "q<"
    (?P<name> # the three possibilities excluding the delimiters
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>] #'"# closing delimiter
'''.format(GEM_NAME)


Advantages:

  • the pattern doesn't start with an alternation, which makes the search slow. (The alternation here is only tested at interesting positions, after a quote or a %, whereas your version tests each branch of the alternation at every position in the string.) This optimisation technique is called "first character discrimination" and consists of quickly discarding useless positions in a string.
  • you need only one capture group occurrence (quotes and angle brackets are excluded from it and only tested with lookarounds). This way you can use re.findall to get a list of gems without further manipulation.
  • the gq group wasn't useful and was removed (shortening a pattern at the cost of creating a useless capture group isn't a good idea)

Note that you don't need to escape the dot inside a character class.
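As a quick check of the re.findall claim, here is a minimal sketch of the pattern in use. It passes re.VERBOSE as a flag instead of an inline (?x), which is my own choice for this example, and the sample string is borrowed from the other answer:

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'

# Same structure as the pattern above; verbose mode is enabled via the
# re.VERBOSE flag rather than an inline (?x).
QUOTED_GEM_NAME = r'''
    ["'%]                 # first possible character
    (?:(?<=%)q<)?         # after a %, consume the q< part
    (?P<name>             # the name, checked against its delimiters
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>]                 # closing delimiter
'''.format(GEM_NAME)

s = """%q<Some-name1> "some-name2" 'some-name3'"""
names = re.findall(QUOTED_GEM_NAME, s, re.VERBOSE)
print(names)
# => ['Some-name1', 'some-name2', 'some-name3']
```

Because the pattern has a single capture group, re.findall returns plain strings rather than tuples.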


A simple way is to use a conditional and consolidate the name.

(?:(?:(["'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))

Expanded

 (?:
      (?:                           # Delimiters
           ( ["'] )                      # (1), ' or "
        |                              # or,
           %q<                           # %q
      )
      (?P<name> [a-zA-Z0-9\-_\.]+ ) # (2), Name
      (?(1) \1 | > )                # Did group 1 match ? match it here, else >
 )

Python

import re

s = ' "asdf"  %q<asdfasdf>  '

print(re.findall(r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))', s))

Output

[('"', 'asdf'), ('', 'asdfasdf')]
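The tuples appear because re.findall returns every capture group, including the quote in group 1. If you only want the names, a small sketch with re.finditer and the named group could look like this (the variable names are mine, and the character class is written without the unnecessary escapes):

```python
import re

# Conditional pattern: (?(1)\1|>) closes with the same quote if group 1
# matched, otherwise with ">".
PATTERN = re.compile(r'(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9_.-]+)(?(1)\1|>)')

s = ' "asdf"  %q<asdfasdf>  '
names = [m.group('name') for m in PATTERN.finditer(s)]
print(names)
# => ['asdf', 'asdfasdf']
```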
