How are nested groups addressed in RegEx?

Question

Suppose I have a text like this:

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ
EMC

It is part of a PDF file. The line

[(\()-2(Y)7(o)7(u've )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ

contains the text "(You've got it)". So I first need to match text lines

^\[(.*)\]TJ$

Having the capture group of that, I can apply \(((.*?)\)[-0-9]*) and replace all matches by \2.
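
A minimal sketch of that two-step approach with Python's re module is shown here. The variable names are illustrative, and the sample line has a space added after "t" so that the words separate. Groups are numbered by the position of their opening parenthesis, which is why the inner text is \2 and not \1.

```python
import re

# Sample TJ line (assumption: a space was added after "t" so the words separate).
line = r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ"

# Step 1: capture everything between the (escaped) square brackets.
outer = re.match(r"^\[(.*)\]TJ$", line)
body = outer.group(1)

# Step 2: group 1 is the outer "text)kerning" chunk, group 2 the text inside it,
# because groups are counted by their opening parenthesis from left to right.
chunk = re.compile(r"\(((.*?)\)[-0-9]*)")
text = chunk.sub(r"\2", body)
print(text)

# Extra step, not part of the two steps above: drop the PDF backslash escapes.
unescaped = re.sub(r"\\(.)", r"\1", text)
print(unescaped)
```

Note that the lazy `.*?` stops at the first closing parenthesis it sees, escaped or not, so the final `(\))` chunk only comes out right because the leftover `)` is left untouched by the substitution.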

Is it possible to do this in one step?

Not possible with re in Python. It is possible with the regex package, but you don't want to do it unless you have no choice but to use a single regex. I'm not sure if there is any exotic feature in regex that would help, though.
– nhahtdh, Commented Jul 20, 2017 at 14:43
@nhahtdh: the regex module has all the features of your most crazy dreams.
– Casimir et Hippolyte, Commented Jul 20, 2017 at 14:46
@nhahtdh I see. Could you please post a link to the documentation of the regex module?
– Martin Thoma, Commented Jul 20, 2017 at 14:51

pylang · Accepted Answer · 2021-02-20 00:02:22Z

Using regular expressions to parse nested groups can be difficult, illegible or impossible to achieve.

One approach for addressing nested groups is to use a parsing grammar. Here is a 3-step example using the parsimonious library by Erik Rose.

Given

import itertools as it

import parsimonious as pars


# Raw string so that "\(" and "\)" are kept verbatim (no invalid-escape warning).
source = r"""/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ
EMC"""

Code

Define a Grammar

rules = r"""

    root            = line line message end

    line            = ANY NEWLINE
    message         = _ TEXT (_ TEXT*)* NEWLINE
    end             = "EMC" NEWLINE*

    TEXT            = ~r"[a-zA-Z ]+" 
    NEWLINE         = ~r"\n"
    ANY             = ~r"[^\n\r]*"

    _               = meaninglessness*
    meaninglessness = ~r"(TJ)*[^a-zA-Z\n\r]*"    

    """

Parse source text and Build an AST

grammar = pars.grammar.Grammar(rules)
tree = grammar.parse(source)
# print(tree)

Resolve the AST


class Translator(pars.NodeVisitor):
    
    def visit_root(self, node, children):
        return children

    def visit_line(self, node, children):
        return node.text
    
    def visit_message(self, node, children):
        _, s, remaining, nl = children
        return (s + "".join(it.chain.from_iterable(i[1] for i in remaining)) + nl)
        
    def visit_end(self, node, children):
        return node.text
    
    def visit_meaninglessness(self, node, children):
        return children
    
    def visit__(self, node, children):
        return children[0]
    
    def visit_(self, node, children):
        return children
    
    def visit_TEXT(self, node, children):
        return node.text
    
    def visit_NEWLINE(self, node, children):
        return node.text
    
    def visit_ANY(self, node, children):
        return node.text

Demo

tr = Translator().visit(tree)
print("".join(tr))

Output

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
You got it
EMC

Details

Instead of a rigid (and sometimes illegible) regular expression, we define a set of regex/EBNF-like grammar rules (see the parsimonious docs for details). Once a grammar is defined, it can be much easier to adjust if required.

Note: the original text was modified, adding a space to 2(t) (line 3) as it was believed to be missing from the OP.

The parsing step is simple: parse the source text based on the grammar. If the grammar is sufficiently well defined, an AST is created with nodes that reflect the structure of your source. Having an AST is key, as it makes this approach flexible.
Define what to do when each node is visited. One can resolve an AST using any desired technique. As an example, here we demonstrate the Visitor Pattern by subclassing NodeVisitor from parsimonious.

Now for new or unexpected texts encountered in your PDFs, simply modify the grammar and parse again.
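
For comparison, the same "parse instead of match" idea can be sketched without any third-party library. The hand-written scanner below is not parsimonious and is not the grammar above; the function name tj_strings is hypothetical. It assumes that PDF literal strings may contain balanced, unescaped parentheses, and it simplifies escapes by keeping the escaped character as-is (so sequences such as \n or octal codes are not decoded).

```python
def tj_strings(array_body):
    """Yield the contents of each (...) string in the body of a TJ array."""
    i, n = 0, len(array_body)
    while i < n:
        if array_body[i] != "(":
            i += 1                      # skip kerning numbers and whitespace
            continue
        i += 1                          # step past the opening parenthesis
        depth, chars = 1, []
        while i < n:
            c = array_body[i]
            if c == "\\" and i + 1 < n:
                chars.append(array_body[i + 1])   # keep the escaped character
                i += 2
                continue
            if c == "(":
                depth += 1              # nested, unescaped parenthesis
            elif c == ")":
                depth -= 1
                if depth == 0:
                    break               # end of this string
            chars.append(c)
            i += 1
        i += 1                          # step past the closing parenthesis
        yield "".join(chars)


body = r"(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))"
print("".join(tj_strings(body)))
```

Tracking the nesting depth explicitly is the part a plain regular expression struggles with, which is why a scanner or a grammar tends to stay readable as the input gets more varied.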

Casimir et Hippolyte · Accepted Answer · 2017-07-20 15:23:40Z

With the regex module you can use this pattern:

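import regex  # third-party package (pip install regex); the stdlib re module has no \G anchor

pat = r'(?:\G(?!\A)\)|\[(?=[^]]*]))[^](]*\(([^)\\]*(?:\\.[^)\\]*)*)(?:\)[^(]*]TJ)?'
result = regex.sub(pat, r'\1', s)  # s: the text of the PDF content stream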

pattern details:

(?: # two possible starts
    \G     # contiguous to a previous match
    (?!\A) # not at the start of the string
    \)     # a literal closing round bracket
  | # OR
    \[          # an opening square bracket
     (?=[^]]*]) # followed by a closing square bracket
)
[^](]* # all that isn't a closing square bracket or an opening round bracket
\(     # a literal opening round bracket
(      # capture group 1
    [^)\\]* # all characters except a closing round bracket or a backslash
    (?:\\.[^)\\]*)* # to handle possible escaped characters
)
(?: \) [^(]* ] TJ )? # optional end of the square bracket part
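
The capture-group part of this pattern can be tried in isolation with the standard re module, since it does not depend on \G. The following is only a sketch: the name string_literal and the sample body are illustrative, and it extracts the pieces with findall; it is not the one-step substitution shown above.

```python
import re

# Group 1 of the pattern above, wrapped in literal parentheses:
# runs of ordinary characters interleaved with backslash-escaped characters.
string_literal = re.compile(r"\(([^)\\]*(?:\\.[^)\\]*)*)\)")

# Body of a TJ array (assumption: a space was added after "t").
body = r"(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))"

pieces = string_literal.findall(body)
print(pieces)
print("".join(pieces))
```

Unlike a lazy `.*?`, this sub-pattern consumes `\)` as an escaped character, so the last piece is the escaped closing parenthesis itself and no stray `)` is left behind.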
