
Suppose I have a text like this:

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ
EMC 

It is part of a PDF file. The line

[(\()-2(Y)7(o)7(u've )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ

contains the text "(You've got it)". So I first need to match text lines

^\[(.*)\]TJ$

Having the capture group of that, I can apply \(((.*?)\)[-0-9]*) and replace all matches by \2.

Is it possible to do this in one step?
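
For reference, the two-step approach described above can be sketched with the standard re module roughly as follows. This is only a minimal sketch, not a PDF parser: the names TJ_LINE, STRING, join_strings and extract_text are made up for illustration, and only single-character backslash escapes such as \( and \) are unescaped (octal escapes and hex strings are ignored).

```python
import re

# Step 1: a whole line of the form [ ... ]TJ (square brackets escaped).
TJ_LINE = re.compile(r"^\[(.*)\]TJ$", re.MULTILINE)
# Step 2: one parenthesised string, allowing backslash-escaped characters.
STRING = re.compile(r"\(((?:\\.|[^\\)])*)\)")


def join_strings(match):
    """Concatenate every (...) string found inside one TJ array."""
    pieces = STRING.findall(match.group(1))
    # Drop the escaping backslash: \( becomes (, \) becomes ).
    return "".join(re.sub(r"\\(.)", r"\1", piece) for piece in pieces)


def extract_text(content):
    """Replace each [ ... ]TJ line with the text it contains."""
    return TJ_LINE.sub(join_strings, content)


sample = r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ"
print(extract_text(sample))  # (You got it)
```

Passing a function to sub keeps both steps inside a single call, although two patterns are still involved.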

  • Not possible with re in Python. Possible with the regex package, but you don't want to do it unless you have no choice but to use a single regex. I'm not sure if there is any exotic feature in regex that would help, though. Commented Jul 20, 2017 at 14:43
  • @nhahtdh: the regex module has all the features of your most crazy dreams. Commented Jul 20, 2017 at 14:46
  • @nhahtdh I see. Could you please post a link to the documentation of the regex module? Commented Jul 20, 2017 at 14:51
  • pypi.python.org/pypi/regex Commented Jul 20, 2017 at 14:51
  • So it is unusual to capture the matches of nested groups? Commented Jul 20, 2017 at 14:52

2 Answers


Using regular expressions to parse nested groups can be difficult, illegible or impossible to achieve.

One approach for addressing nested groups is to use a parsing grammar. Here is a 3-step example using the parsimonious library by Erik Rose.

Given

import itertools as it

import parsimonious as pars


source  = """\
/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ
EMC"""

Code

  1. Define a Grammar
rules = r"""

    root            = line line message end

    line            = ANY NEWLINE
    message         = _ TEXT (_ TEXT*)* NEWLINE
    end             = "EMC" NEWLINE*

    TEXT            = ~r"[a-zA-Z ]+" 
    NEWLINE         = ~r"\n"
    ANY             = ~r"[^\n\r]*"

    _               = meaninglessness*
    meaninglessness = ~r"(TJ)*[^a-zA-Z\n\r]*"    

    """
  2. Parse the source text and build an AST
grammar = pars.grammar.Grammar(rules)
tree = grammar.parse(source)
# print(tree)
  3. Resolve the AST

class Translator(pars.NodeVisitor):
    
    def visit_root(self, node, children):
        return children

    def visit_line(self, node, children):
        return node.text
    
    def visit_message(self, node, children):
        _, s, remaining, nl = children
        return (s + "".join(it.chain.from_iterable(i[1] for i in remaining)) + nl)
        
    def visit_end(self, node, children):
        return node.text
    
    def visit_meaninglessness(self, node, children):
        return children
    
    def visit__(self, node, children):
        return children[0]
    
    def visit_(self, node, children):
        return children
    
    def visit_TEXT(self, node, children):
        return node.text
    
    def visit_NEWLINE(self, node, children):
        return node.text
    
    def visit_ANY(self, node, children):
        return node.text

Demo

tr = Translator().visit(tree)
print("".join(tr))

Output

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
You got it
EMC

Details

  1. Instead of a rigid (and sometimes illegible) regular expression, we define a set of regex/EBNF-like grammar rules (see the parsimonious docs for details). Once a grammar is defined, it can be much easier to adjust if required.
  • Note: the original text was modified by adding a space to 2(t) (line 3), as it was believed to be missing from the OP.
  2. The parsing step is simple: parse the source text based on the grammar. If the grammar is sufficiently well defined, an AST is created whose nodes reflect the structure of your source. Having an AST is key, as it makes this approach flexible.
  3. Define what to do when each node is visited. One can resolve an AST using any desired technique. As an example, here we demonstrate the Visitor Pattern by subclassing NodeVisitor from parsimonious.

Now for new or unexpected texts encountered in your PDFs, simply modify the grammar and parse again.

With the regex module you can use this pattern:

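import regex  # third-party module (pip install regex), not the built-in re

pat = r'(?:\G(?!\A)\)|\[(?=[^]]*]))[^](]*\(([^)\\]*(?:\\.[^)\\]*)*)(?:\)[^(]*]TJ)?'
regex.sub(pat, r'\1', s)  # s holds the text to process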

demo

pattern details:

(?: # two possible starts
    \G     # contiguous to a previous match
    (?!\A) # not at the start of the string
    \)     # a literal closing round bracket
  | # OR
    \[          # an opening square bracket
     (?=[^]]*]) # followed by a closing square bracket
)
[^](]* # all that isn't a closing square bracket or an opening round bracket
\(     # a literal opening round bracket
(      # capture group 1
    [^)\\]* # all characters except a closing round bracket or a backslash
    (?:\\.[^)\\]*)* # to deal with possible escaped characters
)
(?: \) [^(]* ] TJ )? # optional end of the square bracket part
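
The sub-pattern inside capture group 1 does not depend on the regex module, so it can be tried in isolation with the built-in re. The sketch below is only an illustration of that piece (the names ESCAPED_STRING, NAIVE_STRING and operand are made up here). It compares the escape-aware sub-pattern with a naive lazy match, which stops at the first closing bracket even when that bracket is escaped.

```python
import re

# Escape-aware: non-special characters, then any number of
# "backslash + any character" pairs, each followed by more non-special characters.
ESCAPED_STRING = re.compile(r"\(([^)\\]*(?:\\.[^)\\]*)*)\)")
# Naive: lazily take everything up to the first closing bracket.
NAIVE_STRING = re.compile(r"\((.*?)\)")

operand = r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ"

aware = ESCAPED_STRING.findall(operand)
naive = NAIVE_STRING.findall(operand)

print(aware[-1])  # \)  (the escaped bracket is kept inside the string)
print(naive[-1])  # \   (the match ended at the escaped bracket)
```

Both patterns agree on the other strings in this operand; they differ only on the final (\)) element.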
