2

I have the following text:

string = "<i>R</i> subspace  <i>{V.</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub></i>, "

A careful reader might notice that there are two brackets missing. I was wondering, how this could be fixed using Python?

The expected output would be:

the <i>R</i>  subspace  <i>{V.}</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub>}</i>,

One could:

  1. Check: Is there a bracket after <i> ?
  2. If yes -> Is there a bracket before </i> ?

How can I code this?

Edit

I have found this code, that can tell you if the brackets match or not.

8
  • What goes inside the brackets? Can "</i>" appear inside brackets (i.e. does "<i>{v<i></i></i>" turn into "<i>{v<i>}</i></i>" or "<i>{v<i></i>}</i>")? Commented Apr 11, 2019 at 21:16
  • @CraigMeier Good point. The second case would be the correct one. Commented Apr 11, 2019 at 21:19
  • @CraigMeier Would you know how to already code the simpler case (first case) ? Commented Apr 11, 2019 at 21:27
  • The first case can be handled by a regex (e.g. re.sub with a callable repl param that ensures there's a final '}' for any chunk with an initial '{'). But the second case falls under stackoverflow.com/a/1732454 Commented Apr 11, 2019 at 21:34
  • @CraigMeier Thanks for your input, but I don't know how to code it with repl. Sorry to bother you... I see that I can do: re.sub(r'<I>, SOMETHING, string) Commented Apr 11, 2019 at 21:45

1 Answer 1

2

How about the following regex solution:

import re

string = "<i>R</i> subspace  <i>{V.</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub></i>, "
expected = "<i>R</i> subspace  <i>{V.}</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub>}</i>, "

fixed = re.sub(r"<(?P<tag>.*?)>({.*?)</(?P=tag)>", r"<\1>\2}</\1>", string)

print(fixed == expected) # True

The idea is to match a tag followed by a brace, find its closing tag, and plop the companion brace before the closing tag using the capture groups as <\1>\2}</\1>. Breakdown:

< # literal opening bracket
 (?P<tag> # open a named capture group
         .*? # lazily match any characters
            ) # end named capture group
             > # literal closing bracket
              ( # open capture group 2
               { # literal opening brace
                .*? # lazily match any characters
                   ) # end capture group 2
                    < # literal opening bracket
                     / # literal slash
                      (?P=tag) # backreference to the named group
                              > # literal closing bracket

If you just want <i>, you can use re.sub(r"<i>({.*?)</i>", r"<i>\1}</i>", string).

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you so much for this perfect answer !!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.