Regex replace string which is before or after two different string

Question

I have this string (html):

html = 'x<sub>i</sub> - y<sub>i)<sub>2</sub>'

I would like to convert this html string to latex in a robust way. Let me explain:

<sub>SOMETHING</sub> -> converted to _{SOMETHING}

I already know how to do that:

latex = re.sub(r'<sub>(.*?)</sub>',r'_{\1} ', html)

Sometimes the opening <sub> tag or its closing </sub> tag is missing, as in the example string. In that case, the output should still be correct.

My idea was: after running the substitution above, replace anything after a leftover <sub> and anything before a leftover </sub> with _{SOMETHING}:

import re

# Step 1: convert properly paired tags
text = re.sub(r'<sub>(.*?)</sub>', r'_{\1} ', html)
print(text)
# Step 2: handle a leftover opening or closing tag
text = re.sub(r'<sub>(.*?)', r'_{\1} ', text)
print(text)
latex = re.sub(r'(.*?)</sub>', r'_{\1} ', text)
print(latex)

… but I get:

x_{i}  - y_{i)<sub>2} 
x_{i}  - y_{i)_{} 2} 
x_{i}  - y_{i)_{} 2}

What I would like to get:

x_{i}  - y_{i})_{2}

Sounds like text = text.replace('<sub>', '_{').replace('</sub>', '}') should do.
– Wiktor Stribiżew, Commented Apr 12, 2019 at 22:27
@WiktorStribiżew Thanks for your comment. When I try your command I get: x_{i} - y_{i)_{2}. It's almost right, but a } bracket is missing after the second i.
– henry, Commented Apr 14, 2019 at 15:51
How can you describe the place where the } is missing? It is not possible without more detailed requirements.
– Wiktor Stribiżew, Commented Apr 14, 2019 at 17:38
@WiktorStribiżew That is very true. Sorry, yes, you are completely right.
– henry, Commented Apr 14, 2019 at 17:40
My top comment solution is based on the assumption that your texts are segmented into different parts, and the corresponding <sub> / </sub> tag may reside in the next segment, so it should suffice to replace them one by one separately (this is a very common scenario in localization). That means you do not need to do any guesswork. If that is not your case, you should explain the tagged text format or the context the text appears in; otherwise, the "regular" language is of no help.
– Wiktor Stribiżew, Commented Apr 14, 2019 at 17:43

Wiktor Stribiżew · Accepted Answer · 2019-04-14 17:49:33Z

2

Assuming you have texts that are segmented into different parts, the corresponding <sub> / </sub> tags may reside in the adjoining segments, so it should suffice to replace them one by one separately, and you do not need to do any guesswork.

Just use

text = text.replace('<sub>', '_{').replace('</sub>', '}')

to replace each <sub> with _{ and each </sub> with } in any context.
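As a quick check, here is a minimal sketch that applies the two replace calls to the string from the question and to a made-up pair of segments. The helper name `sub_tags_to_latex` and the `segments` list are assumptions for illustration only; they do not come from the question.

```python
def sub_tags_to_latex(text):
    # Hypothetical helper: replace each tag on its own, no pairing assumed.
    return text.replace('<sub>', '_{').replace('</sub>', '}')

html = 'x<sub>i</sub> - y<sub>i)<sub>2</sub>'
print(sub_tags_to_latex(html))  # x_{i} - y_{i)_{2}

# Made-up segmented text: the tag opens in one segment and closes in the next.
segments = ['a<sub>n', '+1</sub> = b']
converted = [sub_tags_to_latex(s) for s in segments]
print(''.join(converted))  # a_{n+1} = b
```

Note that the first result still lacks the } after the second i: a plain replacement cannot guess where a missing closing tag belonged.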


wjandrea · Accepted Answer · 2019-04-13 17:29:26Z

1

You need to use greedy regexes (i.e. without ?) for the unmatched tags. A lazy group at the very end of a pattern, as in <sub>(.*?), always captures an empty string, so only the tag itself gets replaced.

>>> import re
>>> text = '1<sub>2'
>>> re.sub(r'<sub>(.*)', r'_{\1} ', text)
'1_{2} '

BTW while figuring this out, I noticed you can put the second two regexes together like this:

re.sub(r'<sub>(.*)|(.*)</sub>', r'_{\1\2} ', text)
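A minimal sketch of how the pieces might fit together, assuming the properly paired tags are converted first and the combined pattern is run afterwards. The input string is made up for illustration, and the second pass relies on unmatched groups expanding to an empty string in the replacement, which holds on Python 3.5 and later.

```python
import re

# Made-up input: the last <sub> is never closed.
text = 'a<sub>1</sub> + b<sub>2'

# Pass 1: convert properly paired tags with the lazy pattern.
text = re.sub(r'<sub>(.*?)</sub>', r'_{\1} ', text)

# For comparison: a lazy group at the end of the pattern captures nothing.
lazy = re.sub(r'<sub>(.*?)', r'_{\1} ', text)

# Pass 2: the greedy, combined pattern picks up the leftover tag.
# The group that did not take part in the match expands to ''.
greedy = re.sub(r'<sub>(.*)|(.*)</sub>', r'_{\1\2} ', text)

print(repr(lazy))    # 'a_{1}  + b_{} 2'
print(repr(greedy))  # 'a_{1}  + b_{2} '
```

Because the greedy group runs to the end of the line, everything after an unclosed <sub> ends up inside the subscript, so this only suits inputs where that is the intended reading.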

edited Apr 13, 2019 at 17:29

answered Apr 12, 2019 at 23:16


2 Comments

henry Over a year ago

This does not seem to work. I get: _{xi - yi)2}

wjandrea Over a year ago

@henry Yes, you need to replace the matched tags first, then run my regex to replace the unmatched tags. But don't worry about it anymore, since Wiktor's answer is better for your case.
