Python finding multiple regex matches within a regex match

Question

I am trying to use re.sub() to change the < and > of all HTML tags to { and }. Here's the catch: I only want to change the matches between <table and </table>.

I can't for the life of me find a regex tutorial or post where one is able to change every regex match, but only between two other regex matches. I've looked at positive/negative lookahead and lookbehind tutorials, etc. but no luck. It's been a good few hours of searching before deciding to post.

Here is the best I've got so far:

(?<=<table)(?:.*?)(<)(?:.*)(?=<\/table>)

This will match one "<" between the table begin and end tags, but I don't know how to get it to match more than one. I've played around with making the any-character groups lazy or not lazy, etc. but no luck.

The point of all this is, I have a string with lots of HTML and I want to keep all of the HTML tags within tables, as well as the tables themselves.

My current plan is to change all of the tags within tables (and the table tags themselves) to either { or }, then delete all HTML tags < and > in the entire document, then change all { and } back to < and >. Doing this should preserve the tables (and any other tags inside).
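Steps two and three of this plan can be sketched as follows. This is only an illustration under stated assumptions: the helper name strip_then_restore and the sample string are made up, the input is assumed to have already had its table tags converted to braces, and the text is assumed to contain no literal { or } of its own (those would be turned into < and > as well).

```python
import re

# Translation table that turns the protective braces back into angle brackets.
TO_ANGLES = str.maketrans('{}', '<>')

def strip_then_restore(marked):
    """Delete every remaining <...> tag, then restore the protected ones."""
    # Step 2: anything still written with < and > lies outside a table.
    stripped = re.sub(r'<[^>]*>', '', marked)
    # Step 3: the braces stand for the tags that should be kept.
    return stripped.translate(TO_ANGLES)

marked = '<p>intro</p>{table}{td}x{/td}{/table}<i>after</i>'
print(strip_then_restore(marked))
```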

Example of Input:

<font style = "font-family:inherit>
<any other HTML tags>

random text

<table cellpadding="0" cellspacing="0" style="font-family:times new 
roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;">
<tr>
<td colspan="3">
<font style="font-family:inherit;font-size:12pt;font- 
weight:bold;">washington, d.c. 20549</font>
random text
<any other HTML tags within table tags>
</td>
</table>

random text

<font style = "font-family:inherit>

Example of Output:

<font style = "font-family:inherit>
<any other HTML tags>

random text

{table cellpadding="0" cellspacing="0" style="font-family:times new 
roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;"}
{tr}
{td colspan="3"}
{font style="font-family:inherit;font-size:12pt;font- 
weight:bold;"}washington, d.c. 20549{/font}
random text
{any other HTML tags within table tags}
{/td}
{/table}

random text

<font style = "font-family:inherit>

Thank you, Grog

Please give an example of your HTML and what you want after replacing.
– Poul Bak, Commented Nov 5, 2018 at 0:59
I just edited with example input and output. Thank you for letting me know, I appreciate it.
– Grogsaurous, Commented Nov 5, 2018 at 1:18

bunji · Accepted Answer · 2018-11-05 02:48:46Z

1

As Serge mentioned, this is not really a problem you want to tackle with a single regular expression, but with multiple regular expressions and some Python magic:

import re

def replacer(match):  # re.sub accepts a function as the repl argument, which gives you more flexibility
    choices = {'<': '{', '>': '}'}  # replace < with { and > with }
    return choices[match.group(0)]

result = []  # store the results here
# your_text is the HTML string to process; split it into table parts and non-table parts
for text in re.split(r'(?s)(?=<table)(.*?)(?<=</table>)', your_text):
    if text.startswith('<table'):  # if this is a table part, do the <> replacement
        result.append(re.sub(r'[<>]', replacer, text))
    else:  # otherwise leave it the same
        result.append(text)
print(''.join(result))  # join the list of strings to get the final result

See the re.sub documentation for details on passing a function as the repl argument.

And an explanation of the regular expression:

(?s)          # the . matches newlines
(?=<table)    # positive look-ahead matching '<table'
(.*?)         # lazily matches everything from <table up to the nearest </table> (inclusive, because of the look-ahead/behind)
(?<=</table>) # positive look-behind matching '</table>'

The lazy quantifier matters when the text contains more than one table: a greedy (.*) would run from the first <table to the last </table> and treat the text in between as part of a table.

Also note that because (.*?) is in a capture group, it is included in the strings output by re.split (see the re.split documentation).
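The same idea, a replacement that runs inside each outer match, can also be written as one re.sub whose repl function handles the inner replacement. A minimal sketch, assuming tables are not nested; the names TABLE, TO_BRACES and curlify_tables are made up for illustration:

```python
import re

# Outer pattern: one whole table, matched lazily so each table is separate.
TABLE = re.compile(r'<table\b.*?</table>', re.DOTALL | re.IGNORECASE)
# Inner replacement: swap angle brackets for braces.
TO_BRACES = str.maketrans('<>', '{}')

def curlify_tables(text):
    """Replace < and > with { and }, but only inside <table>...</table>."""
    return TABLE.sub(lambda m: m.group(0).translate(TO_BRACES), text)

sample = 'a<b>x</b><table><td>1</td></table><i>y</i><table class="c"><tr></tr></table>z'
print(curlify_tables(sample))
```

Text outside the tables is never passed to the callback, so it needs no special handling.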

edited Nov 5, 2018 at 2:48

answered Nov 5, 2018 at 2:41

bunji


1 Comment

Grogsaurous Over a year ago

Thank you for the coding example, bunji; it worked perfectly! I'm also very happy to have learned something new from looking at your code. I'm relatively new to Python (and programming in general), so learning how to apply new methods is always great.

Serge · Accepted Answer · 2018-11-11 17:24:25Z

1

Don't be too hard on yourself. I am not sure it is possible in one shot with a standard re.sub. In fact, I think it is either impossible or highly complicated, for example by passing a custom function as the replacement (you can put a lot of functionality into your own function, up to a whole HTML parser).

Instead, I highly recommend a simple solution: split and reassemble with split/join, or maybe settle for a sequence of re replacements.

Assuming one table, l = s.split('table>') followed by l[1] will give you the table content, which you can then split further. A multi-table version is below:

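def curlyfy_el(s, tag='table'):
    # Fragment 0 is the text before the first table and is left untouched;
    # in every later fragment only the part before the closing tag is converted.
    return ('{%s' % tag).join(
        ('{/%s}' % tag).join(
            y.replace("<", "{").replace(">", "}") if n and i == 0 else y
            for i, y in enumerate(x.split('</%s>' % tag, 1)))
        for n, x in enumerate(s.split('<%s' % tag)))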

Slightly more readable:

def curlyfy_el(s, tag='table'):
    h, *t = s.split('<%s' % tag)  # split into the pre-table text and fragments that each start inside a table tag
    r = [h]
    for x in t:
        head, *tail = x.split('</%s>' % tag, 1)  # table body and the rest; maxsplit=1 keeps any duplicate closing tag in the rest
        head = head.replace("<", "{")
        head = head.replace(">", "}")
        r.append(('{/%s}' % tag).join([head, *tail]))
    return ('{%s' % tag).join(r)  # re-insert the opening tag that the first split removed

Generally, for handling HTML it is best to use a dedicated parsing library such as Beautiful Soup; ad hoc code will fail on many corner cases.
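As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser instead of Beautiful Soup (a substitution made only to avoid a third-party dependency). It goes straight to the end goal of keeping markup inside tables and dropping tags elsewhere. The names TableTagKeeper and keep_table_markup are made up, and comments and doctype declarations are simply discarded.

```python
from html.parser import HTMLParser

class TableTagKeeper(HTMLParser):
    """Keep markup inside <table>...</table>; drop every tag outside tables."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.depth = 0  # number of currently open <table> elements

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.depth += 1
        if self.depth:
            self.parts.append(self.get_starttag_text())  # raw tag text

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept only inside a table.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.parts.append('</%s>' % tag)
            if tag == 'table':
                self.depth -= 1

    def handle_data(self, data):
        self.parts.append(data)  # text is always kept

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

def keep_table_markup(html):
    parser = TableTagKeeper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

sample = '<p>intro</p><table><tr><td><b>x</b></td></tr></table><i>after</i>'
print(keep_table_markup(sample))
```

Because the parser tracks table depth, nested tables are handled without any extra bookkeeping.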

edited Nov 11, 2018 at 17:24

answered Nov 5, 2018 at 1:49

Serge


2 Comments

Grogsaurous Over a year ago

Hi Serge, Thank you for your response, I appreciate it. I tried your method however it's difficult to use because there could be any number of tables in the file and when working with lists that could have any number of values, it starts to get messy with the rest of my coding. I'm sure that if I had more experience, I could get it to work, but I'm just not quite there yet.

Serge Over a year ago

enhanced for multitable arrays

Poul Bak · Accepted Answer · 2018-11-05 01:57:12Z

0

You can use the following regex to match and then replace with Group 1:

[\s\S]*(<table[\s\S]*?</table>)[\s\S]*

This will match anything before '<table', then create a Group 1 with the table content, and then match everything after that.

Replace with:

$1

That will give you only the table with content.
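In Python's re module the group reference in the replacement is written \1 (or \g<1>) rather than $1. A minimal sketch of the same substitution; the function name only_table and the sample string are made up:

```python
import re

PATTERN = r'[\s\S]*(<table[\s\S]*?</table>)[\s\S]*'

def only_table(text):
    # Replace the whole match with group 1, i.e. keep just the table.
    return re.sub(PATTERN, r'\1', text)

sample = 'before <b>bold</b> <table><tr><td>1</td></tr></table> after'
print(only_table(sample))
```

Because the leading [\s\S]* is greedy, a text with several tables keeps only the last one, and a text with no table is returned unchanged.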

answered Nov 5, 2018 at 1:57

Poul Bak


1 Comment

Grogsaurous Over a year ago

Hi Poul, I tried this method however it doesn't quite get at what I'm asking. I wanted to replace all HTML tags within each table with a different type of tag, but leave all HTML tags outside every table as-is. I appreciate your response though!
