1

I am trying to use re.sub() to change all html tags < and > to { and }. Here's the catch: I only want to change the matches between <table and </table>.

I can't for the life of me find a regex tutorial or post where one is able to change every regex match, but only between two other regex matches. I've looked at positive/negative lookahead and lookbehind tutorials, etc. but no luck. It's been a good few hours of searching before deciding to post.

Here is the best I've got so far:

(?<=<table)(?:.*?)(<)(?:.*)(?=<\/table>)

This will match one "<" between the table begin and end tags, but I don't know how to get it to match more than one. I've played around with making the any-character groups lazy or not lazy, etc. but no luck.

The point of all this is, I have a string with lots of HTML and I want to keep all of the HTML tags within tables, as well as the tables themselves.

My current plan is to change all of the tags within tables (and the table tags themselves) to either { or }, then delete all HTML tags < and > in the entire document, then change all { and } back to < and >. Doing this should preserve the tables (and any other tags inside).

Example of Input:

<font style = "font-family:inherit>
<any other HTML tags>

random text

<table cellpadding="0" cellspacing="0" style="font-family:times new 
roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;">
<tr>
<td colspan="3">
<font style="font-family:inherit;font-size:12pt;font- 
weight:bold;">washington, d.c. 20549</font>
random text
<any other HTML tags within table tags>
</td>
</table>

random text

<font style = "font-family:inherit>

Example of Output:

<font style = "font-family:inherit>
<any other HTML tags>

random text

{table cellpadding="0" cellspacing="0" style="font-family:times new 
roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;"}
{tr}
{td colspan="3"}
{font style="font-family:inherit;font-size:12pt;font- 
weight:bold;"}washington, d.c. 20549{/font}
random text
{any other HTML tags within table tags}
{/td}
{/table}

random text

<font style = "font-family:inherit>

Thank you, Grog

2
  • Please give an example of your HTML and what you want after replacing. Commented Nov 5, 2018 at 0:59
  • I just edited with example input and output. Thank you for letting me know, I appreciate it. Commented Nov 5, 2018 at 1:18

3 Answers

1

As Serge mentioned, this is not really a problem you want to tackle with a single regular expression, but with multiple regular expressions and some Python magic:

import re

def replacer(match):  # re.sub can take a function as the repl argument, which gives you more flexibility
    choices = {'<': '{', '>': '}'}  # replace < with { and > with }
    return choices[match.group(0)]

result = []  # store the results here
# your_text is the HTML string; split it into table parts and non-table parts
for text in re.split(r'(?s)(<table\b.*?</table>)', your_text):
    if text.startswith('<table'):  # if this is a table part, do the <> replacement
        result.append(re.sub(r'[<>]', replacer, text))
    else:  # otherwise leave it the same
        result.append(text)
print(''.join(result))  # join the list of strings to get the final result

Check out the re.sub documentation for using a function as the repl argument.

And an explanation of the regular expression:

(?s)       # the . matches newlines
(          # start of the capture group
<table\b   # matches the start of an opening table tag
.*?        # lazily matches everything up to the nearest closing tag, so each table is matched separately
</table>   # matches the closing table tag
)          # end of the capture group

Also note that because the whole pattern is in a capture group, the matched tables are included in the strings output by re.split (see the re.split documentation).
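For the full plan described in the question (protect the table tags, strip everything else, restore), a minimal sketch could look like the following. The function name strip_tags_outside_tables and the two compiled patterns are hypothetical names chosen for illustration. It assumes the text contains no literal { or } characters and no nested tables, and it uses a function as the repl argument of re.sub instead of re.split.

```python
import re

# Hypothetical helper patterns, not part of the original answer.
TABLE_RE = re.compile(r'<table\b.*?</table>', re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags_outside_tables(html):
    """Remove HTML tags everywhere except inside <table>...</table> blocks.

    Assumes the input has no literal '{' or '}' and no nested tables.
    """
    # Step 1: inside each table, turn < and > into { and }.
    protected = TABLE_RE.sub(
        lambda m: m.group(0).replace('<', '{').replace('>', '}'), html)
    # Step 2: delete every remaining <...> tag (these are all outside tables).
    stripped = TAG_RE.sub('', protected)
    # Step 3: turn the placeholders back into real angle brackets.
    return stripped.replace('{', '<').replace('}', '>')


sample = ('<font size="2">intro</font>\n'
          '<table border="1">\n<tr><td><b>cell</b></td></tr>\n</table>\n'
          '<i>outro</i>')
print(strip_tags_outside_tables(sample))
```

If the document can contain curly braces, a different pair of placeholder characters that never occurs in the text would be needed.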

1 Comment

Thank you for the coding example Bungi, it worked perfectly! I'm also very happy to have learned something new from looking at your code, as I'm relatively new to Python (and programming in general), learning how to apply new methods is always great.
1

Don't be too hard on yourself. I am not sure it is possible in one shot with the standard re.sub. In fact, I think it is either not possible or highly complicated, for example by passing a custom function as the replacement (you can stuff a lot of custom functionality into your own function, up to a whole HTML parser).

Instead, I highly recommend a simple solution: split and reassemble the text with split/join, or, maybe, settle on a sequence of re replacements.

Assuming one table, l = s.split('table>'); l = l[1] will give you the table content and l.split(. A multi-table version is below:

def curlyfy_el(s, tag='table'):
    head, *fragments = s.split('<%s' % tag)  # text before the first table stays untouched
    return ('{%s' % tag).join([head] + [
        ('{/%s}' % tag).join(
            [y if i != 0 else y.replace("<", "{").replace(">", "}")
             for i, y in enumerate(x.split('</%s>' % tag, 1))])
        for x in fragments])

Slightly more readable:

def curlyfy_el(s, tag='table'):
    h, *t = s.split('<%s' % tag)  # split into the pre-table text and fragments that each start inside a table tag
    r = [h]
    for x in t:
        head, *tail = x.split('</%s>' % tag, 1)  # table body and the rest; maxsplit=1 keeps any further closing tags in the rest
        head = head.replace("<", "{")
        head = head.replace(">", "}")
        r.append(('{/%s}' % tag).join([head, *tail]))  # put the closing tag back as {/table}
    return ('{%s' % tag).join(r)  # put the opening tags back as {table

Generally, for handling HTML it is best to use a dedicated parsing library such as Beautiful Soup; ad hoc code like this will fail on many corner cases.
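As a rough illustration of the parser-based route, here is a sketch that uses the standard library's html.parser instead of Beautiful Soup (a swap made only so the example has no third-party dependency). The class TableKeeper and the function keep_table_tags are hypothetical names. The sketch drops tags outside tables directly, so no { } placeholders are needed; comments and declarations are dropped everywhere.

```python
from html.parser import HTMLParser


class TableKeeper(HTMLParser):
    """Collects text everywhere, but tags only while inside a <table>."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through unchanged
        super().__init__(convert_charrefs=False)
        self.depth = 0   # current nesting depth of open <table> elements
        self.out = []    # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.depth += 1
        if self.depth:
            self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept only inside a table.
        if self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.out.append('</%s>' % tag)
        if tag == 'table' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)


def keep_table_tags(html):
    parser = TableKeeper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


print(keep_table_tags(
    '<p>intro</p><table class="x"><tr><td><b>cell</b></td></tr></table><i>outro</i>'))
```

Because the parser tracks nesting depth, this version also copes with a table inside another table, which the split-based code does not.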

2 Comments

Hi Serge, Thank you for your response, I appreciate it. I tried your method however it's difficult to use because there could be any number of tables in the file and when working with lists that could have any number of values, it starts to get messy with the rest of my coding. I'm sure that if I had more experience, I could get it to work, but I'm just not quite there yet.
enhanced for multitable arrays
0

You can use the following regex to match and then replace with Group 1:

[\s\S]*(<table[\s\S]*?</table>)[\s\S]*

This will match anything before '<table', then create a Group 1 with the table content, and then match everything after that.

Replace with:

$1

That will give you only the table with its content. Note that Python's re.sub writes the Group 1 reference as \1 (or \g<1>) rather than $1.
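A short Python sketch of this replacement, on a made-up input string, shows one thing to be aware of: because the leading [\s\S]* is greedy, only the last table survives when the text contains several. The re.findall line is an alternative added here for comparison, not part of the answer above.

```python
import re

text = 'a<table>1</table>b<table>2</table>c'

# The pattern from this answer, with the replacement written as \1 for Python.
last_only = re.sub(r'[\s\S]*(<table[\s\S]*?</table>)[\s\S]*', r'\1', text)
print(last_only)   # <table>2</table>

# Alternative for comparison: collect every table separately.
all_tables = re.findall(r'<table[\s\S]*?</table>', text)
print(all_tables)  # ['<table>1</table>', '<table>2</table>']
```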

1 Comment

Hi Poul, I tried this method however it doesn't quite get at what I'm asking. I wanted to replace all HTML tags within each table with a different type of tag, but leave all HTML tags outside every table as-is. I appreciate your response though!
