Regex for Removing Duplicate HTML Tags in Python

Question

I would like to reduce

<p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> abcabc </p><p>
</p><p> defdef </p><p>
 </p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> xyzxyz

to

<p></p> abcabc </p><p>defdef</p><p></p> xyzxyz

I tried:

str.replace('+', '') and

re.sub('</p><p>+', '</p><p>', str)

Neither worked. Any advice on how to do this? Many thanks.
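A note on the second attempt, as a minimal sketch: in '</p><p>+' the + quantifier applies only to the final > character, not to the whole </p><p> sequence, and the pattern makes no allowance for the newlines between the tags. Grouping the sequence and permitting whitespace lets + repeat the whole pair. The input string and variable names below are illustrative and not from the original post.

```python
import re

# Illustrative input: a run of closing/opening pairs separated by newlines.
html = "<p>\n</p><p>\n</p><p>\n</p><p>\n</p> abc"

# Here '+' quantifies only the last '>', so each match of '</p><p>' is
# replaced by itself and the newlines are never touched.
unchanged = re.sub(r'</p><p>+', '</p><p>', html)

# Grouping the pair and allowing trailing whitespace lets '+' repeat the
# whole sequence, so a run of pairs collapses into a single pair.
collapsed = re.sub(r'(?:</p><p>\s*)+', '</p><p>', html)

print(unchanged == html)  # True
print(repr(collapsed))    # '<p>\n</p><p></p> abc'
```

This only illustrates the quantifier scope; it does not by itself produce the exact output requested above.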

What's the logic in the result that you want? What is the condition when  should be replaced?
– Mohammad Yusuf, Commented Feb 7, 2017 at 4:09

alecxe · Accepted Answer · 2017-02-07 04:20:23Z

1

Alternative approach: you can solve it with an HTML parser, like BeautifulSoup. The idea is to find all p elements except the first one and remove them from the tree:

In [1]: from bs4 import BeautifulSoup

In [2]: data = "<p></p><p></p><p></p><p></p>"

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: for p in soup('p')[1:]:
   ...:     p.decompose()   

In [5]: print(soup)
<p></p>

Or, you can find the first p element and remove all the next p siblings:

In [6]: soup = BeautifulSoup(data, "html.parser")

In [7]: for p in soup.p.find_next_siblings('p'):
   ...:     p.decompose()  

In [8]: print(soup)
<p></p>

Updated solution for the updated problem (removing p elements whose text is empty or whitespace only):

In [10]: data = """<p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p> abcabc </p><p>
    ...: </p><p> defdef </p><p>
    ...:  </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p> xyzxyz"""

In [11]: soup = BeautifulSoup(data, "html.parser")

In [12]: for p in soup.find_all("p", text=lambda text: not text.strip()):
    ...:     p.decompose()
    ...:     

In [13]: print(soup)
 abcabc <p> defdef </p> xyzxyz

edited Feb 7, 2017 at 4:20

answered Feb 7, 2017 at 3:44

alecxe


2 Comments

CL. L Over a year ago

Thanks for the quick reply. Your solution works for the text I originally specified; sorry for the confusion. The actual string also contains some text between the duplicated tags I want to remove. I have edited the question to reflect the actual situation. A solution using sub or replace would be appreciated.

alecxe Over a year ago

@CL.L okay, I've updated the answer - please see if it works for you. Thanks.

CL. L · Accepted Answer · 2017-02-07 04:20:06Z

0

I don't know why the previous answers were removed, but one of them hit the point with the following code:

str1= re.sub(r'\n', r'', re.sub(r'<p>\n?</p>(?![ \w]+)', r'', str1))

It can actually be simplified further to:

str1= re.sub(r'\n', r'', re.sub(r'<p>\n?</p>', r'', str1))

Credit should go to that person if they post the answer again.
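As a hedged aside, the two calls above do not behave identically on input shaped like the question's sample. The negative lookahead keeps an empty <p>...</p> pair when a space or word character follows it, while the shorter pattern removes every such pair. The sketch below compares them on a shortened version of the sample; the helper name strip_empty and the shortened input are assumptions made for illustration.

```python
import re

# Shortened version of the sample from the question (illustrative).
sample = (
    "<p>\n</p><p>\n</p> abcabc </p><p>\n"
    "</p><p> defdef </p><p>\n </p><p>\n</p><p>\n</p> xyzxyz"
)

def strip_empty(text, pattern):
    """Remove matches of `pattern`, then delete all remaining newlines."""
    return re.sub(r'\n', r'', re.sub(pattern, r'', text))

# Pattern with the lookahead: pairs followed by a space or word char survive.
with_lookahead = strip_empty(sample, r'<p>\n?</p>(?![ \w]+)')

# Shorter pattern: every '<p>' + optional newline + '</p>' is removed.
without_lookahead = strip_empty(sample, r'<p>\n?</p>')

print(with_lookahead)
print(without_lookahead)
```

On this input the first call leaves "<p></p>" in front of " abcabc" and " xyzxyz", and the second does not, so which one counts as correct depends on the output wanted.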

edited Feb 7, 2017 at 4:20

answered Feb 7, 2017 at 4:12

CL. L


2 Comments

Mohammad Yusuf Over a year ago

Well, that was me, but it doesn't give you what you asked for in the question. Also, don't use str as a variable name; it masks the built-in class str.

Mohammad Yusuf Over a year ago

Check now. Perhaps this answer will be closer to what you have put in the requirement.

Mohammad Yusuf · Accepted Answer · 2017-02-07 04:44:41Z

0

You can try something like this:

import re

a="""<p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> abcabc </p><p>
</p><p> defdef </p><p>
 </p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> xyzxyz"""

print(re.sub(r'</p><p>(?= ?</p><p>)', r'', re.sub(r'\n', r'', re.sub(r'<p>\n?</p>(?![ \w]+)', r'', a))))

Output:

<p></p> abcabc </p><p> defdef  </p><p></p> xyzxyz
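The nested call may be easier to follow when unrolled into three named steps. The following is a sketch using the same three patterns; the function name collapse_empty_paragraphs and the shortened input are hypothetical and chosen only for illustration.

```python
import re

def collapse_empty_paragraphs(text):
    # Step 1: drop '<p>' + optional newline + '</p>' unless a space or a
    # word character follows it.
    text = re.sub(r'<p>\n?</p>(?![ \w]+)', r'', text)
    # Step 2: delete all remaining newlines.
    text = re.sub(r'\n', r'', text)
    # Step 3: drop a '</p><p>' that is immediately followed by an optional
    # space and another '</p><p>'.
    return re.sub(r'</p><p>(?= ?</p><p>)', r'', text)

# Shortened version of the sample input (illustrative).
sample = (
    "<p>\n</p><p>\n</p> abcabc </p><p>\n"
    "</p><p> defdef </p><p>\n </p><p>\n</p><p>\n</p> xyzxyz"
)

result = collapse_empty_paragraphs(sample)
print(result)
```

On this shortened input the result matches the output shown above, including the two spaces after "defdef" that are left behind when step 3 removes the tag pair between them.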

edited Feb 7, 2017 at 4:44

answered Feb 7, 2017 at 4:02

Mohammad Yusuf

