I would like to reduce

<p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> abcabc </p><p>
</p><p> defdef </p><p>
 </p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> xyzxyz

to

<p></p> abcabc </p><p>defdef</p><p></p> xyzxyz

I tried:

str.replace('</p><p>+', '</p><p>') and

re.sub('</p><p>+', '</p><p>', str)

Neither worked. Any advice on how to do this? Many thanks.

  • What's the logic in the result that you want? What is the condition when <p></p> should be replaced? Commented Feb 7, 2017 at 4:09

3 Answers

Alternative approach: you can solve it with an HTML parser, like BeautifulSoup. The idea is to find all p elements except the first one and remove them from the tree:

In [1]: from bs4 import BeautifulSoup

In [2]: data = "<p></p><p></p><p></p><p></p>"

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: for p in soup('p')[1:]:
   ...:     p.decompose()   

In [5]: print(soup)
<p></p>

Or, you can find the first p element and remove all of its following p siblings:

In [6]: soup = BeautifulSoup(data, "html.parser")

In [7]: for p in soup.p.find_next_siblings('p'):
   ...:     p.decompose()  

In [8]: print(soup)
<p></p>

Updated solution for the updated problem (removing p elements whose text is empty or whitespace-only):

In [10]: data = """<p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p> abcabc </p><p>
    ...: </p><p> defdef </p><p>
    ...:  </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p> xyzxyz"""

In [11]: soup = BeautifulSoup(data, "html.parser")

In [12]: for p in soup.find_all("p", text=lambda text: not text.strip()):
    ...:     p.decompose()
    ...:     

In [13]: print(soup)
 abcabc <p> defdef </p> xyzxyz
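
One caveat about the session above: the lambda calls text.strip(), which assumes every matched p holds a single string. A p with no text node at all (for example `<p></p>`) has .string equal to None. Below is a minimal sketch of a variant that avoids relying on that, assuming bs4 is installed; the function name drop_empty_paragraphs and the sample input are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

def drop_empty_paragraphs(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like snapshot, so removing tags while looping is fine.
    for p in soup.find_all("p"):
        # get_text(strip=True) is "" for <p></p> and for whitespace-only paragraphs.
        if not p.get_text(strip=True):
            p.decompose()
    return str(soup)

print(drop_empty_paragraphs("<p></p><p> defdef </p><p>\n</p>"))  # <p> defdef </p>
```

The sample here is well-formed on purpose; how a parser treats the stray closing tags in the question's input depends on the parser chosen.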

2 Comments

Thanks for the quick reply. Your solution works for the text I originally specified. Sorry for the confusion: the actual string has some text in it, and I want to remove the duplicates around that text. I have edited the question to reflect the actual situation. A solution using sub or replace would be appreciated.
@CL.L okay, I've updated the answer - please see if it works for you. Thanks.

I don't know why the previous answers were removed, but one of them hit the point with the following code:

str1= re.sub(r'\n', r'', re.sub(r'<p>\n?</p>(?![ \w]+)', r'', str1))

It can actually be simplified further to:

str1= re.sub(r'\n', r'', re.sub(r'<p>\n?</p>', r'', str1))

Credit should go to that person if they post the answer again.
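
As background on why the two attempts in the question had no effect: str.replace matches its argument literally, so the + is treated as a plain character, and in re.sub the + in '</p><p>+' repeats only the final >. The following is a minimal sketch of grouping the whole tag pair so that the quantifier applies to the pair. It is not the removed answer's method, and the name collapse_pairs and the sample string are made up for illustration.

```python
import re

def collapse_pairs(html):
    # (?:...)+ repeats the whole "</p><p>" pair rather than only the last ">".
    # \s* swallows newlines and spaces sitting between or after the pairs.
    return re.sub(r'(?:</p>\s*<p>\s*)+', '</p><p>', html)

sample = "a</p><p>\n</p><p>\n </p><p>b"
print(collapse_pairs(sample))  # a</p><p>b

# The pattern from the question only collapses repeated ">" characters:
print(re.sub('</p><p>+', '</p><p>', "a</p><p>>>b"))  # a</p><p>b
# ...so on input with newlines between the pairs it changes nothing:
print(re.sub('</p><p>+', '</p><p>', sample) == sample)  # True
```

This sketch is not tuned to reproduce the exact target string from the question (it also strips whitespace after each pair); it only illustrates the grouping.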

2 Comments

Well, that was me, but it doesn't give you what you asked for in the question. Also, don't use str as a variable name; it masks the built-in class str.
Check now. Perhaps this answer is closer to what you asked for.

You can try something like this:

import re

a="""<p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> abcabc </p><p>
</p><p> defdef </p><p>
 </p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> xyzxyz"""

print(re.sub(r'</p><p>(?= ?</p><p>)', r'', re.sub(r'\n', r'', re.sub(r'<p>\n?</p>(?![ \w]+)', r'', a))))

Output:

<p></p> abcabc </p><p> defdef  </p><p></p> xyzxyz
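
The nested call above applies three substitutions from the inside out. A sketch that spells them out as separate steps may be easier to follow; the function name clean_paragraphs is hypothetical, and the input is the string from the question rebuilt compactly.

```python
import re

def clean_paragraphs(text):
    # Step 1: drop "<p>", an optional newline, "</p>" unless a space or word
    # character follows (the negative lookahead keeps pairs that precede text).
    text = re.sub(r'<p>\n?</p>(?![ \w]+)', '', text)
    # Step 2: remove every remaining newline.
    text = re.sub(r'\n', '', text)
    # Step 3: drop a "</p><p>" that is followed by an optional space
    # and another "</p><p>".
    return re.sub(r'</p><p>(?= ?</p><p>)', '', text)

# Same content as the multi-line string a in the answer.
a = ("<p>\n" + "</p><p>\n" * 5
     + "</p> abcabc </p><p>\n"
     + "</p><p> defdef </p><p>\n"
     + " </p><p>\n" + "</p><p>\n" * 6
     + "</p> xyzxyz")

print(clean_paragraphs(a))
# <p></p> abcabc </p><p> defdef  </p><p></p> xyzxyz
```

Note that the result keeps the spaces around defdef, so it is close to, but not identical to, the target string in the question.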
