Remove HTML block in Python

Question

Is there a library or some other method in Python to remove an element from an HTML document? For example, I have this document:

<html>
      <head>
          ...
      </head>
      <body>
          <div>
           ...
          </div>
      </body>
</html>

I want to remove the <div></div> block, including its contents, so that the document looks like this:

<html>
    <head>
     ...
    </head>
    <body>
    </body>
</html>

Do you want to remove only the <div></div> tags, or both the tags and the contents inside them?
– Soumendra, Commented Aug 2, 2016 at 15:13
I want to remove the tags and the content between them. But removing only the content is OK as well :)
– JefersonM, Commented Aug 2, 2016 at 15:14
You can try reading the HTML file as XML and removing the div node. wiki.python.org/moin/PythonXml suggests using ElementTree.
– Simon Hänisch, Commented Aug 2, 2016 at 15:14
But the most important thing for me is removing the content, @SimonHänisch.
– JefersonM, Commented Aug 2, 2016 at 15:16

Wso · Accepted Answer · 2016-08-02 15:30:40Z

7

You don't need a library for this. Just use built-in string methods.

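def removeOneTag(text, tag):
    start = text.find("<" + tag + ">")
    end = text.find("</" + tag + ">", start) if start != -1 else -1
    return text if end == -1 else text[:start] + text[end + len(tag) + 3:]  # unchanged if no full pair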

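This removes the first opening tag, the first closing tag, and everything between them. Note that it only matches a bare opening tag such as <div>, not one with attributes. For the input in your example, it would look something like this: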

x = """<html>
    <head>
      ...
    </head>
    <body>
       <div>
         ...
       </div>
    </body>
</html>"""
print(removeOneTag(x, "div"))

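Then, if you want to remove every occurrence of the tag:

tag = "div"
# Keep going while an opening tag is still followed by a closing tag.
while "</" + tag + ">" in x.partition("<" + tag + ">")[2]:
    x = removeOneTag(x, tag)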

answered Aug 2, 2016 at 15:30

Wso

1 Comment

JefersonM Over a year ago

Cool. I really wouldn't need a lib. Thanks!

Ankush Raghuvanshi · Answer · 2016-08-02 15:16 (edited by Community, 2017-05-23 12:22:59Z)

0

Personally, I don't think you need a library for this.

You can simply write a Python script that reads the HTML file and uses a regex to match the tags you are interested in, then do whatever you want with the matches (delete them, in your case).
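To make the regex idea concrete, here is a minimal sketch using only the standard re module. The helper name remove_blocks_regex and the sample string are made up for illustration. It assumes the blocks are not nested: the lazy match stops at the first closing tag, so nested blocks of the same tag would leave a stray closing tag behind.

```python
import re


def remove_blocks_regex(html, tag):
    """Delete every <tag ...>...</tag> span, tags included (hypothetical helper)."""
    # \b keeps "div" from matching "divider"; [^>]* allows attributes;
    # .*? with re.DOTALL lazily matches across newlines up to the first closing tag.
    pattern = re.compile(
        r"<{0}\b[^>]*>.*?</{0}\s*>".format(re.escape(tag)),
        re.DOTALL | re.IGNORECASE,
    )
    return pattern.sub("", html)


sample = '<body>\n<div class="note">\n  ...\n</div>\n<p>kept</p>\n</body>'
print(remove_blocks_regex(sample, "div"))
```

Unlike an exact string search for "<div>", this also matches opening tags that carry attributes.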

That said, there is a library for this.

See the official documentation: https://docs.python.org/2/library/htmlparser.html

See also: Extracting text from HTML file using Python
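As a rough sketch of the library route, the parser from the documentation linked above can re-emit the document while skipping one kind of block. This version assumes Python 3, where the module is html.parser rather than the Python 2 HTMLParser module. The names BlockRemover and remove_blocks_parser are hypothetical. It tracks nesting depth, so nested blocks of the same tag are removed as a whole, and closing tags are written back in lowercase because that is how the parser reports them.

```python
from html.parser import HTMLParser


class BlockRemover(HTMLParser):
    """Re-emit a document while skipping one tag and everything inside it."""

    def __init__(self, tag):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()  # the parser reports tag names in lowercase
        self.depth = 0          # > 0 while inside a block being removed
        self.out = []

    def _emit(self, text):
        # Only keep text that is outside every block being removed.
        if self.depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        else:
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
        else:
            self._emit("</" + tag + ">")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing target is dropped.
        if tag != self.tag:
            self._emit(self.get_starttag_text())

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&" + name + ";")

    def handle_charref(self, name):
        self._emit("&#" + name + ";")

    def handle_comment(self, data):
        self._emit("<!--" + data + "-->")

    def handle_decl(self, decl):
        self._emit("<!" + decl + ">")


def remove_blocks_parser(html, tag):
    parser = BlockRemover(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling remove_blocks_parser(document, "div") returns the document text without any div blocks. Processing instructions are not re-emitted in this sketch.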

edited May 23, 2017 at 12:22

Community Bot

answered Aug 2, 2016 at 15:16

Ankush Raghuvanshi

Frangipanes · Answer · 2016-08-02 15:15:40Z

-2

Try using an HTML parser such as BeautifulSoup to select the <div> element. Once it is selected, you can remove it from the parsed tree (BeautifulSoup elements have a decompose() method for this), so no regex is needed.
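A minimal sketch of the BeautifulSoup route might look like the following. It assumes the third-party beautifulsoup4 package is installed, and the helper name remove_blocks_bs4 is made up for illustration. The import sits inside the function so that a script using it still loads on a machine without the package.

```python
def remove_blocks_bs4(html, tag):
    """Remove every <tag> element and its contents using BeautifulSoup."""
    # Third-party dependency, assumed installed: pip install beautifulsoup4
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(tag)
    while element is not None:
        element.decompose()  # removes the element and everything inside it
        element = soup.find(tag)
    return str(soup)
```

Looking up one element at a time avoids touching nested elements that were already destroyed together with their parent. The returned string is the parser's serialization of the tree, so whitespace and attribute quoting may differ slightly from the input.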

answered Aug 2, 2016 at 15:15

Frangipanes
