3

I'd like to know if there's a library or some method in Python to remove an element from an HTML document. For example:

I have this document:

<html>
      <head>
          ...
      </head>
      <body>
          <div>
           ...
          </div>
      </body>
</html>

I want to remove the <div></div> block, along with its contents, from the document so that it looks like this:

<html>
    <head>
     ...
    </head>
    <body>
    </body>
</html>

6
  • Do you want to remove only the <div></div> tags or both the tags & the contents inside those? Commented Aug 2, 2016 at 15:13
  • I want to remove the tags and the content between them. But removing only the content is OK as well :) Commented Aug 2, 2016 at 15:14
  • You can try reading the HTML file as XML and removing the div node. wiki.python.org/moin/PythonXml suggests using ElementTree. Commented Aug 2, 2016 at 15:14
  • But the most important thing for me is removing the content @SimonHänisch Commented Aug 2, 2016 at 15:16
  • Removing the node includes removing the content of the node. Commented Aug 2, 2016 at 15:16

3 Answers

7

You don't need a library for this. Just use the built-in string methods.

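def removeOneTag(text, tag):
    start = text.find("<" + tag + ">")
    end = text.find("</" + tag + ">", max(start, 0))  # len(tag) + 3 below is len("</tag>")
    return text if start < 0 or end < 0 else text[:start] + text[end + len(tag) + 3:]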

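This removes everything from the first opening tag through the first closing tag, including the tags themselves. It only matches a bare opening tag such as <div> (no attributes) and does not handle nested tags. So your input in the example would be something like...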

x = """<html>
    <head>
      ...
    </head>
    <body>
       <div>
         ...
       </div>
    </body>
</html>"""
print(removeOneTag(x, "div"))

Then if you wanted to remove ALL the tags...

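tag = "div"
# Keep removing blocks while both an opening and a closing tag remain.
while "<" + tag + ">" in x and "</" + tag + ">" in x:
    x = removeOneTag(x, tag)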
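
The same idea can also be written as a single pass over the text, which avoids rescanning from the start for every block. This is only a sketch: the name remove_all_blocks is made up for illustration, and it assumes the blocks are not nested and that the opening tag has no attributes.

```python
def remove_all_blocks(text, tag):
    """Remove every <tag>...</tag> block using only str.find.

    Hypothetical helper: assumes non-nested blocks and bare opening tags.
    """
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"
    pieces = []
    pos = 0
    while True:
        start = text.find(open_tag, pos)
        # Only look for a closing tag that comes after the opening tag.
        end = text.find(close_tag, start) if start != -1 else -1
        if end == -1:
            # No complete block left: keep the rest of the text as-is.
            pieces.append(text[pos:])
            break
        pieces.append(text[pos:start])
        pos = end + len(close_tag)
    return "".join(pieces)


print(remove_all_blocks("<body><div>a</div>x<div>b</div></body>", "div"))
# prints: <body>x</body>
```

If there is no complete block left, the remaining text is returned unchanged, so the function always terminates.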

1 Comment

Cool. I really wouldn't need a lib. Thanks!
0

I personally feel that you don't need a library for this.

You can simply write a Python script that reads the HTML file and uses a regex to match the HTML tags you want, then do whatever you like with them (delete them, in your case).

That said, a library for this does exist.

See the official documentation -> https://docs.python.org/2/library/htmlparser.html

Also see this -> Extracting text from HTML file using Python
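
As a rough illustration of the library route, here is a sketch built on the standard-library parser, which is named html.parser in Python 3 (the link above is to the Python 2 docs). The names TagStripper and strip_tag are made up for this example. It re-emits everything it parses except the chosen element and its contents, and it assumes that lowercased end tags in the output are acceptable.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: copies the document, skipping one element type."""

    def __init__(self, tag):
        # convert_charrefs=False so entities like &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.tag = tag
        self.depth = 0  # how many target elements we are currently inside
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; drop them if they are the target.
        if tag != self.tag and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
        elif self.depth == 0:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append("&" + name + ";")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append("&#" + name + ";")

    def handle_comment(self, data):
        if self.depth == 0:
            self.out.append("<!--" + data + "-->")

    def handle_decl(self, decl):
        if self.depth == 0:
            self.out.append("<!" + decl + ">")


def strip_tag(html, tag):
    parser = TagStripper(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_tag('<body><div class="a"><div>y</div></div><p>keep</p></body>', "div"))
# prints: <body><p>keep</p></body>
```

Unlike plain string matching or a simple regex, this copes with attributes on the tag and with nested elements of the same type, because the parser tracks how deep inside the target element it is.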


-2

Try using an HTML parser such as BeautifulSoup to select the <div> DOM element. Then you can remove it, for example with the element's decompose() method, or with a regex or similar.

