0

There are lots of XML and HTML parsers in Python, and I am looking for a simple way to extract a section of an HTML document, preferably using an XPath expression, though that is optional.

Here is an example:

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"

I want to extract the entire body of the element with id=content, so the result should be: <div id=content>AAA<B>BBB</B>CCC</div>

It would be nice if I could do this without installing a new library.

I would also prefer to get the original content of the desired element (not reformatted).

Using regular expressions is not an option, as they are not safe for parsing XML/HTML.

2 Answers

1

If you are open to using a library, BeautifulSoup is the best way. Here is a snippet showing how it would work for you:

from bs4 import BeautifulSoup  # BeautifulSoup 4 package (bs4)

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
soupy = BeautifulSoup(src, "html.parser")

# find every element whose id attribute is "content"
content_divs = soupy.find_all(attrs={'id': 'content'})
if content_divs:
    # print the first one
    print(str(content_divs[0]))

    # to print the text contents
    print(content_divs[0].text)

    # or to print all the raw html
    for each in content_divs:
        print(each)

0

Yeah, I have done this. It may not be the best way to do it, but it works something like the code below. I haven't tested this.

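import re

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"

# re.search returns a single match object (re.finditer returns an iterator, which has no .start())
match = re.search("<div id=content>", src)
src = src[match.start():]

# At this point the string starts with your div; everything preceding it has been stripped.
# This next part works because the first </div> in the string is the end of your div section.
match = re.search("</div>", src)
src = src[:match.end()]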

src now contains just the div you're after. If there are situations where another div is nested inside the one you want, you will have to build a fancier search pattern.
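
For the nested case, a more elaborate pattern gets fragile quickly. One alternative, sketched here and not part of the approach above, is to let the standard library's html.parser track nesting depth and then slice the original string, so the markup comes back unmodified and nothing needs to be installed. The names ElementSlicer and extract_element are made up for this illustration, and the sketch assumes Python 3 and that the whole document is fed to the parser in one call.

```python
from html.parser import HTMLParser


class ElementSlicer(HTMLParser):
    """Record where the element with a given id starts and ends in the source."""

    def __init__(self, source, target_id):
        super().__init__()
        self.source = source
        self.target_id = target_id
        # Absolute index of the first character of each line, because
        # getpos() reports (line number, column) rather than one offset.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.tag = None    # tag name of the target element, once found
        self.depth = 0     # nesting depth of same-named tags inside the target
        self.start = None  # index of the target's opening "<"
        self.end = None    # index just past the target's closing ">"

    def _abs_pos(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if self.end is not None:
            return  # target already captured
        if self.tag is None:
            if dict(attrs).get("id") == self.target_id:
                self.tag = tag
                self.depth = 1
                self.start = self._abs_pos()
        elif tag == self.tag:
            self.depth += 1  # a nested element with the same tag name

    def handle_endtag(self, tag):
        if self.tag is None or self.end is not None or tag != self.tag:
            return
        self.depth -= 1
        if self.depth == 0:
            # getpos() points at the start of the end tag; include it fully.
            self.end = self.source.index(">", self._abs_pos()) + 1


def extract_element(source, target_id):
    """Return the original text of the element with the given id, or None."""
    slicer = ElementSlicer(source, target_id)
    slicer.feed(source)
    slicer.close()
    if slicer.start is None or slicer.end is None:
        return None
    return source[slicer.start:slicer.end]


src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
print(extract_element(src, "content"))
```

Because the result is a slice of src, attribute quoting and tag case are kept exactly as written. Badly unbalanced markup inside the target element can still throw the depth count off, so treat this as a starting point.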
