2

I am doing some data analysis that requires extracting the page title, breadcrumb, and h1 from hundreds of HTML and SHTML files.

Those tags are in the following format (meaning the content inside <title>, <h1>, and the breadcrumb):

<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive &lt; eHelp &lt; Cal Poly Pomona</title>

<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> &raquo; <a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X<!-- InstanceEndEditable --></p>


<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" --><a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X  <!-- InstanceEndEditable --></h1>

After getting those tags, I want to further extract the first part of the title ("Mapping a Drive: Macintosh OSX"), the last part of the breadcrumb ("Mac OS X"), and the whole h1 ("Mapping a Drive: Macintosh OS X").

Any idea how that can be accomplished?

Comments

  • Day by day, questions on parsing HTML with regex pop up. Read this if you haven't yet :-) Commented Sep 13, 2011 at 21:14
  • @tchrist It is a simile for something left behind to tell you how you got there so that you don't get lost. Commented Sep 13, 2011 at 21:35
  • @chown simile ≠ metaphor Commented Sep 13, 2011 at 21:36
  • @tchrist: it's a path from the site root that tells you how you get to the page you are viewing. Something like amazon > electronics > game console > PS3 Commented Sep 13, 2011 at 21:40
  • @chown, not that we're in english.stackexchange or anything, but the source you linked to says a simile uses the word "like" or similar. Your description of breadcrumb doesn't use anything of the kind. It is not a simile. Commented Sep 13, 2011 at 22:22

3 Answers

6

Use a real HTML parser, not a regex. You will be happier. lxml.html is highly regarded, as is BeautifulSoup.
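To give a feel for the parser approach, here is a minimal sketch. It uses the standard library's html.parser as a stand-in so it runs without installing anything; lxml.html or BeautifulSoup would be shorter in practice. The names PageInfoParser, squeeze, and extract_page_info are hypothetical, and the code assumes every page marks its breadcrumb with the same InstanceBeginEditable name="breadcrumb" comment, separates title parts with "<", and separates breadcrumb parts with "»", as in the sample from the question.

```python
from html.parser import HTMLParser

# The three fragments from the question, used as sample input.
SAMPLE = """\
<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive &lt; eHelp &lt; Cal Poly Pomona</title>
<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> &raquo; <a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X<!-- InstanceEndEditable --></p>
<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" --><a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X  <!-- InstanceEndEditable --></h1>
"""


class PageInfoParser(HTMLParser):
    """Collects text inside <title>, <h1>, and the 'breadcrumb' editable region."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &lt;, &raquo;, ...
        self.parts = {"title": [], "h1": [], "breadcrumb": []}
        self._open = set()  # regions we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._open.add(tag)

    def handle_endtag(self, tag):
        self._open.discard(tag)

    def handle_comment(self, data):
        # Assumption: the breadcrumb is delimited by these template comments.
        text = data.strip()
        if text.startswith("InstanceBeginEditable") and 'name="breadcrumb"' in text:
            self._open.add("breadcrumb")
        elif text.startswith("InstanceEndEditable"):
            self._open.discard("breadcrumb")

    def handle_data(self, data):
        for region in self._open:
            self.parts[region].append(data)


def squeeze(chunks):
    """Join text chunks and collapse runs of whitespace."""
    return " ".join("".join(chunks).split())


def extract_page_info(page):
    parser = PageInfoParser()
    parser.feed(page)
    parser.close()
    title = squeeze(parser.parts["title"])
    breadcrumb = squeeze(parser.parts["breadcrumb"])
    return {
        # Text before the first "<" separator in the title.
        "title_first": title.split("<", 1)[0].strip(),
        # Text after the last "»" (U+00BB) separator in the breadcrumb.
        "breadcrumb_last": breadcrumb.rsplit("\u00bb", 1)[-1].strip(),
        "h1": squeeze(parser.parts["h1"]),
    }


print(extract_page_info(SAMPLE))
```

To process the whole site, read each .html or .shtml file and pass its text to extract_page_info.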


2

Since most HTML is basically XML (or can easily be trimmed to be compatible with most XML parsers), I would suggest using an XML parser. Most Python HTML-specific parsers are just subclasses of an XML parser anyway.

Check out: Python and XML.

Here is a good tutorial: Python XML Parser Tutorial.

Also, the xml.dom.minidom module has been super useful for me personally.

Another similar method is explained here: xml.etree.ElementTree.
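As a rough sketch of the ElementTree route: the h1 fragment below is a trimmed copy of the one in the question with the template comments removed, and heading_text and parses_as_xml are hypothetical helper names. The second call also shows where this approach stops working: &raquo; is an HTML entity that a plain XML parser does not know, so the breadcrumb is rejected unless the input is cleaned first.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's <h1>; it happens to be well-formed XML.
H1 = ('<h1><a name="top" id="top"></a>Mapping a Drive:'
      '<span class="goldletter"> Macintosh </span>OS X  </h1>')

# Shortened breadcrumb; &raquo; is not one of XML's predefined entities.
BREADCRUMB = '<p><a href="../index.html">eHelp</a> &raquo; Mac OS X</p>'


def heading_text(fragment):
    """Return all text inside the fragment with whitespace runs collapsed."""
    element = ET.fromstring(fragment)
    return " ".join("".join(element.itertext()).split())


def parses_as_xml(fragment):
    """Report whether a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


print(heading_text(H1))
print(parses_as_xml(BREADCRUMB))
```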

This is a good example from the xml.dom.minidom reference page:

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Concatenate the data of the text nodes in nodelist (direct children only).
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # One <p> per slide title serves as the table of contents.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)

If you absolutely must use regex instead of a parser, check out the re module:

In [1]: import re
In [2]: grps = re.search(r"<([^>]+)>([^<]+)</\1>", "<abc>123</abc>")
In [3]: if grps:
   ...:     print(grps.groups())
   ...:
('abc', '123')
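Applied to the title from the question, a regex is only tolerable because <title> holds plain text with no nested tags. A minimal sketch, where first_title_part is a hypothetical helper and the "<" separator between title parts is assumed from the sample page:

```python
import html
import re

# Non-greedy match so only the first <title>...</title> pair is captured.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def first_title_part(page):
    """Return the text before the first '<' separator in <title>, or None."""
    match = TITLE_RE.search(page)
    if match is None:
        return None
    # Decode entities such as &lt; before splitting on the separator.
    title = html.unescape(match.group(1))
    return title.split("<", 1)[0].strip()


sample = ("<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive "
          "&lt; eHelp &lt; Cal Poly Pomona</title>")
print(first_title_part(sample))
```

The same trick would not be safe for the h1 or the breadcrumb, which contain nested tags and comments.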

2 Comments

  • That does not apply to HTML found on the wild web, unfortunately.
  • Most pages these days are valid in the eyes of an XML parser. And if they aren't, you can easily subclass an XML parser, or "".replace() the parts that aren't (assuming what isn't valid is static).

0

html5lib is a very reliable HTML parser. Since your XHTML is somewhat broken, an XML parser will reject it. Fortunately, html5lib has lxml integration, so you can still use the full power of lxml and XPath to extract your data.
