Python, search for html tags inside a file using regex

Question

So I am doing some data analysis in which I am required to extract the page title, breadcrumb, h1 tags from hundreds of HTML and SHTML files.

Those tags are in the following format (meaning the content inside <title>, <h1>, and the breadcrumb):

<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive &lt; eHelp &lt; Cal Poly Pomona</title>

<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> &raquo; <a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X<!-- InstanceEndEditable --></p>


<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" --><a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X  <!-- InstanceEndEditable --></h1>

After getting those tags, I want to further extract the first part of the title ("Mapping a Drive: Macintosh OSX"), the last part of the breadcrumb ("Mac OS X"), and the whole h1 ("Mapping a Drive: Macintosh OSX").

Any idea how that can be accomplished?

Day by day, questions on parsing HTML with regex pop up. Read this if you haven't yet :-)
– sidyll, Commented Sep 13, 2011 at 21:14
@tchrist It is a simile for something left behind to tell you how you got there so that you don't get lost.
– chown, Commented Sep 13, 2011 at 21:35
@tchrist: it's a path from the site root that tells you how you get to the page you are viewing. Something like amazon > electronics > game console > PS3
– Tu Hoang, Commented Sep 13, 2011 at 21:40
@chown, not that we're in english.stackexchange or anything, but the source you linked to says a simile uses the word "like" or similar. Your description of breadcrumb doesn't use anything of the kind. It is not a simile.
– Ned Batchelder, Commented Sep 13, 2011 at 22:22

Ned Batchelder · Accepted Answer · 2011-09-13 21:14:26Z

6

Use a real HTML parser, not a regex. You will be happier. lxml.html is highly regarded, as is BeautifulSoup.
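To give a feel for the parser approach, here is a minimal sketch run against the three snippets from the question. Neither lxml nor BeautifulSoup ships with Python, so this swaps in the standard library's html.parser instead; the class name PageInfoParser and its text() method are made up for illustration, and the " < " and "»" separators are assumptions based on the sample markup only.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
PAGE = """\
<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive &lt; eHelp &lt; Cal Poly Pomona</title>
<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> &raquo; <a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X<!-- InstanceEndEditable --></p>
<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" --><a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X  <!-- InstanceEndEditable --></h1>
"""


class PageInfoParser(HTMLParser):
    """Hypothetical helper: collects text from <title>, <h1>, and the breadcrumb region."""

    def __init__(self):
        # convert_charrefs=True decodes &lt;, &raquo;, ... into plain characters.
        super().__init__(convert_charrefs=True)
        self.active = set()  # regions we are currently inside
        self.parts = {"title": [], "h1": [], "breadcrumb": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self.active.add(tag)

    def handle_endtag(self, tag):
        # Closing tags of nested elements (a, span) are simply not in the set.
        self.active.discard(tag)

    def handle_comment(self, data):
        # The breadcrumb is delimited by template comments, not by a tag.
        if 'InstanceBeginEditable name="breadcrumb"' in data:
            self.active.add("breadcrumb")
        elif "InstanceEndEditable" in data:
            self.active.discard("breadcrumb")

    def handle_data(self, data):
        for region in self.active:
            self.parts[region].append(data)

    def text(self, region):
        # Join the fragments and collapse runs of whitespace.
        return " ".join("".join(self.parts[region]).split())


parser = PageInfoParser()
parser.feed(PAGE)
parser.close()

title_first = parser.text("title").split(" < ")[0]
breadcrumb_last = parser.text("breadcrumb").split("\u00bb")[-1].strip()  # "\u00bb" is »
heading = parser.text("h1")
print(title_first, breadcrumb_last, heading, sep="\n")
```

With lxml.html or BeautifulSoup the bookkeeping above would mostly disappear, but the idea is the same: let the parser deal with nested tags, comments, and entities, then do plain string work on the extracted text.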


chown · Accepted Answer · 2011-09-13 21:31:20Z

Since most HTML is basically XML (or can easily be trimmed to be compatible with most XML parsers), I would suggest using an XML parser. Most Python HTML-specific parsers are just subclasses of an XML parser anyway.

Check out: Python and XML.

Here is a good tutorial: Python XML Parser Tutorial.

Also, the xml.dom.minidom module has been super useful for me personally.

Another similar method is explained here: xml.etree.ElementTree.
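As a rough illustration of the ElementTree route, here is a small sketch using the h1 from the question, which happens to be well-formed XML on its own. The helper name element_text is made up for this example, and it only works when the input really is valid XML.

```python
import xml.etree.ElementTree as ET

# The <h1> from the question, split across lines for readability.
H1 = (
    '<h1><a name="contentstart" id="contentstart"></a>'
    '<!-- InstanceBeginEditable name="page_heading" -->'
    '<a name="top" id="top"></a>Mapping a Drive:'
    '<span class="goldletter"> Macintosh </span>OS X  '
    '<!-- InstanceEndEditable --></h1>'
)


def element_text(element):
    """Hypothetical helper: all nested text of an element, whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


heading = element_text(ET.fromstring(H1))
print(heading)
```

itertext() walks into nested elements such as the span, and the default parser drops the template comments, so no extra filtering is needed here.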

This is a good example from the xml.dom.minidom reference page:

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Concatenate the direct text children only (nested elements are skipped).
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Table of contents: one paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)

If you absolutely must use regex instead of a parser, check out the re module:

In [1]: import re
In [2]: grps = re.search(r"<([^>]+)>([^<]+)</\1>", "<abc>123</abc>")
In [3]: if grps:
   ...:     print(grps.groups())
   ...:
('abc', '123')
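If regex really is the only option, something along these lines might cover the title case from the question. This is only a sketch: the function name first_title_part is made up, the "<" separator is an assumption taken from the sample title, and the pattern will misfire on things like a commented-out title or a "</title>" inside a script string.

```python
import html
import re

# Non-greedy match of whatever sits between <title ...> and </title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def first_title_part(page_source, separator="<"):
    """Hypothetical helper: return the title text before the first separator."""
    match = TITLE_RE.search(page_source)
    if match is None:
        return None
    title = html.unescape(match.group(1))  # turn &lt; back into "<"
    return title.split(separator)[0].strip()


sample = (
    "<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive "
    "&lt; eHelp &lt; Cal Poly Pomona</title>"
)
print(first_title_part(sample))
```

Unlike the pattern in the session above, this one tolerates attributes on the opening tag, but it still knows nothing about nesting, which is why the h1 and breadcrumb are better left to a parser.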

That does not apply to html found on the wild web, unfortunately.
Most pages these days are valid in the eyes of an XML parser. And if they aren't, you can easily subclass an XML parser, or "".replace() the parts that aren't (assuming what isn't valid is static).

Tobu · Accepted Answer · 2011-09-13 21:36:36Z

0

html5lib is a very reliable html parser. Since your xhtml is somewhat broken, an xml parser will reject it. Fortunately, html5lib has lxml integration, so you can still use the full power of lxml and xpath to extract your data.
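One plausible reason a strict XML parser chokes on the question's markup is the &raquo; entity in the breadcrumb, which XML does not define unless a DTD declares it. The quick check below uses the standard library's xml.etree.ElementTree rather than html5lib (which is a separate install); the helper name parses_as_xml is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Breadcrumb paragraph from the question, split across lines for readability.
BREADCRUMB = (
    '<p><!-- InstanceBeginEditable name="breadcrumb" -->'
    '<a href="../index.html">eHelp</a> &raquo; '
    '<a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X'
    '<!-- InstanceEndEditable --></p>'
)


def parses_as_xml(fragment):
    """Hypothetical helper: True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# The named entity is rejected; the equivalent numeric reference is accepted.
print(parses_as_xml(BREADCRUMB))
print(parses_as_xml(BREADCRUMB.replace("&raquo;", "&#187;")))
```

An HTML parser such as html5lib knows the HTML entity table, so it does not need this kind of patching before the data can be queried.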
