Parse specific links in HTML using HTMLParser in Python?

Question

I am trying to parse a particular set of links from an HTML file. Since I am using HTMLParser, I cannot access the HTML as a hierarchical tree, so I have not been able to extract the information.

My HTML is as follows:

<p class="mediatitle">
        <a class="bullet medialink" href="link/to/a/file">Some Content
        </a>
</p>

What I need is to extract every 'href' value from tags that also carry the attribute class="bullet medialink". In other words, I want only those hrefs that are present in a tag with the class 'bullet medialink'.

What I have tried so far:

from HTMLParser import HTMLParser
import urllib
# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if(tag == 'a'):
            for (key,value) in attrs:
                if(value == 'bullet medialink'):
                    print "attr:", key

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

Vincent Beltman · Accepted Answer · 2014-12-09 09:40:12Z

1

I would use bs4 (Beautiful Soup 4) for this. bs4 is a third-party HTML parser. Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

import urllib
from bs4 import BeautifulSoup

f = urllib.urlopen("sample.html")
html = f.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
for atag in soup.select('.bullet.medialink'):  # Just enter a CSS selector here
    print atag['href']  # You can also get an attribute with atag.get('href')

Or shorter:

import urllib
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen("sample.html").read(), "html.parser")
for atag in soup.select('.bullet.medialink'):
    print atag['href']


2 Comments

stochastic_zeitgeist Over a year ago

Thanks a lot, but since I am writing a script that will be used by a lot of people, I want to keep it simple by using only Python's built-in parsing features.

Vincent Beltman Over a year ago

Hmm, maybe you could take a look at the built-in etree library: docs.python.org/2/library/xml.etree.elementtree.html. It's not the best, but always better than HTMLParser. And if you change your mind, you could always use lxml (lxml.de). It follows the etree API, but is nicer to work with.
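
For reference, the etree suggestion above might look roughly like the sketch below. It assumes the markup is well-formed XML, which happens to be true for the snippet in the question; ElementTree raises a ParseError on markup that is not well-formed, so this will not work on arbitrary HTML pages. The variable names (snippet, links) are only illustrative.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it is well-formed, so ElementTree can parse it.
snippet = '''
<p class="mediatitle">
        <a class="bullet medialink" href="link/to/a/file">Some Content
        </a>
</p>
'''.strip()

root = ET.fromstring(snippet)

# Walk every <a> element under the root and keep the href of those
# whose class attribute is exactly "bullet medialink".
links = [
    a.get("href")
    for a in root.iter("a")
    if a.get("class") == "bullet medialink"
]
print(links)
```

Here the tree structure is available, so the class check and the href lookup happen on the same element rather than inside a loop over attribute pairs.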

stochastic_zeitgeist · Accepted Answer · 2014-12-09 09:07:50Z

So I finally did it with a simple boolean flag, owing to the fact that HTMLParser isn't a hierarchical parser.

Here's the code:

from HTMLParser import HTMLParser
import urllib

# Create a subclass and override the handler methods.
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            flag = 0
            # Note: this relies on class appearing before href in the tag.
            for (key, value) in attrs:
                if value == 'bullet medialink' and key == 'class':
                    flag = 1
                if key == 'href' and flag == 1:
                    print "link : ", value
                    flag = 0

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

Hope someone comes up with a more elegant solution.
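
One possible tidier variant, still using only the standard library, is to turn the attrs list into a dict so the check no longer depends on the order of the attributes. This is only a sketch under a few assumptions: it is written for Python 3, where the module is html.parser (on Python 2 the import would stay `from HTMLParser import HTMLParser`), the class name MediaLinkParser and the `wanted` parameter are made up for illustration, and the extra `<a>` tags in the sample are added to show the matching behaviour.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class MediaLinkParser(HTMLParser):  # hypothetical name, not from the question
    """Collect href values of <a> tags that carry all the wanted classes."""

    def __init__(self, wanted=("bullet", "medialink")):
        super().__init__()
        self.wanted = set(wanted)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attribute order no longer matters
        # A valueless attribute is reported as None, hence the `or ""`.
        classes = set((attr_map.get("class") or "").split())
        href = attr_map.get("href")
        if href is not None and self.wanted <= classes:
            self.links.append(href)


# First tag is the one from the question; the other two are illustrative.
sample = '''
<p class="mediatitle">
        <a class="bullet medialink" href="link/to/a/file">Some Content
        </a>
</p>
<a href="other/file" class="medialink bullet">Reordered</a>
<a class="bullet" href="skip/me">No medialink class</a>
'''

parser = MediaLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)  # ['link/to/a/file', 'other/file']
```

Collecting the links in a list instead of printing them inside the handler also makes the result easier to reuse from a larger script.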
