
I am trying to extract a particular set of links from an HTML file. Because HTMLParser does not expose the document as a hierarchical tree, I cannot work out how to get at the information I need.

My HTML is as follows:

<p class="mediatitle">
        <a class="bullet medialink" href="link/to/a/file">Some Content
        </a>
</p>

What I need is the value of every 'href' attribute that appears in a tag with class="bullet medialink". In other words, I want only those hrefs that belong to a tag of class 'bullet medialink'.

What I have tried so far:

from HTMLParser import HTMLParser
import urllib
# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if value == 'bullet medialink':
                    print "attr:", key

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

2 Answers

1

I would use bs4 (Beautiful Soup 4) for this. It is a third-party HTML parsing library. Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

import urllib
from bs4 import BeautifulSoup

f = urllib.urlopen("sample.html")
html = f.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
for atag in soup.select('.bullet.medialink'):  # any CSS selector works here
    print atag['href']  # atag.get('href') also works and returns None if the attribute is missing

Or shorter:

import urllib
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen("sample.html").read(), "html.parser")
for atag in soup.select('.bullet.medialink'):
    print atag['href']

2 Comments

Thanks a lot. But since I am writing a script that will be used by a lot of people, I want to keep it simple and use only Python's built-in parsing features.
Hmm, maybe you could take a look at the built-in etree library: docs.python.org/2/library/xml.etree.elementtree.html. It's not the best, but still better than HTMLParser. And if you change your mind, you could always use lxml (lxml.de); it follows the etree API but is nicer to work with.
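
To illustrate the etree suggestion in the comment above, here is a minimal sketch using only the standard library. It is written for Python 3, and it assumes the input is well-formed XML, which happens to be true for the snippet in the question but is not true for HTML in general; ElementTree raises a ParseError on unclosed tags. The names SAMPLE and media_links are made up for this example.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it happens to be well-formed XML.
SAMPLE = """
<p class="mediatitle">
        <a class="bullet medialink" href="link/to/a/file">Some Content
        </a>
</p>
"""

def media_links(markup):
    """Return the href of every <a class="bullet medialink"> below the root."""
    root = ET.fromstring(markup)
    # ElementTree's limited XPath supports exact attribute-value matches.
    anchors = root.findall(".//a[@class='bullet medialink']")
    # Skip anchors that have the class but no href attribute.
    return [a.get("href") for a in anchors if a.get("href") is not None]

print(media_links(SAMPLE))
```

The attribute predicate is an exact string comparison, so class="medialink bullet" would not match; for real-world HTML a tolerant parser is the safer choice.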
0

I finally did it with a simple boolean flag, since HTMLParser is not a hierarchical parser.

Here's the code

from HTMLParser import HTMLParser
import urllib
# create a subclass and override the handler methods
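class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # flag is set once class="bullet medialink" has been seen on this tag
            flag = 0
            for (key, value) in attrs:
                if value == 'bullet medialink' and key == 'class':
                    flag = 1
                # note: this only fires when class appears before href in the tag
                if key == 'href' and flag == 1:
                    print "link : ", value
                    flag = 0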

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()

Hope someone comes up with a more elegant solution.
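
One possible way to drop the flag, sketched here for Python 3 (where the module is html.parser rather than HTMLParser): turn the attrs list into a dict so the order of class and href in the tag no longer matters. The class name MediaLinkParser and the links attribute are made up for this example, and the class check is done per token, which is slightly looser than the exact string comparison above.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

class MediaLinkParser(HTMLParser):
    """Collect href values from <a> tags carrying both target classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; a dict makes lookups order-independent.
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "bullet" in classes and "medialink" in classes and href:
            self.links.append(href)

parser = MediaLinkParser()
# href deliberately comes before class here to show that order does not matter.
parser.feed('<p class="mediatitle"><a href="link/to/a/file" class="bullet medialink">Some Content</a></p>')
parser.close()
print(parser.links)
```

Collecting the links in a list instead of printing them inside the handler also makes the parser easier to reuse from other code.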
