I have the following problem:

I would like to parse HTML files and extract the links they contain. I can get the links with the following code:

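try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MyHTMLParser(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        self.url = url
        # Per-instance list; a class-level list would be shared by all parsers.
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>.
            # Only absolute http: links are collected (https: links are skipped).
            if name == 'href' and value and value.startswith("http:"):
                self.links.append(value)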

But I don't want to get audio files, video files, etc. I only want links to HTML pages. How can I do that?

1 Answer

I can check the link's ending and, if it is a particular format, avoid appending that link to the list. Is there another way?
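For reference, the extension check mentioned in the question might look roughly like the sketch below. It is written for Python 3, and the helper name `looks_like_html` is hypothetical. It relies on the standard `mimetypes` table, so the result is only a guess based on the URL path and says nothing about what the server actually returns.

```python
import mimetypes
from urllib.parse import urlparse

def looks_like_html(url):
    """Guess from the URL path alone whether a link points to an HTML page."""
    # Use only the path so query strings and fragments do not hide the extension.
    path = urlparse(url).path
    guessed_type, _ = mimetypes.guess_type(path)
    # No recognizable extension (e.g. "/questions/123/title") is treated as a page.
    if guessed_type is None:
        return True
    return guessed_type in ("text/html", "application/xhtml+xml")
```

A parser could call `looks_like_html(value)` before appending a link. Extensions can be missing or misleading, which is why checking the response header is the more reliable test.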

You could look at the 'Content-Type' header:

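import urllib2  # Python 2 only; Python 3 moved this to urllib.request

url = 'https://stackoverflow.com/questions/13431060/python-html-parsing'
req = urllib2.Request(url)
# Override get_method so that a HEAD request (headers only, no body) is sent.
req.get_method = lambda: 'HEAD'
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)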

yields

text/html; charset=utf-8

Many thanks to @JonClements for req.get_method = lambda : 'HEAD'. More info on this and alternate methods for sending a HEAD request can be found here.
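On Python 3 the same idea can be sketched without overriding `get_method`, since `urllib.request.Request` accepts a `method` argument. The function names `is_html_content_type` and `fetch_content_type` below are made up for illustration, and the 10-second timeout is an arbitrary assumption.

```python
from urllib.request import Request, urlopen

def is_html_content_type(header_value):
    """Return True if a Content-Type header value names an HTML media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")

def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header (or None)."""
    req = Request(url, method="HEAD")  # method= is available from Python 3.3
    with urlopen(req, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

A link would then be kept only when `is_html_content_type(fetch_content_type(link))` is true. Some servers reject or mishandle HEAD requests, so a real crawler would probably want to catch the resulting errors and decide how to treat those links.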

4 Comments

Instead of using Range - I'd probably go for request = urllib2.Request(someurl); request.get_method = lambda : 'HEAD'; response = urllib2.urlopen(request) and continue from there...
@JonClements: Thank you very much for the info. I didn't know you could do that.
@JonClements: What does it mean for req.get_method() to return HEAD? The docs seem to say it always returns GET or POST...?
If a payload is present in the request, then get_method returns POST; otherwise it returns GET. Replacing the method is a very kludgy way of writing requests.head(url)...
