Parsing HTML tags with Python

Question

I have been given a URL and I want to extract the contents of the <BODY> tag from the page at that URL. I'm using Python 3. I came across sgmllib, but it is not available for Python 3.

Can someone please guide me with this? Can I use HTMLParser for this?

Here is what I tried:

import urllib.request
f = urllib.request.urlopen("URL")
s = f.read()

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered   some data:", data)

parser = MyHTMLParser()
parser.feed(s)

This gives me the error: TypeError: Can't convert 'bytes' object to str implicitly

"please guide me": Will do. Search. It's been asked. Many, many times. After you do the search (in the upper right corner), feel free to ask specific questions based on the answers already given.
– S.Lott, Commented Feb 1, 2012 at 20:11
@ghbhatt: show us an example of what you need. Otherwise, see my answer; is this what you are asking?
– RanRag, Commented Feb 1, 2012 at 20:16

pycoder112358 · Accepted Answer · 2012-02-01 20:51:47Z

10

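To fix the TypeError, change line #3 to

s = str(f.read(), "utf-8")

The web page is returned as bytes, and the parser expects a string, so the bytes have to be decoded before you feed them to the parser. Pass the encoding explicitly: str(f.read()) with no encoding returns the repr of the bytes object (text starting with b') rather than the decoded page. "utf-8" is an assumption here; use whatever encoding the page actually declares.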
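
A small sketch of the difference between the ways of turning bytes into a string. The bytes literal stands in for the result of f.read(), and its markup is made up for illustration; UTF-8 is assumed as the encoding.

```python
# Stand-in for f.read(); the markup here is invented for illustration.
raw = b"<body><p>caf\xc3\xa9</p></body>"

as_repr = str(raw)             # repr of the bytes object, not decoded text
as_text = raw.decode("utf-8")  # decoded text, assuming the page is UTF-8
same_text = str(raw, "utf-8")  # equivalent to raw.decode("utf-8")

print(as_repr)
print(as_text)
```

Only the decoded forms are suitable for parser.feed(); the repr form would make the parser see a leading b' and escaped byte sequences as page data.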

answered Feb 1, 2012 at 20:51


1 Comment

Lennart Regebro Over a year ago

You should find the encoding from the HTTP headers so you know what encoding to use.
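
One way this could look, as a hedged sketch: the headers attribute of the response returned by urllib.request.urlopen() is an email.message.Message subclass, so its get_content_charset() method can read the charset from the Content-Type header. The helper name decode_body and the UTF-8 fallback are assumptions made for this example, and the headers are built by hand so that it runs without network access.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset from a Content-Type header.

    `headers` is an email.message.Message (the type that the response's
    `headers` attribute derives from). `fallback` is an assumption for
    pages that declare no charset.
    """
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Offline demonstration with hand-built headers (no network access needed).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
text = decode_body(b"<body>caf\xe9</body>", headers)
print(text)
```

With the code from the question, the call would be s = decode_body(f.read(), f.headers).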

RanRag · Accepted Answer · 2012-02-01 20:55:44Z

4

If you take a look at your s variable, its type is bytes.

>>> type(s)
<class 'bytes'>

and if you take a look at HTMLParser.feed, it requires a string (str) as its argument. So do:

>>> x = s.decode('utf-8')
>>> type(x)
<class 'str'>
>>> parser.feed(x)

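or do x = str(s, 'utf-8'). Plain str(s) without an encoding would not help: it returns the repr of the bytes object (b'...') rather than the decoded text.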
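
Once the page is a str, a parser along the following lines could pull out what sits inside the body. This is a minimal sketch, not the only way to do it: the class name BodyTextParser and the sample page are made up for illustration, and it collects only the text between <body> and </body>, not the nested markup.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect the text found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <BODY> matches too.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep non-blank text, but only while inside the body element.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


# Invented sample page; in the question this would be the decoded response.
page = "<html><head><title>T</title></head><BODY><h1>Hi</h1><p>Some text</p></BODY></html>"
parser = BodyTextParser()
parser.feed(page)
parser.close()
print(parser.chunks)
```

The title text is skipped because it appears before the body starts.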

edited Feb 1, 2012 at 20:55

answered Feb 1, 2012 at 20:16


2 Comments

pycoder112358 Over a year ago

It seems that we gave the same answer within a minute of each other.

Lennart Regebro Over a year ago

You should find the encoding from the HTTP headers so you know what encoding to use.
