I'm trying to obtain data using regular expressions from a html file, by implementing the following code:
import urllib.request
def extract_words(wdict, urlname):
uf = urllib.request.urlopen(urlname)
text = uf.read()
print (text)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
which returns an error:
File "extract.py", line 33, in extract_words
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Upon experimenting further in the IDLE, I noticed that the uf.read() indeed returns the html source code the first time I invoke it. But then onwards, it returns a - b''. Is there any way to get around this?