I am trying to pull a URL out of an HTML page, but my regex does not seem to work. Can anyone spot the issue? When I run it against only a slice of the HTML from my website it works fine (I have commented out that part of the code).
I do know about Scrapy, BeautifulSoup, etc., but due to restrictions I don't want to use such modules.

    import re
    import urllib

    page = "ANY-XYZ-WEBSITE"

    def extract_first_link():
        urlopener = urllib.urlopen(page)
        html = str(urlopener.read())
        matchObj = re.match('<a href="(.*)/([0-9a-zA-Z-]+)"', html, re.I)
        #k = open("file.txt", 'w')
        #k.write(html)
        #print "matchObj.group() : ", matchObj.group(1)
        #matchObj = re.match('<a href="(.*)/([0-9a-zA-Z-]+)"', html[4111:4150], re.M|re.I)
        print "matchObj.group() : ", matchObj.group()
        print "matchObj.group() : ", matchObj.group(1)
        print "matchObj.group() : ", matchObj.group(2)

    if __name__ == "__main__":
        print extract_first_link()

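Use re.search instead of re.match. re.match will only look at the beginning of the string, which is why it works on a slice that starts exactly at the tag but not on the whole page. You may also want to replace (.*) with ([^/]*), since the greedy (.*) can run past the end of the link you want.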
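
To make the difference concrete, here is a minimal sketch run against a made-up HTML string rather than a live page. The sample html, the example.com URLs, and the helper name first_link are all assumptions for illustration. The sketch uses [^"]* for the first group instead of [^/]*, which is my own substitution: [^/]* cannot match an absolute URL such as http://example.com/slug because of the double slash after the scheme. The code is written so that it should run on both Python 2 and Python 3.

```python
import re

# Hypothetical sample page; in the question this text comes from urllib.
html = ('<html><body><p>intro</p>'
        '<a href="http://example.com/first-post">one</a> '
        '<a href="http://example.com/second-post">two</a>'
        '</body></html>')

# Original pattern from the question (greedy first group).
greedy_pattern = r'<a href="(.*)/([0-9a-zA-Z-]+)"'

# Assumed variant: the first group stops at the closing quote of the href.
bounded_pattern = r'<a href="([^"]*)/([0-9a-zA-Z-]+)"'

# re.match only tries position 0, and the page starts with <html>, so no match.
anchored = re.match(bounded_pattern, html, re.I)

# re.search scans forward and stops at the first place the pattern matches.
found = re.search(bounded_pattern, html, re.I)

# With the greedy group, the match starts at the first link but the
# second group ends up capturing the slug of the last link on the line.
greedy = re.search(greedy_pattern, html, re.I)


def first_link(text):
    """Return the first href that matches bounded_pattern, or None."""
    m = re.search(bounded_pattern, text, re.I)
    if m is None:
        return None
    return m.group(1) + "/" + m.group(2)
```

Checking the result for None before calling group() also avoids the AttributeError that the code in the question raises whenever nothing matches.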