I am writing a simple Python program that fetches a web page and finds all the URL links in it. I try to find the index of the opening and closing delimiter (") of each href value, but the index of the closing one always comes out wrong.

# open a url and find all the links in it
import urllib2

url=urllib2.urlopen('right.html')
urlinfo = url.info()
urlcontent = url.read()
bodystart = urlcontent.index('<body')
print 'body starts at',bodystart
bodycontent = urlcontent[bodystart:].lower()
print bodycontent

linklist = []
n = bodycontent.index('<a href=')
while n:
    print n
    bodycontent = bodycontent[n:]
    a = bodycontent.index('"')
    b = bodycontent[(a+1):].index('"')
    print a, b
    linklist.append(bodycontent[(a+1):b])
    n = bodycontent[b:].index('<a href=')

print linklist

1 Answer

I would suggest using an HTML parsing library instead of manually searching the raw HTML string.
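As for why the closing index comes out wrong in the question: `bodycontent[(a+1):].index('"')` returns a position relative to the slice, not to `bodycontent`, so `b` is too small by `a + 1`. The same applies to `n` on the last line of the loop. In addition, `index` raises `ValueError` when nothing is found instead of returning a falsy value, so `while n:` cannot end cleanly. A minimal sketch of the manual approach with absolute offsets might look like the following; the helper name `find_links` and the sample HTML are made up for illustration, and it is written so that it also runs on Python 3:

```python
def find_links(html):
    """Collect double-quoted href values by scanning the string manually."""
    marker = '<a href="'
    links = []
    pos = html.find(marker)  # find() returns -1 when there is no match
    while pos != -1:
        start = pos + len(marker)      # first character of the URL
        end = html.find('"', start)    # absolute index of the closing quote
        if end == -1:                  # unterminated attribute: stop scanning
            break
        links.append(html[start:end])
        pos = html.find(marker, end)   # continue searching after this link
    return links


# Hypothetical sample input, not taken from the question's page
sample = '<body><a href="a.html">A</a> <a href="http://example.com/b">B</a></body>'
print(find_links(sample))
```

Passing a start position to `find` keeps every index absolute, so no offsets have to be added back by hand. This still only matches the exact form `<a href="...">`; single quotes, extra whitespace, or attributes in a different order are missed, which is why a parser is the more robust choice.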

Beautiful Soup is an excellent library for parsing HTML; see its documentation for reference.

With Beautiful Soup, your link extraction could look like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(bodycontent, 'html.parser')
linklist = [a.get('href') for a in soup.find_all('a')]
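If installing a third-party package is not an option, the standard library's `html.parser` module can do a similar job. The following is a sketch for Python 3 (in Python 2 the module is named `HTMLParser`); the class name `LinkCollector` and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gather the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased regardless of how the page wrote them
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


# Hypothetical sample input mixing quote styles, case, and a link-less anchor
sample = "<body><a class='x' href='a.html'>A</a><A HREF=\"b.html\">B</A><a name='top'>C</a></body>"
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

Unlike the list comprehension above, this skips anchors that have no `href`, whereas `a.get('href')` yields `None` for them. With Beautiful Soup the same filtering should be possible via `soup.find_all('a', href=True)`.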