Python index function

Question

I am writing a simple Python program that fetches a webpage and finds all the URL links in it. I try to find the index of the starting and ending delimiter (") of each href link, but the ending one is always indexed incorrectly.

# open a url and find all the links in it
import urllib2

url=urllib2.urlopen('right.html')
urlinfo = url.info()
urlcontent = url.read()
bodystart = urlcontent.index('<body')
print 'body starts at',bodystart
bodycontent = urlcontent[bodystart:].lower()
print bodycontent

linklist = []
n = bodycontent.index('<a href=')
while n:
    print n
    bodycontent = bodycontent[n:]
    a = bodycontent.index('"')
    b = bodycontent[(a+1):].index('"')
    print a, b
    linklist.append(bodycontent[(a+1):b])
    n = bodycontent[b:].index('<a href=')

print linklist

Ultcyber · Accepted Answer · 2016-09-07 08:41:31Z

3

I would suggest using an HTML parsing library instead of searching the HTML string manually.

Beautiful Soup is an excellent library for this purpose; see its documentation for reference.

With Beautiful Soup, your link search could look like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(bodycontent, 'html.parser')
linklist = [a.get('href') for a in soup.find_all('a')]
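
As for why the ending index comes out wrong in the question's code: `bodycontent[(a+1):].index('"')` returns a position relative to the slice, not to `bodycontent`, so `b` falls short by `a+1`. Also, `index` raises `ValueError` when nothing is found, so `while n:` never ends cleanly. If you do want to search manually, here is a minimal sketch of one way to do it, assuming Python 3 and links written exactly as `<a href="...">`. The function name `find_links` and the sample HTML are made up for illustration. It passes a start offset to `str.find`, so every index stays relative to the full string:

```python
def find_links(html):
    """Collect href values by manual string searching (illustrative only)."""
    marker = '<a href="'
    links = []
    pos = html.find(marker)          # find() returns -1 when nothing is found
    while pos != -1:
        start = pos + len(marker)    # first character after the opening quote
        end = html.find('"', start)  # search from `start`; result indexes into `html`
        if end == -1:                # unterminated attribute: stop searching
            break
        links.append(html[start:end])
        pos = html.find(marker, end)  # continue after the closing quote
    return links


# Hypothetical sample input, not taken from the question's page
sample = '<body><a href="a.html">A</a> <a href="http://example.com/b">B</a></body>'
print(find_links(sample))
```

This still breaks on single-quoted attributes, extra whitespace, or other attributes before `href`, which is why a real parser is the safer choice.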

edited Sep 7, 2016 at 8:41

answered Sep 7, 2016 at 8:28

