
Hi, I want to scrape data from multiple URLs. I am doing this:

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

but it is not giving me complete data; it is printing only the last URL's data.

Here is my code, please help:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2
import operator


for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    page1_soup = soup(page1_html, 'html.parser')

    # grabbing data
    containers = page1_soup.findAll('div', {'class': 'PA15'})

    # Make the connection to PostgreSQL
    conn = psycopg2.connect(database='--',user='--', password='--', port=--)
    cursor = conn.cursor()
    for container in containers:
        toll_name1 = container.p.b.text
        toll_name = toll_name1.split(" ")[1]

        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]

        text = search1[1].get_text()
        onset = text.index('in')
        offset = text.index('Stretch')
        state = str(text[onset +2:offset]).strip(' ')

        location = list(container.p.descendants)[10]
        mystr = my_url[my_url.find('?'):]
        TID = mystr.strip('?TollPlazaID=')

        query = "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) VALUES (%s, %s, %s, %s, %s);"
        data = (TID, toll_name, location, highway_number, state)

        cursor.execute(query, data)

# Commit the transaction
conn.commit()

But it's displaying only the second-to-last URL's data.

  • your "format" statement generates only one url... Commented Sep 4, 2017 at 10:42
  • but i have so many other url's also , ex- http://tis.nhai.gov.in/TollInformation?TollPlazaID=203 http://tis.nhai.gov.in/TollInformation?TollPlazaID=258 ,, then how i have to do ? Commented Sep 4, 2017 at 10:44
  • 1
    I suppose sth like: my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i) Commented Sep 4, 2017 at 10:46
  • still its throwing error ` tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:] IndexError: list index out of range` Commented Sep 4, 2017 at 10:54
  • 1
    check the error it says that you are trying to access an item that doesn't exist Commented Sep 4, 2017 at 10:59

1 Answer


It seems like some of the pages are missing your key information; you can use error catching for it, like this:

try:
    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no table rows were scraped

You may want to add some logging/print statements to keep track of nonexistent tables.
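The same skip-and-continue pattern with logging might look like this; the page data below is fabricated purely to illustrate the control flow, not taken from the real site:

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(message)s')

# Hypothetical scraped results: page 202 came back without any table rows.
pages = {201: ['row-a', 'row-b'], 202: [], 203: ['row-c']}

kept = []
for plaza_id, rows in pages.items():
    try:
        first = rows[0]  # raises IndexError when the page had no table
    except IndexError:
        logging.info('TollPlazaID=%s: no table found, skipping', plaza_id)
        continue
    kept.append(plaza_id)

print(kept)  # [201, 203]
```

The `continue` moves straight to the next page, and the log line records which IDs were skipped so you can inspect them later.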

EDIT: It's showing information from only the last page because you are committing your transaction outside the for loop, and overwriting conn on every iteration of i. Just put conn.commit() inside the for loop, at the very end of it.
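A minimal sketch of that structure, using the stdlib's sqlite3 as a stand-in for psycopg2 so it runs anywhere (the table schema and loop range are simplified placeholders, not the real ones):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # open the connection ONCE, before the loop
cursor = conn.cursor()
cursor.execute('CREATE TABLE tollmaster (tid TEXT, toll_name TEXT)')

for i in range(3):  # stands in for range(493)
    # ... scraping for TollPlazaID=i would happen here ...
    cursor.execute('INSERT INTO tollmaster VALUES (?, ?)',
                   (str(i), 'toll_{}'.format(i)))
    conn.commit()  # commit inside the loop, so every iteration is persisted

cursor.execute('SELECT COUNT(*) FROM tollmaster')
count = cursor.fetchone()[0]
print(count)
conn.close()
```

Because the connection is created once and each iteration commits its own inserts, rows from every page survive instead of only the last one.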


1 Comment

Hey, sorry, that was the wrong code. Please have a look at my updated code; only the last URL's data is being inserted into the table.
