
Hi, I want to scrape data from multiple URLs. I am doing this:

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

but it is not giving me complete data; it is printing only the last URL's data.

Here is my code, please help:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2
import operator


for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    page1_soup = soup(page1_html, 'html.parser')

    # grabbing data
    containers = page1_soup.findAll('div', {'class': 'PA15'})

    # Make the connection to PostgreSQL
    conn = psycopg2.connect(database='--',user='--', password='--', port=--)
    cursor = conn.cursor()
    for container in containers:
        toll_name1 = container.p.b.text
        toll_name = toll_name1.split(" ")[1]

        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]

        text = search1[1].get_text()
        onset = text.index('in')
        offset = text.index('Stretch')
        state = str(text[onset +2:offset]).strip(' ')

        location = list(container.p.descendants)[10]
        mystr = my_url[my_url.find('?'):]
        TID = mystr.strip('?TollPlazaID=')

        query = "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) VALUES (%s, %s, %s, %s, %s);"
        data = (TID, toll_name, location, highway_number, state)

        cursor.execute(query, data)

# Commit the transaction
conn.commit()

But it's displaying only the second-to-last URL's data.

  • your "format" statement generates only one url... Commented Sep 4, 2017 at 10:42
  • but i have so many other url's also , ex- http://tis.nhai.gov.in/TollInformation?TollPlazaID=203 http://tis.nhai.gov.in/TollInformation?TollPlazaID=258 ,, then how i have to do ? Commented Sep 4, 2017 at 10:44
  • 1
    I suppose sth like: my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i) Commented Sep 4, 2017 at 10:46
  • still its throwing error ` tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:] IndexError: list index out of range` Commented Sep 4, 2017 at 10:54
  • 1
    check the error it says that you are trying to access an item that doesn't exist Commented Sep 4, 2017 at 10:59

1 Answer


It seems like some of the pages are missing your key information; you can use error catching for it, like this:

try:
    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no table rows were scraped

You may want to add some logging/print statements to keep track of nonexistent tables.
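The same skip-and-continue pattern with logging might look like this; the page data below is fabricated purely to illustrate the control flow, not taken from the real site:

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(message)s')

# Hypothetical scraped results: page 202 came back without any table rows.
pages = {201: ['row-a', 'row-b'], 202: [], 203: ['row-c']}

kept = []
for plaza_id, rows in pages.items():
    try:
        first = rows[0]  # raises IndexError when the page had no table
    except IndexError:
        logging.info('TollPlazaID=%s: no table found, skipping', plaza_id)
        continue
    kept.append(plaza_id)

print(kept)  # [201, 203]
```

The `continue` moves straight to the next page, and the log line records which IDs were skipped so you can inspect them later.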

EDIT: It's showing information from only the last page because you are committing your transaction outside the for loop, and overwriting conn on every iteration of i. Just put conn.commit() inside the for loop, at the very end of it.
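A minimal sketch of that structure, using the stdlib's sqlite3 as a stand-in for psycopg2 so it runs anywhere (the table schema and loop range are simplified placeholders, not the real ones):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # open the connection ONCE, before the loop
cursor = conn.cursor()
cursor.execute('CREATE TABLE tollmaster (tid TEXT, toll_name TEXT)')

for i in range(3):  # stands in for range(493)
    # ... scraping for TollPlazaID=i would happen here ...
    cursor.execute('INSERT INTO tollmaster VALUES (?, ?)',
                   (str(i), 'toll_{}'.format(i)))
    conn.commit()  # commit inside the loop, so every iteration is persisted

cursor.execute('SELECT COUNT(*) FROM tollmaster')
count = cursor.fetchone()[0]
print(count)
conn.close()
```

Because the connection is created once and each iteration commits its own inserts, rows from every page survive instead of only the last one.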


1 Comment

Hey, sorry, that was the wrong code. Please have a look at my updated code; only the last URL's data is being inserted into the table.
