Get data from multiple URLs using Python

Question

I'm trying to do the following:

1. Go to a web page and enter a search term.
2. Get some data from the results.
3. The results in turn contain multiple URLs. I need to parse each one of them to get some data out of them.

I can do 1 and 2. I do not understand how I can go to all the URLs and get data (which is similar in all the URLs, but not the same) from them.

EDIT: More information: I read the search terms from a CSV file and get a few IDs (with URLs) from each results page. I'd like to visit all of these URLs to get more IDs from the pages they lead to, and write everything to a CSV file. Basically, I want my output to look something like this:

Level1 ID1   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID
Level1 ID2   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID

There can be multiple Level2 IDs for each Level1 ID. But there will be only one corresponding Level3 ID for each Level2 ID.

CODE that I've written so far:

import re
import pandas as pd
from bs4 import BeautifulSoup
from urllib import urlopen

colnames = ['A','B','C','D']
data = pd.read_csv('file.csv', names=colnames)
listofdata= list(data.A)
id = '\n'.join(listofdata[1:]) #to skip header


def download_gsm_number(gse_id):
    url = "http://www.example.com" + id
    readurl = urlopen(url)
    soup = BeautifulSoup(readurl)
    soup1 = str(soup)
    gsm_data = readurl.read()
    #url_file_handle.close()
    pattern=re.compile(r'''some(.*?)pattern''')  
    data = pattern.findall(soup1)
    col_width = max(len(word) for row in data for word in row)
    for row in data:
        lines = "".join(row.ljust(col_width))
        sequence = ''.join([c for c in lines])
        print sequence

But this is taking all the ids at once into the URL. As I mentioned before, I need to get level2 ids from the level1 ids given as input. Further, from level2 ids, I need level3 ids. Basically, if I get just one part (getting either level2 or level3 ids) from it, I can figure out the rest.

Explore Scrapy. If you think it is overkill, explore BeautifulSoup.
– shaktimaan, Commented Aug 7, 2014 at 23:14

clifgray · Accepted Answer · 2014-08-08 00:01:28Z

I believe your answer is urllib.

It is actually as easy as this (Python 2; on Python 3 the same function lives at urllib.request.urlopen):

web_page = urllib.urlopen(url_string)

And then with that you can call the usual file-like methods, plus a few response-specific ones (info(), getcode(), geturl()):

read()
readline()
readlines()
fileno()
close()
info()
getcode()
geturl()
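
Applied to the problem in the question (all the IDs ending up in a single URL), a minimal sketch might look like the following. It assumes Python 3, where urlopen is imported from urllib.request. The base URL is the question's placeholder with a trailing slash added, and the helper names build_url, fetch_html and fetch_all are made up for illustration. The point is only that each ID gets its own URL and its own request, instead of joining the whole list into one string.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Placeholder taken from the question, not a real endpoint (assumption).
BASE_URL = "http://www.example.com/"


def build_url(item_id, base_url=BASE_URL):
    """Return the URL for a single ID (never several IDs joined together)."""
    return base_url + quote(item_id.strip())


def fetch_html(url):
    """Download one page and return its body as text."""
    web_page = urlopen(url)
    try:
        # Decoding as UTF-8 is an assumption; adjust to the site's encoding.
        return web_page.read().decode("utf-8", errors="replace")
    finally:
        web_page.close()


def fetch_all(ids, fetch=fetch_html):
    """Fetch one page per ID and return a dict mapping ID -> HTML."""
    pages = {}
    for item_id in ids:
        pages[item_id] = fetch(build_url(item_id))
    return pages
```

Passing the fetch function as a parameter is just a convenience: it lets you try the loop with a stub before pointing it at the real site.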

From there I would suggest using BeautifulSoup to parse the page, which is as easy as:

soup = BeautifulSoup(web_page.read())

And then you can do all the wonderful BeautifulSoup operations on it.
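
For the "multiple URLs inside the page" part, one of those operations is find_all. The sketch below assumes Python 3 with the bs4 package installed, and assumes the IDs you want appear as the text of ordinary <a href="..."> links, which may not match the real page. extract_links is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, page_url):
    """Return (link text, absolute URL) pairs for every <a> tag with an href."""
    # "html.parser" is the parser bundled with Python, so no extra install.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # urljoin turns relative hrefs into absolute URLs based on the page URL.
        absolute_url = urljoin(page_url, anchor["href"])
        links.append((anchor.get_text(strip=True), absolute_url))
    return links
```

Each URL returned here can then be opened and parsed in exactly the same way to reach the next level.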

I would imagine Scrapy is overkill here, and there is a lot more overhead involved. BeautifulSoup has great documentation and examples, and is just plain easy to use.
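
To get the Level1 / Level2 / Level3 layout from the question into a CSV file, a rough sketch (Python 3, standard library only) could look like this. The two lookup functions passed in are hypothetical stand-ins for whatever fetch-and-parse code you end up with: get_level2_ids returns a list of Level2 IDs for one Level1 ID, and get_level3_id returns the single Level3 ID for one Level2 ID. The column names are also assumptions.

```python
import csv
import io


def build_rows(level1_ids, get_level2_ids, get_level3_id):
    """Build [Level1, Level2, Level3] rows, one row per Level2 ID."""
    rows = []
    for level1_id in level1_ids:
        for position, level2_id in enumerate(get_level2_ids(level1_id)):
            # Show the Level1 ID only on the first row of its group,
            # matching the layout sketched in the question.
            first_column = level1_id if position == 0 else ""
            rows.append([first_column, level2_id, get_level3_id(level2_id)])
    return rows


def write_rows(rows, out_file):
    """Write a header plus the rows to an open text file or buffer."""
    writer = csv.writer(out_file)
    writer.writerow(["Level1", "Level2", "Level3"])
    writer.writerows(rows)


def rows_to_csv_text(rows):
    """Return the CSV as a string, handy for checking output before saving."""
    buffer = io.StringIO()
    write_rows(rows, buffer)
    return buffer.getvalue()
```

To write to disk instead, open the file with open("output.csv", "w", newline="") and pass it to write_rows. If you would rather repeat the Level1 ID on every row (easier to load back into pandas), drop the position check.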
