I'm trying to do the following:
- Go to a web page and enter a search term.
- Get some data from the results page.
- That data in turn contains multiple URLs. I need to visit each one and parse it to get some more data.
I can do the first two steps. I do not understand how to visit all the URLs and extract the data from them (it is similar on every page, but not identical).
EDIT: More information: I read the search terms from a CSV file and get a few IDs (with URLs) from each results page. I'd like to follow each of these URLs to get more IDs from the page it leads to, and write all of them to a CSV file. Basically, I want my output to look something like this:
    Level1 ID1    Level2 ID1    Level3 ID
                  Level2 ID2    Level3 ID
                  .
                  .
                  .
                  Level2 IDN    Level3 ID
    Level1 ID2    Level2 ID1    Level3 ID
                  Level2 ID2    Level3 ID
                  .
                  .
                  .
                  Level2 IDN    Level3 ID
There can be multiple Level2 IDs for each Level1 ID, but each Level2 ID has exactly one corresponding Level3 ID.
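For reference, one possible way to produce that layout is sketched below. It is written for Python 3 and the standard `csv` module, whereas the code in this post is Python 2. The function names `rows_for_output` and `write_levels`, the header names, and the sample IDs are all made up for illustration. It assumes the scraped results are already held as a list of `(level1_id, [(level2_id, level3_id), ...])` tuples.

```python
import csv
import io


def rows_for_output(results):
    """Yield [level1, level2, level3] rows, showing each Level1 ID only once."""
    for level1_id, pairs in results:
        for index, (level2_id, level3_id) in enumerate(pairs):
            # Leave the first column blank after the first row of a group.
            first_col = level1_id if index == 0 else ""
            yield [first_col, level2_id, level3_id]


def write_levels(results, fileobj):
    """Write a header plus one row per Level2 ID to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["Level1 ID", "Level2 ID", "Level3 ID"])
    writer.writerows(rows_for_output(results))


# Made-up IDs, purely for illustration.
example = [
    ("L1-A", [("L2-A1", "L3-A1"), ("L2-A2", "L3-A2")]),
    ("L1-B", [("L2-B1", "L3-B1")]),
]

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
write_levels(example, buffer)
print(buffer.getvalue())
```

Repeating the Level1 ID on every row instead of blanking it would probably make the file easier to load back into pandas later; the blank-cell version simply mirrors the layout shown above.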
The code I've written so far:

    import re

    import pandas as pd
    from bs4 import BeautifulSoup
    from urllib import urlopen  # Python 2

    colnames = ['A', 'B', 'C', 'D']
    data = pd.read_csv('file.csv', names=colnames)
    listofdata = list(data.A)
    id = '\n'.join(listofdata[1:])  # skip the header; joins every ID into one string

    def download_gsm_number(gse_id):
        url = "http://www.example.com" + id  # uses the joined string, not gse_id
        readurl = urlopen(url)
        soup = BeautifulSoup(readurl)
        soup1 = str(soup)
        gsm_data = readurl.read()
        #url_file_handle.close()
        pattern = re.compile(r'''some(.*?)pattern''')
        data = pattern.findall(soup1)
        col_width = max(len(word) for row in data for word in row)
        for row in data:
            lines = "".join(row.ljust(col_width))
            sequence = ''.join([c for c in lines])
            print sequence

But this puts all the IDs into the URL at once. As I mentioned above, I need to get the Level2 IDs for each Level1 ID given as input, and then the Level3 ID for each Level2 ID. If I can get just one of those steps working (either the Level2 or the Level3 IDs), I can figure out the rest.
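To make the intended flow concrete, here is a rough sketch of the per-ID loop, written for Python 3 with only the standard library (the code in this post is Python 2 with BeautifulSoup). Much of it is assumption: `BASE_URL` and `LEVEL3_PATTERN` reuse the placeholders from this post, `LinkCollector`, `fetch`, and `crawl` are hypothetical names, and it assumes that every link on a Level1 page points to a Level2 page, that the link text is the Level2 ID, and that the Level3 ID is the first regex match on the Level2 page. A real page would almost certainly need a narrower link filter.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://www.example.com/"              # placeholder, as in the post
LEVEL3_PATTERN = re.compile(r"some(.*?)pattern")  # placeholder, as in the post


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(level1_ids, fetch_page=fetch):
    """Return [(level1_id, [(level2_id, level3_id), ...]), ...]."""
    results = []
    for level1_id in level1_ids:          # one request per ID, not one joined URL
        page_url = BASE_URL + level1_id
        collector = LinkCollector()
        collector.feed(fetch_page(page_url))
        pairs = []
        for href, level2_id in collector.links:
            # Resolve relative links against the page they were found on.
            level2_html = fetch_page(urljoin(page_url, href))
            match = LEVEL3_PATTERN.search(level2_html)
            pairs.append((level2_id, match.group(1) if match else ""))
        results.append((level1_id, pairs))
    return results
```

With the CSV column already loaded as in my code, the call would be something like `crawl(listofdata[1:])`. The `fetch_page` parameter is only there so the loop can be tried against canned HTML strings before hitting the real site.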