i've built a crawl for a page that has only 1 table and is already setup with columns and such. Pretty straight forward. this website has 3 different tables, broken out in random cells through out. I only need info from the first table. I've created a list of the info i need. Not sure how to organize it and get it to run by pulling urls from a csv file.
if i break it down to just one url i can print the info from the license. But i can't get it to work for multiple urls. i feel like i'm totally over complicating things.
Here are some examples of the urls i'm trying to run:
http://search.ccb.state.or.us/search/business_details.aspx?id=221851
http://search.ccb.state.or.us/search/business_details.aspx?id=221852
http://search.ccb.state.or.us/search/business_details.aspx?id=221853
The code is all jacked up, but here's what i've got
I appreciate any and all help
import csv
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup as BS
from email import encoders
import time
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
def get_page():
contents = []
with open('OR_urls.csv','r') as csvf:
urls = 'csv.reader(csvf)'
r = requests.get(url)
data = {}
data['biz_info_object'] = soup(id='MainContent_contractornamelabel')[0].text.strip()
data['lic_number_object'] = soup(id='MainContent_licenselabel')[0].text.strip()
data['lic_date_object'] = soup(id='MainContent_datefirstlabel')[0].text.strip()
data['lic_status_object'] = soup(id='MainContent_licensestatuslabel')[0].text.strip()
data['lic_exp_object'] = soup(id='MainContent_licenseexpirelabel')[0].text.strip()
data['biz_address_object'] = soup(id='MainContent_addresslabel')[0].text.strip()
data['biz_phone_object'] = soup(id='MainContent_phonelabel')[0].text.strip()
data['biz_address_object'] = soup(id='MainContent_endorsementlabel')[0].text.strip()
with open('OR_urls.csv','r') as csvf: # Open file in read mode
urls = csv.reader(csvf)
for url in urls:
page = ('get_page')
df1 = pd.read_html(page)
OR-urls.csvso we can run your script? Alsopage = ('get_page')doesn't make sense?