
I've built a crawl for a page that has only one table and is already set up with columns and such; pretty straightforward. This website has three different tables, broken out in random cells throughout, and I only need info from the first table. I've created a list of the info I need, but I'm not sure how to organize it and get it to run by pulling URLs from a CSV file.

If I break it down to just one URL, I can print the info from the license, but I can't get it to work for multiple URLs. I feel like I'm totally overcomplicating things.

Here are some examples of the URLs I'm trying to run:

http://search.ccb.state.or.us/search/business_details.aspx?id=221851
http://search.ccb.state.or.us/search/business_details.aspx?id=221852
http://search.ccb.state.or.us/search/business_details.aspx?id=221853

The code is all jacked up, but here's what I've got.

I appreciate any and all help.

import csv
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup as BS
from email import encoders
import time
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase

def get_page():
    contents = []
    with open('OR_urls.csv','r') as csvf:
        urls = 'csv.reader(csvf)'
        r = requests.get(url)


data = {}

data['biz_info_object'] = soup(id='MainContent_contractornamelabel')[0].text.strip()
data['lic_number_object'] = soup(id='MainContent_licenselabel')[0].text.strip()
data['lic_date_object']  = soup(id='MainContent_datefirstlabel')[0].text.strip()
data['lic_status_object']  = soup(id='MainContent_licensestatuslabel')[0].text.strip()
data['lic_exp_object']  = soup(id='MainContent_licenseexpirelabel')[0].text.strip()
data['biz_address_object']  = soup(id='MainContent_addresslabel')[0].text.strip()
data['biz_phone_object']  = soup(id='MainContent_phonelabel')[0].text.strip()
data['biz_address_object']  = soup(id='MainContent_endorsementlabel')[0].text.strip()


with open('OR_urls.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        page = ('get_page')
        df1 = pd.read_html(page)
  • Well, can you please fix the indentation? Commented Sep 23, 2018 at 10:51
  • sorry about that. thanks for catching it. Commented Sep 23, 2018 at 10:54
  • Can you add some sample URLs from OR-urls.csv so we can run your script? Also page = ('get_page') doesn't make sense? Commented Sep 23, 2018 at 15:47
  • Just added a few URLs to look at. Also, you are correct: page = ('get_page') doesn't make sense. I'm a newbie to this and learning as I go, and my brain went to mush after a couple of hours of trying to figure it out. I tried multiple solutions from posts on SO and probably intermingled a couple of them towards the end. :-/ Commented Sep 23, 2018 at 20:56

1 Answer

As you say, you appear to have combined a few different scripts. Hopefully the following will help you better understand the required structure.

I assume that your OR_urls.csv file contains your URLs in the first column. The script reads one row at a time from the CSV file and uses requests.get() to fetch the webpage. The response is then parsed with BeautifulSoup, your various elements are extracted from the page into a dictionary, and the dictionary is displayed along with the URL.

from bs4 import BeautifulSoup
import requests
import csv

with open('OR_urls.csv') as f_input:
    csv_input = csv.reader(f_input)

    for url in csv_input:
        r = requests.get(url[0])        # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")

        data = {}

        data['biz_info_object'] = soup.find(id='MainContent_contractornamelabel').get_text(strip=True)
        data['lic_number_object'] = soup.find(id='MainContent_licenselabel').get_text(strip=True)
        data['lic_date_object']  = soup.find(id='MainContent_datefirstlabel').get_text(strip=True)
        data['lic_status_object']  = soup.find(id='MainContent_licensestatuslabel').get_text(strip=True)
        data['lic_exp_object']  = soup.find(id='MainContent_licenseexpirelabel').get_text(strip=True)
        data['biz_address_object']  = soup.find(id='MainContent_addresslabel').get_text(strip=True)
        data['biz_phone_object']  = soup.find(id='MainContent_phonelabel').get_text(strip=True)
        data['biz_address_object'] = soup.find(id='MainContent_endorsementlabel').get_text(strip=True)  # note: this reuses the 'biz_address_object' key, so the endorsement overwrites the address above

        print(url[0], data)

Giving you the following output:

http://search.ccb.state.or.us/search/business_details.aspx?id=221851 {'biz_info_object': 'ANDREW LLOYD PARRY', 'lic_number_object': '221851', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(802) 779-7180'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221852 {'biz_info_object': 'SHANE MICHAEL DALLMAN', 'lic_number_object': '221852', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 933-5406'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221853 {'biz_info_object': 'INTEGRITY HOMES NW INC', 'lic_number_object': '221853', 'lic_date_object': '7/24/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/24/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 522-6055'}
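One caveat: soup.find() returns None when a page is missing one of these ids, so the chained .get_text() call would raise an AttributeError and stop the whole crawl. A small defensive helper (a sketch; the name safe_text is my own, not from the original script) avoids that:

```python
from bs4 import BeautifulSoup

def safe_text(soup, element_id, default=''):
    """Return the stripped text of the element with the given id,
    or a default value when the element is absent from the page."""
    element = soup.find(id=element_id)
    return element.get_text(strip=True) if element is not None else default

# Tiny demonstration with a one-element page
html = '<span id="MainContent_licenselabel"> 221851 </span>'
soup = BeautifulSoup(html, 'html.parser')

print(safe_text(soup, 'MainContent_licenselabel'))          # the id exists
print(safe_text(soup, 'MainContent_missinglabel', 'n/a'))   # the id is absent
```

Inside the loop you would then write, e.g., data['lic_number_object'] = safe_text(soup, 'MainContent_licenselabel') for each field.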

You can further improve this by creating a sequence of (column name, element id) pairs for all the ids you want, and using a dictionary comprehension to build each row. A csv.DictWriter() can then be used to write the data to a CSV file:

from bs4 import BeautifulSoup
import requests
import csv

objects = (
    ('biz_info_object', 'MainContent_contractornamelabel'),
    ('lic_number_object', 'MainContent_licenselabel'),
    ('lic_date_object', 'MainContent_datefirstlabel'),
    ('lic_status_object', 'MainContent_licensestatuslabel'),
    ('lic_exp_object', 'MainContent_licenseexpirelabel'),
    ('biz_address_object', 'MainContent_addresslabel'),
    ('biz_phone_object', 'MainContent_phonelabel'),
    ('biz_address_object', 'MainContent_endorsementlabel'),  # note: duplicate name; the endorsement overwrites the address column
)

with open('OR_urls.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.DictWriter(f_output, fieldnames=[name for name, id in objects])
    csv_output.writeheader()

    for url in csv_input:
        r = requests.get(url[0])        # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")
        data = {name : soup.find(id=id).get_text(strip=True) for name, id in objects}        
        csv_output.writerow(data)
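Since the comments ask about a DataFrame and an Excel export: the per-page dictionaries can be collected into a list and handed to pandas, whose DataFrame.to_excel() writes .xlsx files (using the openpyxl package under the hood). A sketch, with two hard-coded rows standing in for the scraped data:

```python
import pandas as pd

# In the real script, append each scraped `data` dict to this list inside the loop.
rows = [
    {'biz_info_object': 'ANDREW LLOYD PARRY', 'lic_number_object': '221851'},
    {'biz_info_object': 'SHANE MICHAEL DALLMAN', 'lic_number_object': '221852'},
]

df = pd.DataFrame(rows)             # one row per URL, one column per field
df.to_excel('output.xlsx', index=False)  # requires openpyxl to be installed
```

df.to_csv('output.csv', index=False) would give the CSV variant of the same table.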

5 Comments

That is awesome. lol...sleep deprivation is a bad thing. Thanks so much for your help with this.
You're welcome! It could also be improved a bit with a dictionary comprehension.
Once all the URLs have been run, it will throw the info into a CSV that will be automatically emailed to a few folks. Automating processes like this has saved me sooooo much time. So, again, I really appreciate the help.
Hey Martin, I was wondering if there was a way to set this to a data frame that I can export to Excel. (I hope I'm using the right verbiage.)
You could easily write it as a CSV, I have updated the script to show how. If Excel format is needed I would suggest looking at the openpyxl library.
