
I've built a crawl for a page that has only one table and is already set up with columns and such; pretty straightforward. This website has three different tables, broken out in random cells throughout, and I only need info from the first table. I've created a list of the info I need, but I'm not sure how to organize it and get it to run by pulling URLs from a CSV file.

If I break it down to just one URL, I can print the info from the license, but I can't get it to work for multiple URLs. I feel like I'm totally overcomplicating things.

Here are some examples of the URLs I'm trying to run:

http://search.ccb.state.or.us/search/business_details.aspx?id=221851
http://search.ccb.state.or.us/search/business_details.aspx?id=221852
http://search.ccb.state.or.us/search/business_details.aspx?id=221853

The code is all jacked up, but here's what I've got.

I appreciate any and all help.

import csv
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup as BS
from email import encoders
import time
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase

def get_page():
    contents = []
    with open('OR_urls.csv','r') as csvf:
        urls = 'csv.reader(csvf)'
        r = requests.get(url)


data = {}

data['biz_info_object'] = soup(id='MainContent_contractornamelabel')[0].text.strip()
data['lic_number_object'] = soup(id='MainContent_licenselabel')[0].text.strip()
data['lic_date_object']  = soup(id='MainContent_datefirstlabel')[0].text.strip()
data['lic_status_object']  = soup(id='MainContent_licensestatuslabel')[0].text.strip()
data['lic_exp_object']  = soup(id='MainContent_licenseexpirelabel')[0].text.strip()
data['biz_address_object']  = soup(id='MainContent_addresslabel')[0].text.strip()
data['biz_phone_object']  = soup(id='MainContent_phonelabel')[0].text.strip()
data['biz_address_object']  = soup(id='MainContent_endorsementlabel')[0].text.strip()


with open('OR_urls.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        page = ('get_page')
        df1 = pd.read_html(page)
  • Well, can you please fix the indentation? Commented Sep 23, 2018 at 10:51
  • sorry about that. thanks for catching it. Commented Sep 23, 2018 at 10:54
  • Can you add some sample URLs from OR-urls.csv so we can run your script? Also page = ('get_page') doesn't make sense? Commented Sep 23, 2018 at 15:47
  • Just added a few URLs to look at. Also, you are correct: page = ('get_page') doesn't make sense. I'm a newbie to this and learning as I go, and my brain went to mush after a couple of hours of trying to figure it out. I tried multiple solutions from posts on SO and probably intermingled a couple of them towards the end. :-/ Commented Sep 23, 2018 at 20:56

1 Answer

As you say, you appear to have combined a few different scripts. Hopefully the following will help you better understand the required structure.

I assume that your OR_urls.csv file contains your URLs in the first column. The script reads one row at a time from the CSV file and uses requests.get() to fetch the webpage. The response is then parsed with BeautifulSoup, your various elements are extracted from the page into a dictionary, and the dictionary is displayed along with the URL.

from bs4 import BeautifulSoup
import requests
import csv

with open('OR_urls.csv') as f_input:
    csv_input = csv.reader(f_input)

    for url in csv_input:
        r = requests.get(url[0])        # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")

        data = {}

        data['biz_info_object'] = soup.find(id='MainContent_contractornamelabel').get_text(strip=True)
        data['lic_number_object'] = soup.find(id='MainContent_licenselabel').get_text(strip=True)
        data['lic_date_object']  = soup.find(id='MainContent_datefirstlabel').get_text(strip=True)
        data['lic_status_object']  = soup.find(id='MainContent_licensestatuslabel').get_text(strip=True)
        data['lic_exp_object']  = soup.find(id='MainContent_licenseexpirelabel').get_text(strip=True)
        data['biz_address_object']  = soup.find(id='MainContent_addresslabel').get_text(strip=True)
        data['biz_phone_object']  = soup.find(id='MainContent_phonelabel').get_text(strip=True)
        data['biz_address_object'] = soup.find(id='MainContent_endorsementlabel').get_text(strip=True)  # note: this reuses the 'biz_address_object' key, so the endorsement overwrites the address above

        print(url[0], data)

Giving you the following output:

http://search.ccb.state.or.us/search/business_details.aspx?id=221851 {'biz_info_object': 'ANDREW LLOYD PARRY', 'lic_number_object': '221851', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(802) 779-7180'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221852 {'biz_info_object': 'SHANE MICHAEL DALLMAN', 'lic_number_object': '221852', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 933-5406'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221853 {'biz_info_object': 'INTEGRITY HOMES NW INC', 'lic_number_object': '221853', 'lic_date_object': '7/24/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/24/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 522-6055'}
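One caveat: soup.find() returns None when a page is missing one of these ids, so the chained .get_text() call would raise an AttributeError and stop the whole crawl. A small defensive helper (a sketch; the name safe_text is my own, not from the original script) avoids that:

```python
from bs4 import BeautifulSoup

def safe_text(soup, element_id, default=''):
    """Return the stripped text of the element with the given id,
    or a default value when the element is absent from the page."""
    element = soup.find(id=element_id)
    return element.get_text(strip=True) if element is not None else default

# Tiny demonstration with a one-element page
html = '<span id="MainContent_licenselabel"> 221851 </span>'
soup = BeautifulSoup(html, 'html.parser')

print(safe_text(soup, 'MainContent_licenselabel'))          # the id exists
print(safe_text(soup, 'MainContent_missinglabel', 'n/a'))   # the id is absent
```

Inside the loop you would then write, e.g., data['lic_number_object'] = safe_text(soup, 'MainContent_licenselabel') for each field.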

You can further improve this by creating a sequence of (column name, element id) pairs for all the ids you want, and using a dictionary comprehension to build each row. A csv.DictWriter() can then be used to write the data to a CSV file:

from bs4 import BeautifulSoup
import requests
import csv

objects = (
    ('biz_info_object', 'MainContent_contractornamelabel'),
    ('lic_number_object', 'MainContent_licenselabel'),
    ('lic_date_object', 'MainContent_datefirstlabel'),
    ('lic_status_object', 'MainContent_licensestatuslabel'),
    ('lic_exp_object', 'MainContent_licenseexpirelabel'),
    ('biz_address_object', 'MainContent_addresslabel'),
    ('biz_phone_object', 'MainContent_phonelabel'),
    ('biz_address_object', 'MainContent_endorsementlabel'),  # note: duplicate name; the endorsement overwrites the address column
)

with open('OR_urls.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.DictWriter(f_output, fieldnames=[name for name, id in objects])
    csv_output.writeheader()

    for url in csv_input:
        r = requests.get(url[0])        # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")
        data = {name : soup.find(id=id).get_text(strip=True) for name, id in objects}        
        csv_output.writerow(data)
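Since the comments ask about a DataFrame and an Excel export: the per-page dictionaries can be collected into a list and handed to pandas, whose DataFrame.to_excel() writes .xlsx files (using the openpyxl package under the hood). A sketch, with two hard-coded rows standing in for the scraped data:

```python
import pandas as pd

# In the real script, append each scraped `data` dict to this list inside the loop.
rows = [
    {'biz_info_object': 'ANDREW LLOYD PARRY', 'lic_number_object': '221851'},
    {'biz_info_object': 'SHANE MICHAEL DALLMAN', 'lic_number_object': '221852'},
]

df = pd.DataFrame(rows)             # one row per URL, one column per field
df.to_excel('output.xlsx', index=False)  # requires openpyxl to be installed
```

df.to_csv('output.csv', index=False) would give the CSV variant of the same table.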

5 Comments

That is awesome. lol...sleep deprivation is a bad thing. Thanks so much for your help with this.
You're welcome! It could also be improved a bit with a dictionary comprehension.
Once all the URLs have been run, it will throw the info into a CSV that will be automatically emailed to a few folks. Automating processes like this has saved me sooooo much time. So, again, I really appreciate the help.
Hey Martin, I was wondering if there was a way to set this to a data frame that I can export to Excel. (I hope I'm using the right verbiage.)
You could easily write it as a CSV, I have updated the script to show how. If Excel format is needed I would suggest looking at the openpyxl library.
