
I am trying to scrape data from a few websites for a proof-of-concept project. I am currently using Python 3 with BS4 to collect the required data. I have a dictionary of URLs from three sites, and each site requires a different method to collect the data because their HTML differs. I have been using a try/except/else stack, but I keep running into issues. If you could have a look at my code and help me fix it, that would be great!

As I add more sites to be scraped, I will not be able to keep cycling through methods with a try/except/else stack to find the correct way to scrape each one. How can I future-proof this code so that I can add as many websites as I like and scrape data from various elements within them in the future?

# Scraping Script Here:

import re

import requests
from bs4 import BeautifulSoup
from lxml import etree


def job():

prices = {

    # LIVEPRICES

    "LIVEAUOZ":    {"url": "https://www.gold.co.uk/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "LiveAUOz"},


    # GOLD

    "GLDAU_BRITANNIA":    {"url": "https://www.gold.co.uk/gold-coins/gold-britannia-coins/britannia-one-ounce-gold-coin-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "Britannia"},
    "GLDAU_PHILHARMONIC": {"url": "https://www.gold.co.uk/gold-coins/austrian-gold-philharmoinc-coins/austrian-gold-philharmonic-coin/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "Philharmonic"},
    "GLDAU_MAPLE":        {"url":    "https://www.gold.co.uk/gold-coins/canadian-gold-maple-coins/canadian-gold-maple-coin/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "Maple"},

    # SILVER

    "GLDAG_BRITANNIA":    {"url": "https://www.gold.co.uk/silver-coins/silver-britannia-coins/britannia-one-ounce-silver-coin-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Silver",
                           "type":   "Britannia"},
    "GLDAG_PHILHARMONIC": {"url": "https://www.gold.co.uk/silver-coins/austrian-silver-philharmonic-coins/silver-philharmonic-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Silver",
                           "type":   "Philharmonic"}

}

response = requests.get(
    'https://www.gold.co.uk/silver-price/')
soup = BeautifulSoup(response.text, 'html.parser')
AG_GRAM_SPOT = soup.find(
    'span', {'name': 'current_price_field'}).get_text()

# Convert to float
AG_GRAM_SPOT = float(re.sub(r"[^0-9\.]", "", AG_GRAM_SPOT))
# No need for another lookup
AG_OUNCE_SPOT = AG_GRAM_SPOT * 31.1035

for coin in prices:
    response = requests.get(prices[coin]["url"])
    soup = BeautifulSoup(response.text, 'html.parser')

    try:
        text_price = soup.find(
            'td', {'id': 'total-price-inc-vat-1'}).get_text()        # <-- Method 1

    except:
        text_price = soup.find(
            'td', {'id': 'total-price-inc-vat-1'}).get_text()        # <-- Method 2

    else:
        text_price = soup.find(
            'td', {'class': 'gold-price-per-ounce'}).get_text()

    # Grab the number
    prices[coin]["price"] = float(re.sub(r"[^0-9\.]", "", text_price))

# ============================================================================

root = etree.Element("root")

for coin in prices:
    coinx = etree.Element("coin")
    etree.SubElement(coinx, "trader", {
                     'variable': coin}).text = prices[coin]["trader"]
    etree.SubElement(coinx, "metal").text = prices[coin]["metal"]
    etree.SubElement(coinx, "type").text = prices[coin]["type"]
    etree.SubElement(coinx, "price").text = (
        "£") + str(prices[coin]["price"])
    root.append(coinx)

fName = './templates/data.xml'
with open(fName, 'wb') as f:
    f.write(etree.tostring(root, xml_declaration=True,
                           encoding="utf-8", pretty_print=True))
  • "I keep running into issues" please be more specific. Any errors? just the "future"? ...? Commented Jul 20, 2020 at 18:31
  • 1
    If i got correctly the documentation, .find returns None if anything is find. So maybe you can just use ordinary if-elif-else? Commented Jul 20, 2020 at 18:39
  • Will give this a try! Commented Jul 20, 2020 at 20:07
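
The if-elif approach suggested in the comments can be sketched as follows. This is a minimal sketch that reuses the selectors from the question's code (they may not match the live sites): since BeautifulSoup's .find() returns None when nothing matches, each site's layout can be probed in turn without try/except.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Try each site's known selector in turn; .find() returns None on no match."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find("td", {"id": "total-price-inc-vat-1"})
    if element is None:
        element = soup.find("td", {"class": "gold-price-per-ounce"})
    if element is None:
        element = soup.find("span", {"name": "current_price_field"})
    return element.get_text() if element is not None else None


print(extract_price('<td id="total-price-inc-vat-1">£1,500.00</td>'))  # £1,500.00
```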

1 Answer


Add a config for the scraping where each config is something like this:

prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": lambda x: float(re.sub(r"[^0-9\.]", "", x))
        }

    }
}

Use the selector part of price to get the relevant part of the HTML, then parse it with the parser function.

e.g.

for key, config in prices.items():
    response = requests.get(config['url'])
    soup = BeautifulSoup(response.text, 'html.parser')
    price_element = soup.select_one(config['price']['selector'])  # select_one accepts a CSS selector; find does not
    if price_element:
        AG_GRAM_SPOT = price_element.get_text()
        # convert to float
        AG_GRAM_SPOT = config['price']['parser'](AG_GRAM_SPOT)
        # etc

You can modify the config object as needed, but it will probably look very similar for most sites. For example, the text parsing may well always be the same, so instead of a lambda, define a named function with def.

def textParser(text):
    return float(re.sub(r"[^0-9\.]", "", text))
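
A quick check of its behaviour (the sample strings below are illustrative): the regex strips everything except digits and the decimal point before converting to float.

```python
import re


def textParser(text):
    # Strip currency symbols and thousands separators, keep digits and "."
    return float(re.sub(r"[^0-9\.]", "", text))


print(textParser("£1,234.56"))  # 1234.56
print(textParser("£25"))        # 25.0
```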

Then add the reference to textParser in the config.

prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": textParser
        }

    }
}

These steps will let you write generic code, doing away with all those try/excepts.
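
Putting the pieces together, an end-to-end sketch might look like this. The selector and fetch function below are placeholders (the real code would pass requests.get(url).text as the fetch); select_one is used because the config holds a CSS selector.

```python
import re

from bs4 import BeautifulSoup


def text_parser(text):
    return float(re.sub(r"[^0-9\.]", "", text))


# Placeholder selector; each site gets its own entry.
prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "price": {"selector": "span.live-price", "parser": text_parser},
    },
}


def scrape_prices(configs, fetch):
    """Generic loop: fetch each URL, select the price element, parse it."""
    results = {}
    for key, config in configs.items():
        soup = BeautifulSoup(fetch(config["url"]), "html.parser")
        element = soup.select_one(config["price"]["selector"])
        if element is not None:  # skip sites whose layout changed
            results[key] = config["price"]["parser"](element.get_text())
    return results


# Offline demo with canned HTML standing in for requests.get:
fake_fetch = lambda url: '<span class="live-price">£23.50</span>'
print(scrape_prices(prices, fake_fetch))  # {'LIVEAUOZ': 23.5}
```

Injecting the fetch function also makes the scraper testable without network access.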


7 Comments

Will give this a go! Cheers
The last part of the script, where should this be placed in the code?
I've modified the answer to show the last part in the config. It's just an example of using a declared function instead of an anonymous lambda function.
That just means there is some format error in what you've typed. Maybe a missing comma or something else.
Your indentation was incorrect - the loops were not inside your job function. I've also moved the textParser out of the job function so that it sits on its own. pastebin.com/8kufKLPX. You should also consider using cron to run your script instead of timers in the script itself.
