
I am trying to scrape data from a few websites for a proof-of-concept project. I am currently using Python 3 with BS4 to collect the required data. I have a dictionary of URLs from three sites, and each site requires a different method to collect the data because their HTML differs. I have been using a try/except/else stack, but I keep running into issues. If you could have a look at my code and help me fix it, that would be great!

As I add more sites to be scraped, I will not be able to keep cycling through methods with a try/except/else stack to find the correct way to scrape each one. How can I future-proof this code so that I can add as many websites as I like and scrape data from various elements within them in the future?

# Scraping Script Here:

import re

import requests
from bs4 import BeautifulSoup
from lxml import etree


def job():

prices = {

    # LIVEPRICES

    "LIVEAUOZ":    {"url": "https://www.gold.co.uk/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "LiveAUOz"},


    # GOLD

    "GLDAU_BRITANNIA":    {"url": "https://www.gold.co.uk/gold-coins/gold-britannia-coins/britannia-one-ounce-gold-coin-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "Britannia"},
    "GLDAU_PHILHARMONIC": {"url": "https://www.gold.co.uk/gold-coins/austrian-gold-philharmoinc-coins/austrian-gold-philharmonic-coin/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "Philharmonic"},
    "GLDAU_MAPLE":        {"url":    "https://www.gold.co.uk/gold-coins/canadian-gold-maple-coins/canadian-gold-maple-coin/",
                           "trader": "Gold.co.uk",
                           "metal":  "Gold",
                           "type":   "Maple"},

    # SILVER

    "GLDAG_BRITANNIA":    {"url": "https://www.gold.co.uk/silver-coins/silver-britannia-coins/britannia-one-ounce-silver-coin-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Silver",
                           "type":   "Britannia"},
    "GLDAG_PHILHARMONIC": {"url": "https://www.gold.co.uk/silver-coins/austrian-silver-philharmonic-coins/silver-philharmonic-2020/",
                           "trader": "Gold.co.uk",
                           "metal":  "Silver",
                           "type":   "Philharmonic"}

}

response = requests.get(
    'https://www.gold.co.uk/silver-price/')
soup = BeautifulSoup(response.text, 'html.parser')
AG_GRAM_SPOT = soup.find(
    'span', {'name': 'current_price_field'}).get_text()

# Convert to float
AG_GRAM_SPOT = float(re.sub(r"[^0-9\.]", "", AG_GRAM_SPOT))
# No need for another lookup
AG_OUNCE_SPOT = AG_GRAM_SPOT * 31.1035

for coin in prices:
    response = requests.get(prices[coin]["url"])
    soup = BeautifulSoup(response.text, 'html.parser')

    try:
        text_price = soup.find(
            'td', {'id': 'total-price-inc-vat-1'}).get_text()        # <-- Method 1

    except:
        text_price = soup.find(
            'td', {'id': 'total-price-inc-vat-1'}).get_text()        # <-- Method 2

    else:
        text_price = soup.find(
            'td', {'class': 'gold-price-per-ounce'}).get_text()

    # Grab the number
    prices[coin]["price"] = float(re.sub(r"[^0-9\.]", "", text_price))

# ============================================================================

root = etree.Element("root")

for coin in prices:
    coinx = etree.Element("coin")
    etree.SubElement(coinx, "trader", {
                     'variable': coin}).text = prices[coin]["trader"]
    etree.SubElement(coinx, "metal").text = prices[coin]["metal"]
    etree.SubElement(coinx, "type").text = prices[coin]["type"]
    etree.SubElement(coinx, "price").text = (
        "£") + str(prices[coin]["price"])
    root.append(coinx)

fName = './templates/data.xml'
with open(fName, 'wb') as f:
    f.write(etree.tostring(root, xml_declaration=True,
                           encoding="utf-8", pretty_print=True))
  • "I keep running into issues" please be more specific. Any errors? just the "future"? ...? Commented Jul 20, 2020 at 18:31
  • 1
    If i got correctly the documentation, .find returns None if anything is find. So maybe you can just use ordinary if-elif-else? Commented Jul 20, 2020 at 18:39
  • Will give this a try! Commented Jul 20, 2020 at 20:07
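
The if-elif approach suggested in the comments can be sketched as follows. This is a minimal sketch that reuses the selectors from the question's code (they may not match the live sites): since BeautifulSoup's .find() returns None when nothing matches, each site's layout can be probed in turn without try/except.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Try each site's known selector in turn; .find() returns None on no match."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find("td", {"id": "total-price-inc-vat-1"})
    if element is None:
        element = soup.find("td", {"class": "gold-price-per-ounce"})
    if element is None:
        element = soup.find("span", {"name": "current_price_field"})
    return element.get_text() if element is not None else None


print(extract_price('<td id="total-price-inc-vat-1">£1,500.00</td>'))  # £1,500.00
```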

1 Answer


Add a config for the scraping where each config is something like this:

prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": lambda x: float(re.sub(r"[^0-9\.]", "", x))
        }

    }
}

Use the selector part of price to get the relevant part of the HTML, then parse it with the parser function.

e.g.

for key, config in prices.items():
    response = requests.get(config['url'])
    soup = BeautifulSoup(response.text, 'html.parser')
    price_element = soup.select_one(config['price']['selector'])  # select_one accepts a CSS selector; find does not
    if price_element:
        AG_GRAM_SPOT = price_element.get_text()
        # convert to float
        AG_GRAM_SPOT = config['price']['parser'](AG_GRAM_SPOT)
        # etc

You can modify the config object as needed, but it will probably look very similar for most sites. For example, the text parsing may well always be the same, so instead of a lambda, define a named function with def.

def textParser(text):
    return float(re.sub(r"[^0-9\.]", "", text))
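
A quick check of its behaviour (the sample strings below are illustrative): the regex strips everything except digits and the decimal point before converting to float.

```python
import re


def textParser(text):
    # Strip currency symbols and thousands separators, keep digits and "."
    return float(re.sub(r"[^0-9\.]", "", text))


print(textParser("£1,234.56"))  # 1234.56
print(textParser("£25"))        # 25.0
```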

Then add the reference to textParser in the config.

prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": textParser
        }

    }
}

These steps will let you write generic code, doing away with all those try/excepts.
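
Putting the pieces together, an end-to-end sketch might look like this. The selector and fetch function below are placeholders (the real code would pass requests.get(url).text as the fetch); select_one is used because the config holds a CSS selector.

```python
import re

from bs4 import BeautifulSoup


def text_parser(text):
    return float(re.sub(r"[^0-9\.]", "", text))


# Placeholder selector; each site gets its own entry.
prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "price": {"selector": "span.live-price", "parser": text_parser},
    },
}


def scrape_prices(configs, fetch):
    """Generic loop: fetch each URL, select the price element, parse it."""
    results = {}
    for key, config in configs.items():
        soup = BeautifulSoup(fetch(config["url"]), "html.parser")
        element = soup.select_one(config["price"]["selector"])
        if element is not None:  # skip sites whose layout changed
            results[key] = config["price"]["parser"](element.get_text())
    return results


# Offline demo with canned HTML standing in for requests.get:
fake_fetch = lambda url: '<span class="live-price">£23.50</span>'
print(scrape_prices(prices, fake_fetch))  # {'LIVEAUOZ': 23.5}
```

Injecting the fetch function also makes the scraper testable without network access.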


7 Comments

Will give this a go! Cheers
The last part of the script, where should this be placed in the code?
I've modified the answer to show the last part in the config. It's just an example of using a declared function instead of an anonymous lambda function.
That just means there is some format error in what you've typed. Maybe a missing comma or something else.
Your indentation was incorrect - the loops were not inside your job function. I've also moved the textParser out of the job function so that it sits on its own. pastebin.com/8kufKLPX. You should also consider using cron to run your script instead of timers in the script itself.
