Recursive Web Scraping with Python Beautiful Soup

Question

I wrote a short program that lets a user specify a starting page in the Discogs Wiki Style Guide, scrapes the other styles listed on that page, and then outputs a graph (represented here as a dictionary of sets) of the relationships between subgenres.

I'm looking for guidance/critique on: (1) How to clean up the request_page function. I think there is a more elegant way of both getting the href attributes and filtering to only those with "/style/". (2) The general structure of the program. I'm self-taught and a relative beginner, so I'd highly appreciate it if anyone could point out general irregularities.

import re
import requests 
from bs4 import BeautifulSoup 

def get_related_styles(start):

    def request_page(start):

        response = requests.get('{0}{1}'.format(base_style_url, start))
        soup = BeautifulSoup(response.content,'lxml')

        ## these lines feel inelegant. considered solutions with
        ## soup.findAll('a', attrs = {'href': pattern.match})

        # href=True skips anchors that have no href, which would otherwise yield None
        urls = [anchor.get('href') for anchor in soup.findAll('a', href=True)]
        pattern = re.compile(r'/style/[a-zA-Z0-9\-]*[^/]') # can use lookback regex w/ escape chars?
        style_urls = {pattern.match(url).group().replace('/style/','') for url in urls if pattern.match(url)}

        return style_urls

    def connect_styles(start , style_2):

        ## Nodes should not connect to self
        ## Note that styles are directed - e.g. (A ==> B) =/=> (B ==> A)

        if start != style_2:
            if start not in all_styles.keys():
                all_styles[start] = {style_2}

            else:
                all_styles[start].add(style_2)

        if style_2 not in do_not_visit:
            do_not_visit.add(style_2)
            get_related_styles(style_2)

    style_urls = request_page(start)

    for new_style in style_urls:
        connect_styles(start,new_style)

Example Use:

start = 'Avant-garde-Jazz'
base_style_url = 'https://reference.discogslabs.com/style/'

all_styles = {}
do_not_visit = {start}

get_related_styles(start)

print(all_styles)
{'Free-Jazz': {'Free-Improvisation', 'Free-Funk'}, 'Free-Improvisation': {'Free-Jazz', 'Avant-garde-Jazz'}, 'Avant-garde-Jazz': {'Free-Jazz'}, 'Free-Funk': {'Free-Jazz'}}

alecxe · Accepted Answer · 2018-01-04 04:30:53Z

There is a simpler way to filter out the "style" links - using a CSS selector with a partial match on the href attribute:

style_urls = {anchor['href'].replace('/style/', '')
              for anchor in soup.select('a[href^="/style/"]')}

where ^= means "starts with".
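
To try the selector without hitting the network, here is a minimal sketch that runs it against a small inline HTML snippet. The markup is made up for illustration and is not copied from the real site, and the built-in "html.parser" is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a style page (assumption, not real site content).
html = """
<a href="/style/Free-Jazz">Free Jazz</a>
<a href="/style/Free-Improvisation">Free Improvisation</a>
<a href="/genre/Jazz">Jazz</a>
<a>anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# a[href^="/style/"] keeps only anchors whose href starts with "/style/".
# Anchors with no href never match, so anchor["href"] cannot raise KeyError here.
style_urls = {
    anchor["href"].replace("/style/", "")
    for anchor in soup.select('a[href^="/style/"]')
}

print(sorted(style_urls))  # ['Free-Improvisation', 'Free-Jazz']
```

The "/genre/Jazz" link and the anchor without an href are both dropped by the selector itself, so no separate filtering step is needed.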

Here we, of course, lose the check we had on the style name part of the href. If this check is really needed, we can also use a regular expression to match the desired style links directly:

pattern = re.compile(r'/style/([a-zA-Z0-9\-]+)')
style_urls = {pattern.search(anchor['href']).group(1)
              for anchor in soup('a', href=pattern)}

soup() here is a short way of doing soup.find_all().
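
When a capture group is used this way, it is worth checking what the group actually returns before wiring it into the scraper. The sketch below uses only the standard re module on a few made-up href strings. The first pattern is an assumed form in which the whole style name sits inside the group. The second shows that a trailing [^/] placed outside the group consumes the last character of the name.

```python
import re

# Assumed pattern for illustration: "+" keeps the whole style name inside the group.
pattern = re.compile(r"/style/([a-zA-Z0-9\-]+)")
# Variant with a trailing [^/] outside the group, shown for comparison.
trailing = re.compile(r"/style/([a-zA-Z0-9\-]*)[^/]")

# Hypothetical href values, not scraped from the real site.
hrefs = ["/style/Free-Jazz", "/style/Free-Funk", "/genre/Jazz", "/style/"]

# search() returns None for non-matching hrefs, which the "if m" filter drops.
matches = (pattern.search(href) for href in hrefs)
style_names = {m.group(1) for m in matches if m}

print(sorted(style_names))  # ['Free-Funk', 'Free-Jazz']
print(trailing.search("/style/Free-Jazz").group(1))  # Free-Jaz
```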
