Extracting values from HTML with Python

Question

I'm having some trouble extracting player IDs from a site's HTML. I've done this before without issue, but the hrefs in this HTML are a bit different and have me stumped. Below is a portion of the HTML and the script I've put together, which prints {} for each row. The ID below is 'lynnla01' and appears in the HTML twice, so extracting either occurrence would be fine. Any help would be greatly appreciated.

HTML:

<tr data-row="248">
   <th scope="row" class="right " data-stat="ranker" csk="240">1</th>
   <td class="left " data-append-csv="lynnla01" data-stat="player">
      <a href="/players/l/lynnla01.shtml">Lance Lynn</a>

One of my attempts:

ID = []

for tag in soup.select('a[href^=/players]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)

print(ID)

This is just a typo. You need to use selectors with the following format: soup.select('a[href^="/players"]').
– Keyur Potdar, Commented Apr 24, 2018 at 6:03
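
Separately from the selector quoting, the {} in the output can be traced to parse_qs, which parses 'key=value' query strings rather than URL paths. A minimal sketch using only the standard library, assuming the href always ends in '<id>.shtml' as in the sample row (the variable names are illustrative):

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlparse

href = '/players/l/lynnla01.shtml'

# parse_qs expects 'key=value' pairs; this path has none, so the result is empty
query = parse_qs(href)
print(query)  # {}

# Take the path component, then drop the directories and the '.shtml' suffix
player_id = PurePosixPath(urlparse(href).path).stem
print(player_id)  # lynnla01
```

Taking the stem of the path avoids a regular expression, but it relies on the ID being the last path segment.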

innicoder · Accepted Answer · 2018-04-23 18:51:44Z

Using built-in string methods and BeautifulSoup:

from bs4 import BeautifulSoup as bs

html = '''<tr data-row="248">
   <th scope="row" class="right " data-stat="ranker" csk="240">1</th>
   <td class="left " data-append-csv="lynnla01" data-stat="player">
      <a href="/players/l/lynnla01.shtml">Lance Lynn</a>'''

soup = bs(html, 'lxml')

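# Collect every <a> tag, then keep those whose href starts with /players
hrefs = soup.find_all('a')

for a_tag in hrefs:
    # .get avoids a KeyError on anchors that have no href attribute
    if a_tag.get('href', '').startswith('/players'):
        print(a_tag['href'])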

With regular expressions:

import re

# Match the href and capture the player ID (e.g. 'lynnla01') in group 1
regex = re.compile(r'/players/\w/(\w+)\.shtml')
a_tags = soup.find_all('a', href=regex)
# print(a_tags) shows the matched tags; loop over them and print(i['href']) for the links

To print only the ID you asked for:

for i in a_tags:
    only_specific = regex.match(i['href'])
    print(only_specific.group(1))
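
Since the question notes that the ID appears twice in the HTML, another option is to read it from the data-append-csv attribute instead of the link. A minimal sketch, assuming every player cell carries that attribute as in the sample row; the 'html.parser' backend is used here only to avoid depending on lxml:

```python
from bs4 import BeautifulSoup

html = '''<tr data-row="248">
   <th scope="row" class="right " data-stat="ranker" csk="240">1</th>
   <td class="left " data-append-csv="lynnla01" data-stat="player">
      <a href="/players/l/lynnla01.shtml">Lance Lynn</a>'''

soup = BeautifulSoup(html, 'html.parser')

# Select only the cells that have a data-append-csv attribute and read its value
player_ids = [td['data-append-csv'] for td in soup.select('td[data-append-csv]')]
print(player_ids)  # ['lynnla01']
```

This skips string parsing entirely, at the cost of depending on an attribute the site may not provide on every row.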

edited Apr 23, 2018 at 18:51

answered Apr 23, 2018 at 18:23

innicoder

2 Comments

Nick Over a year ago

Very nice. I made some slight alterations but this definitely helped me out. Now if I can just strip out 'lynnla01' from '/players/l/lynnla01.shtml' I'll be good to go!

innicoder Over a year ago

Done. Note that .group(1) returns only the specific name, while .group() returns the full matched string.
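
The difference between the two calls can be seen in a short standalone sketch. The pattern below is an illustrative assumption with a single capturing group around the ID, applied to the href from the question:

```python
import re

href = '/players/l/lynnla01.shtml'

# One capturing group: the word characters between the last '/' and '.shtml'
pattern = re.compile(r'/players/\w/(\w+)\.shtml')
match = pattern.match(href)

full = match.group()        # the entire matched text
player_id = match.group(1)  # only the text captured by the first group

print(full)       # /players/l/lynnla01.shtml
print(player_id)  # lynnla01
```

Note that .group(1) only works when the pattern contains at least one capturing group; without one it raises IndexError.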
