I'm having some trouble extracting player IDs from a site's HTML. I've done this before without an issue, but the hrefs in this specific HTML are a bit different and have me stumped. Below is a portion of the HTML and the script I've put together, which prints {} for each row. The ID below is 'lynnla01' and appears in the HTML twice, so extracting either occurrence would be fine. Any help would be greatly appreciated.

HTML:

<tr data-row="248">
   <th scope="row" class="right " data-stat="ranker" csk="240">1</th>
   <td class="left " data-append-csv="lynnla01" data-stat="player">
      <a href="/players/l/lynnla01.shtml">Lance Lynn</a>

One of my attempts:

ID = []

for tag in soup.select('a[href^=/players]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)

print(ID)
  • Take a look at BeautifulSoup module Commented Apr 23, 2018 at 18:19
  • This is just a typo. You need to use selectors with the following format: soup.select('a[href^="/players"]'). Commented Apr 24, 2018 at 6:03

1 Answer

Using built-in string methods and BeautifulSoup:

from bs4 import BeautifulSoup as bs

html = '''<tr data-row="248">
   <th scope="row" class="right " data-stat="ranker" csk="240">1</th>
   <td class="left " data-append-csv="lynnla01" data-stat="player">
      <a href="/players/l/lynnla01.shtml">Lance Lynn</a>'''

soup = bs(html, 'lxml')

# href=True skips anchors without an href, so a_tag['href'] cannot raise KeyError
hrefs = soup.find_all('a', href=True)

for a_tag in hrefs:
    if a_tag['href'].startswith('/players'):
        print(a_tag['href'])

With regular expressions:

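import re

# The capture group (\w+) isolates the player ID between the last slash and ".shtml"
regex = re.compile(r'/players/\w/(\w+)\.shtml')
a_tags = soup.find_all('a', href=regex)
# print(a_tags) shows the matching tags; loop over them and print(i['href']) to see the links

To print the specific piece of string you asked for:

for i in a_tags:
    only_specific = re.match(regex, i['href'])
    print(only_specific.group(1))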
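
The question mentions that the ID appears twice in the HTML, the second time in the data-append-csv attribute of the <td>. As a minimal sketch of that alternative, the snippet below reads the attribute directly instead of parsing the href. It assumes every player cell carries data-append-csv, which the single sample row cannot confirm, and it uses the 'html.parser' backend and adds closing tags to the sample purely for convenience.

```python
from bs4 import BeautifulSoup

# Sample row from the question, with closing tags added (an assumption for tidiness).
html = '''<table><tr data-row="248">
   <th scope="row" class="right " data-stat="ranker" csk="240">1</th>
   <td class="left " data-append-csv="lynnla01" data-stat="player">
      <a href="/players/l/lynnla01.shtml">Lance Lynn</a></td></tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

# td[data-append-csv] matches only cells that actually have the attribute.
player_ids = [td['data-append-csv'] for td in soup.select('td[data-append-csv]')]

print(player_ids)  # ['lynnla01']
```

This avoids any string slicing, at the cost of depending on an attribute the site could rename.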

2 Comments

Very nice. I made some slight alterations but this definitely helped me out. Now if I can just strip out 'lynnla01' from '/players/l/lynnla01.shtml' I'll be good to go!
Done. Note that .group(1) returns only the text captured by the first group (the ID), while .group() returns the full matched string.
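
To make the difference between the two calls concrete, here is a small standard-library sketch. The pattern is an assumption: it expects the /players/<letter>/<id>.shtml shape seen in the sample href and puts one capture group around the ID. The PurePosixPath line is a regex-free alternative added for comparison, not something from the answer.

```python
import re
from pathlib import PurePosixPath

href = '/players/l/lynnla01.shtml'

# Assumed pattern: one capture group around the ID portion of the path.
pattern = re.compile(r'/players/\w/(\w+)\.shtml')
match = pattern.match(href)

full_match = match.group()   # the whole matched string
player_id = match.group(1)   # only the first capture group

# Regex-free alternative: the file name without its extension.
stem_id = PurePosixPath(href).stem

print(full_match)  # /players/l/lynnla01.shtml
print(player_id)   # lynnla01
print(stem_id)     # lynnla01
```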
