
Here's my code:

for item in data:
    print(item.find_all('td')[2].find('a'))
    print(item.find('span').text.strip())
    print(item.find_all('td')[3].text)
    print(item.find_all('td')[2].find(target="_blank").string.strip())

It prints the following:

<a href="argument_transcripts/2016/16-399_3f14.pdf" 
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile" 
target="_blank">16-399. </a>

Perry v. Merit Systems Protection Bd.

04/17/17

16-399.

All I want from the href attribute is this part: 16-399_3f14

How can I do that? Thanks.

    What kinds of things have you tried? The re module provides powerful tools for extracting substrings from strings; however, this case is simple enough that you can probably do it with a couple of calls to str.split. Commented Jun 27, 2017 at 21:45

1 Answer


You can use find_all to pull the anchor elements that have an href attribute and then parse the href values for the information you are looking for.

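from bs4 import BeautifulSoup  # bs4 (BeautifulSoup 4) provides find_all

html = '''<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# href=True matches only anchors that actually carry an href attribute
for a in soup.find_all('a', href=True):
    url = a['href'].split('/')
    print(url[-1])  # last path segment, i.e. the file name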

This prints the last segment of the href path, which is the file name including its extension:

16-399_3f14.pdf
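
The question asks for 16-399_3f14 without the .pdf suffix, so one more step is needed. Here is a minimal sketch using only the standard library; the helper name transcript_id is made up for illustration, and it assumes the href always uses forward slashes and ends in a single extension.

```python
import posixpath


def transcript_id(href):
    """Return the file name from an href, without its extension.

    transcript_id is a hypothetical helper name, not part of BeautifulSoup.
    """
    filename = posixpath.basename(href)         # '16-399_3f14.pdf'
    stem, _ext = posixpath.splitext(filename)   # ('16-399_3f14', '.pdf')
    return stem


print(transcript_id("argument_transcripts/2016/16-399_3f14.pdf"))
# prints: 16-399_3f14
```

Inside the loop from the question, the href would presumably come from item.find_all('td')[2].find('a')['href'], which can be passed straight to this helper.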