How to extract partial text from href using BeautifulSoup in Python

Question

Here's my code:

for item in data:
    print(item.find_all('td')[2].find('a'))
    print(item.find('span').text.strip())
    print(item.find_all('td')[3].text)
    print(item.find_all('td')[2].find(target="_blank").string.strip())

It prints the following:

<a href="argument_transcripts/2016/16-399_3f14.pdf" 
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile" 
target="_blank">16-399. </a>

Perry v. Merit Systems Protection Bd.

04/17/17

16-399.

All I want from the href attribute is this part: 16-399_3f14

How can I do that? Thanks.

What kinds of things have you tried? The re module provides powerful tools for extracting substrings from strings; however, this case is simple enough that you can probably do it with a couple of calls to str.split.
– robru, Commented Jun 27, 2017 at 21:45
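
As a rough illustration of the two routes this comment mentions, here is a minimal sketch that works on the href string alone. The variable names are made up for the example, and it assumes the href always ends in .pdf and has no query string.

```python
import re

# The href value copied from the question's printed output
href = "argument_transcripts/2016/16-399_3f14.pdf"

# Route 1: two split calls. Take the last path segment,
# then cut off everything after the final dot.
filename = href.split("/")[-1]
stem_split = filename.rsplit(".", 1)[0]

# Route 2: a regular expression capturing the last path segment
# up to a trailing ".pdf" (assumes the extension is always .pdf).
match = re.search(r"([^/]+)\.pdf$", href)
stem_re = match.group(1) if match else None

print(stem_split)
print(stem_re)
```

Both routes give the same result for this href. The split version is probably easier to read; the regex version is stricter, since it returns None when the href does not end in .pdf.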

Joe.Ingalls · Accepted Answer · 2017-06-27 21:45:04Z

You can use find_all to pull the anchor elements that have an href attribute, and then parse the href values for the information you are looking for.

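from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

html = '''<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# href=True matches only anchors that actually carry an href attribute
for a in soup.find_all('a', href=True):
    url = a['href'].split('/')
    print(url[-1])  # last path segment, i.e. the file name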

This should output the last path segment of the href, which is the string you are looking for followed by the .pdf extension.

16-399_3f14.pdf
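
The output above still carries the .pdf extension, while the question asks for 16-399_3f14 only. One possible way to drop it is sketched below with the standard library's pathlib. The helper name href_stem is hypothetical, and the sketch assumes the href is a plain relative path without a query string or fragment.

```python
from pathlib import PurePosixPath


def href_stem(href):
    """Return the last path segment of href without its file extension.

    Hypothetical helper for illustration. PurePosixPath treats the href
    as a '/'-separated path and never touches the file system.
    """
    return PurePosixPath(href).stem


print(href_stem("argument_transcripts/2016/16-399_3f14.pdf"))
```

Inside the loop from the answer, this would amount to calling href_stem(a['href']) instead of splitting on '/'.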
