
How can I get the links if the tag is in this form?

<a href="/url?q=instagram.com/goinggourmet/… class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Going Gourmet Catering (@goinggourmet) - Instagram</div></h3><div class="BNeawe UPmit AP7Wnd">www.instagram.com › goinggourmet</div></a> 

I have tried the code below, and it helped me get the URLs, but they come in this format:

/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-

/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e

/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR

I need only the URLs from Facebook and Instagram, without any extra parts. What I mean is that I want the real link, not the redirect link.

I need something like this from the links above:

'https://www.facebook.com/bespokecatering.sydney' 'https://www.instagram.com/bespoke_catering'

div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)

Any help is much appreciated.

I tried the code below, but it returns empty or incorrect results:

div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
        for url in urls:
            try:
                j = url.split('=')[1]
                k = '/'.join(j.split('/')[0:4])
                #print(k)
            except:
                k = ''

1 Answer

You already have your <a> elements selected - just loop over the selection and print the results via ['href']:

div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(link['href'])

If you improve your question and add the additional information requested, we can answer in more detail.

EDIT

Answering your additional question with a simple example (something you should have provided in your question):

import requests
from bs4 import BeautifulSoup
result = '''
<div class="kCrYT">
    <a href="/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-"></a>
</div>
<div class="kCrYT">
    <a href="/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e"></a>
</div>
<div class="kCrYT">
    <a href="/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR"></a>
</div>
'''
soup = BeautifulSoup(result, 'lxml')

div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        # Take the 'q' parameter from the redirect URL and drop the
        # percent-encoded query string ('%3F' is an encoded '?')
        query = requests.utils.urlparse(link['href']).query
        params = dict(x.split('=') for x in query.split('&'))
        print(params['q'].split('%3F')[0])

Result:

https://bespokecatering.sydney/
https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/
https://www.instagram.com/bespoke_catering/
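As an aside, the standard library's `urllib.parse` can do the same extraction more robustly - `parse_qs` splits the query string and decodes percent-encoded characters for you, so the trailing query (e.g. the encoded `?hl=en`) can be dropped after decoding. This is a sketch of an alternative, not the answer's original code:

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the decoded 'q' parameter of a redirect link,
    with any trailing query string removed."""
    params = parse_qs(urlparse(href).query)   # parse_qs also unquotes values
    target = params.get('q', [''])[0]
    return target.split('?')[0]               # drop e.g. the decoded '?hl=en'

href = ('/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den'
        '&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg'
        '&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR')
print(extract_target(href))  # https://www.instagram.com/bespoke_catering/
```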



3 Comments

Thanks, that worked. But I have URLs like '''/url?q=facebook.com/bespokecatering.sydney/videos/…''' which don't work as URLs, so how can I trim all URLs to something like '''facebook.com/bespokecatering.sydney'''? Help is appreciated :)
There are a lot of ways ;) - Please improve/edit your question (not the comments) and post a URL/code so that I can reproduce it, that would be great.
Edited my question now - I'm new to the platform, so I'm still working it out. Please see if you can reproduce any results. TIA.
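The trimming asked about in the comments - reducing a full URL down to its profile root - could be sketched as follows. This assumes the desired form is scheme, host, and the first path segment only, which is an interpretation of the comment rather than something the answer specifies:

```python
from urllib.parse import urlparse

def trim_to_profile(url):
    """Keep only the scheme, host, and first path segment,
    e.g. a Facebook/Instagram profile root."""
    parts = urlparse(url)
    first_segment = parts.path.strip('/').split('/')[0]
    return f"{parts.scheme}://{parts.netloc}/{first_segment}"

url = ('https://www.facebook.com/bespokecatering.sydney/videos/'
       'lockdown-does-not-mean-unfulfilled-cravings/892336708293067/')
print(trim_to_profile(url))  # https://www.facebook.com/bespokecatering.sydney
```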
