How to extract html links with a matching word from a website using python

Question

I have a URL, say http://www.bbc.com/news/world/asia/. From this page only, I want to extract all the links that contain India, INDIA, or india (the match should be case-insensitive).

Clicking any of the output links should take me to the corresponding page. For example, these are a few of the link texts that contain india: India shock over Dhoni retirement and India fog continues to cause chaos. Clicking these links should take me to http://www.bbc.com/news/world-asia-india-30640436 and http://www.bbc.com/news/world-asia-india-30630274 respectively.

import urllib
from bs4 import BeautifulSoup, SoupStrainer
import re
import requests
url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
only_links = SoupStrainer('a', href=re.compile('india'))
print (only_links)

I wrote this very basic, minimal code in Python 3.4.2.

You want all hrefs that have india as a substring in them, right?
– Vivek Sable, Commented Jan 1, 2015 at 11:21

Martijn Pieters · Accepted Answer · 2015-01-01 11:25:20Z

You need to search for the word india in the displayed text. To do this you'll need a custom function instead:

from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content)

india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)

The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case insensitive) somewhere in the displayed text.

Note that I used the requests response object .content attribute; leave decoding to BeautifulSoup!

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content)
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
 <a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
 <a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
 <a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
 <a href="/news/world/asia/india/">India</a>,
 <a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
 <a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
 <a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
 <a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
 <a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]

Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here. We had to use the lambda search because a search with a text regular expression would not have found that element: the contained text (Special report: India Direct) is not the only child of the tag (there is an <img> element as well), so a text search would not match it.

A similar problem applies to the /news/world-asia-india-30632852 link; because of the nested <span> element, the Court boost to India BJP chief headline text is not a direct child of the link tag.
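To see the difference on a small scale, here is a minimal sketch that runs against an inline HTML snippet rather than the live BBC page. The snippet, the has_india_text helper name, and the explicit "html.parser" argument are illustrative assumptions, not part of the code above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the demo output above.
html = """
<a href="/news/world/asia/india/">India</a>
<a href="/news/special"><img alt="" src="x.jpg"/>Special report: India Direct</a>
<a href="/news/other">Pakistan floods</a>
<a>India without href</a>
"""
soup = BeautifulSoup(html, "html.parser")


def has_india_text(tag):
    """True for <a> tags with an href whose full text mentions 'india'."""
    return (tag.name == "a"
            and tag.has_attr("href")
            and "india" in tag.get_text().lower())


# string= is compared against tag.string, which is None when a tag has
# more than one child, so the <img> + text link is skipped here.
by_string = soup.find_all("a", href=True, string=re.compile("india", re.I))

# The function filter looks at get_text(), which gathers all nested text.
by_function = soup.find_all(has_india_text)

string_hrefs = [a["href"] for a in by_string]
function_hrefs = [a["href"] for a in by_function]
print(string_hrefs)
print(function_hrefs)
```

With this snippet, the string search should return only the plain India link, while the function filter should also pick up the /news/special link whose text sits next to an image.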

You can extract just the links with:

from urllib.parse import urljoin

result_links = [urljoin(url, tag['href']) for tag in results]

where all relative URLs are resolved relative to the original URL:

>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30647504',
 'http://www.bbc.com/news/world-asia-india-30640444',
 'http://www.bbc.com/news/world-asia-india-30640436',
 'http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30630274',
 'http://www.bbc.com/news/world-asia-india-30632852',
 'http://www.bbc.com/sport/0/cricket/30632182',
 'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
 'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
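How urljoin treats the different kinds of href can be checked without any network access. The following is a small sketch using only the standard library; the "india/" path-relative href is a made-up example, the other two values are taken from the output above.

```python
from urllib.parse import urljoin

base = "http://www.bbc.com/news/world/asia/"

hrefs = [
    # Root-relative: scheme and host come from the base, the path is replaced.
    "/news/world-asia-india-30640436",
    # Path-relative (hypothetical): appended to the base URL's directory.
    "india/",
    # Already absolute: returned unchanged, even though the host differs.
    "http://www.bbc.co.uk/news/world-radio-and-tv-15386555",
]

resolved = [urljoin(base, href) for href in hrefs]
for link in resolved:
    print(link)
```

This is why the bbc.co.uk link keeps its own host in the result list, while every other entry ends up under http://www.bbc.com.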

Hi Martijn, thanks for the help. I am seeing the results as you mentioned. How can I run this monitoring periodically, say every hour?
@sandy: that's too broad for comments to answer. There are many ways you can approach that based on your OS and other requirements, including a cron job or a Windows scheduler job.
@sandy: post a new question if you cannot find answers to your specific problems elsewhere.
Forgot to upvote, good point about the text argument. Thanks.
