Find "date" in generic webpage using Python

Question

I want to crawl the exact publish time for news articles published in the web.

Some webpage have nice and formatted header where I can extract "last-modified" or "publish-date", the information in the header is messy, but useable. (By the way, metadata_parser helps a lot!)

But larger news agency like BBC and CNN don't put date and time information in the html header. So I am trying to get date and publish time from the html code.

For BBC, the date time is embedded like:

<div data-timestamp-inserted="true" class="date date--v2" data-seconds="1447658338" data-datetime="16 November 2015">16 November 2015</div>

For CNN, it is like:

<p class="update-time">Updated 0137 GMT (0937 HKT) November 16, 2015 <span id="js-pagetop_video_source" class="video__source top_source">| Video Source: <a href="http://www.cnn.com/">CNN</a></span></p>

For nytimes,

<p class="byline-dateline"><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author" data-byline-name="AURELIEN BREEDEN" itemprop="name">AURELIEN BREEDEN</span>, </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><span class="byline-author" data-byline-name="KIMIKO DE FREYTAS-TAMURA" itemprop="name">KIMIKO DE FREYTAS-TAMURA</span> and </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person" itemid="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html"><a href="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html" rel="author" title="More Articles by KATRIN BENNHOLD"><span class="byline-author" data-byline-name="KATRIN BENNHOLD" itemprop="name">KATRIN BENNHOLD</span></a></span><time class="dateline" datetime="2015-11-16" itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time></p>

As can be seen, almost every news agency has their own way of putting data and time in the webpage.

My question is, is it possible to extract date time information using some kind of fuzzy search in BeautifulSoup and kind of package so I don't have to write rule for each website?

Thanks!

As You can see the Google's Search results pages, Not Every document has their release date. Because It is too hard to detect! There are two option. First, you have to make a rule based parser on every News Service. Or just inferencing. — han058
– han058, Commented Nov 18, 2015 at 4:44

Andrés Pérez-Albela H. · Accepted Answer · 2015-11-18 18:22:31Z

4

In my experience and humble opinion, the best way to scrape generic information is with NER (Named-Entity Recognition) systems.

I would recommend to use Scrapinghub's webstruct library:

Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, open hours, etc) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only on text data. This allows to define features that use HTML structure, and also to embed annotation results back into HTML.

Github repository: https://github.com/scrapinghub/webstruct

Documentation: http://webstruct.readthedocs.org/en/latest/

UPDATE:

As you need to scrape dates, you can also use Dateparser:

dateparser provides modules to easily parse localized dates in almost any string formats commonly found on web pages.

Github repository: https://github.com/scrapinghub/dateparser

Documentation: https://dateparser.readthedocs.org/en/latest/

edited Nov 18, 2015 at 18:22

answered Nov 18, 2015 at 4:44

Andrés Pérez-Albela H.

4,0211 gold badge21 silver badges29 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Sean Over a year ago

Thanks, I will try it! Nice way, I never thought about it in this way.

Andrés Pérez-Albela H. Over a year ago

@Sean, I'm glad my answer helped. It's better to do it this way instead of creating a specific pattern for every website.

Sean Over a year ago

Exactly, I am using Firefox extension to annotate the webpage right now. I trust this method. Defining rules is so middle-age.

Andrés Pérez-Albela H. Over a year ago

@Sean, see the updated section of my answer. It will help you as well.

Sean Over a year ago

Sorry, the problem is actually detect the data string. If the string is located, it is relatively easier to process.

|

Hrvoje · Accepted Answer · 2021-02-26 10:38:09Z

4

The htmldate module does just that, it is tested on different cases and features a series of robust heuristics so that you don't have to write code each time to scrape the date of the websites you're interested in.

It also uses dateparser to yield more precise results.

1. Install the package:

pip install htmldate

2. Retrieve a web page, parse it and output the date:

from htmldate import find_date

find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')

(disclaimer: I'm the author)

If the extraction doesn't work feel free to file a bug report on the issues page.

edited Feb 26, 2021 at 10:38

Hrvoje

15.4k11 gold badges103 silver badges121 bronze badges

answered Jan 20, 2020 at 15:48

adbar

935 bronze badges

Collectives™ on Stack Overflow

Find "date" in generic webpage using Python

2 Answers 2

6 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related