
I want to crawl the exact publish time of news articles published on the web.

Some web pages have a nicely formatted header from which I can extract "last-modified" or "publish-date". The information in the header is messy, but usable. (By the way, metadata_parser helps a lot!)

But larger news agencies like BBC and CNN don't put date and time information in the HTML header, so I am trying to get the publish date and time from the HTML code.

For BBC, the date time is embedded like:

<div data-timestamp-inserted="true" class="date date--v2" data-seconds="1447658338" data-datetime="16 November 2015">16 November 2015</div>

For CNN, it is like:

<p class="update-time">Updated 0137 GMT (0937 HKT) November 16, 2015 <span id="js-pagetop_video_source" class="video__source top_source">| Video Source: <a href="http://www.cnn.com/">CNN</a></span></p>

For nytimes,

<p class="byline-dateline"><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author" data-byline-name="AURELIEN BREEDEN" itemprop="name">AURELIEN BREEDEN</span>, </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><span class="byline-author" data-byline-name="KIMIKO DE FREYTAS-TAMURA" itemprop="name">KIMIKO DE FREYTAS-TAMURA</span> and </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person" itemid="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html"><a href="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html" rel="author" title="More Articles by KATRIN BENNHOLD"><span class="byline-author" data-byline-name="KATRIN BENNHOLD" itemprop="name">KATRIN BENNHOLD</span></a></span><time class="dateline" datetime="2015-11-16" itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time></p>

As can be seen, almost every news agency has its own way of putting the date and time in the web page.

My question is: is it possible to extract date and time information using some kind of fuzzy search in BeautifulSoup or a similar package, so that I don't have to write a rule for each website?

Thanks!

  • As you can see from Google's search results pages, not every document has a release date, because it is too hard to detect. There are two options: write a rule-based parser for each news service, or infer the date. Commented Nov 18, 2015 at 4:44

2 Answers

4

In my experience and humble opinion, the best way to scrape generic information is with NER (Named-Entity Recognition) systems.

I recommend Scrapinghub's webstruct library:

Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, open hours, etc) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only on text data. This allows to define features that use HTML structure, and also to embed annotation results back into HTML.

Github repository: https://github.com/scrapinghub/webstruct

Documentation: http://webstruct.readthedocs.org/en/latest/

UPDATE:

As you need to scrape dates, you can also use Dateparser:

dateparser provides modules to easily parse localized dates in almost any string formats commonly found on web pages.

Github repository: https://github.com/scrapinghub/dateparser

Documentation: https://dateparser.readthedocs.org/en/latest/
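
Before any date parser can be applied, the date string has to be located in the page. The following is only a minimal sketch of that step, using the Python standard library rather than webstruct or dateparser. It assumes the date sits either in one of a few attributes (like those in the BBC and NYT snippets from the question) or in visible text with an English month name (like the CNN snippet). The attribute list, the regular expression, and the names `DateCandidateCollector` and `find_first_date` are illustrative assumptions, not part of any library.

```python
import re
from datetime import date, datetime, timezone
from html.parser import HTMLParser

# Attributes that often carry machine-readable dates (assumed, non-exhaustive).
DATE_ATTRS = ("datetime", "content", "data-datetime", "data-seconds")

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_RE = "|".join(MONTHS)
# Matches "16 November 2015" or "November 16, 2015".
TEXT_DATE = re.compile(
    r"\b(\d{1,2}) (" + MONTH_RE + r") (\d{4})\b"
    + r"|\b(" + MONTH_RE + r") (\d{1,2}), (\d{4})\b"
)


def parse_text(text):
    """Return the first date written with an English month name, or None."""
    match = TEXT_DATE.search(text)
    if match is None:
        return None
    day, month, year, month2, day2, year2 = match.groups()
    if day is None:  # the "November 16, 2015" form matched
        day, month, year = day2, month2, year2
    try:
        return date(int(year), MONTHS.index(month) + 1, int(day))
    except ValueError:  # e.g. "31 February 2015"
        return None


def parse_attribute(name, value):
    """Interpret an attribute value as epoch seconds, an ISO date, or free text."""
    if name == "data-seconds" and value.isdigit():
        try:
            return datetime.fromtimestamp(int(value), tz=timezone.utc).date()
        except (ValueError, OverflowError, OSError):
            return None
    try:
        return datetime.strptime(value[:10], "%Y-%m-%d").date()
    except ValueError:
        return parse_text(value)


class DateCandidateCollector(HTMLParser):
    """Collects every date found in attributes or text, in document order."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in DATE_ATTRS and value:
                found = parse_attribute(name, value)
                if found is not None:
                    self.candidates.append(found)

    def handle_data(self, data):
        found = parse_text(data)
        if found is not None:
            self.candidates.append(found)


def find_first_date(html):
    """Return the first date candidate in the HTML, or None if nothing matched."""
    collector = DateCandidateCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates[0] if collector.candidates else None


cnn = '<p class="update-time">Updated 0137 GMT (0937 HKT) November 16, 2015</p>'
print(find_first_date(cnn))  # 2015-11-16
```

Taking the first candidate is a crude choice: a real page may mention other dates before the publish date, so some ranking of the candidates would probably be needed in practice.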


6 Comments

Thanks, I will try it! Nice way, I never thought about it in this way.
@Sean, I'm glad my answer helped. It's better to do it this way instead of creating a specific pattern for every website.
Exactly, I am using a Firefox extension to annotate the web pages right now. I trust this method. Defining rules is so medieval.
@Sean, see the updated section of my answer. It will help you as well.
Sorry, the problem is actually detecting the date string. Once the string is located, it is relatively easy to process.
4

The htmldate module does just that. It is tested on a variety of cases and features a series of robust heuristics, so you don't have to write new code each time you want to scrape the date from a website you're interested in.

It also uses dateparser to yield more precise results.

1. Install the package:

pip install htmldate

2. Retrieve a web page, parse it and output the date:

from htmldate import find_date

find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')

(disclaimer: I'm the author)

If the extraction doesn't work, feel free to file a bug report on the issues page.
