2

I have a set of news articles in JSON format and I am having problems parsing the date of the data. The problem is that once the articles were converted in JSON format, the date successfully got converted but also did the edition. Here is an illustration:

{"date": "December 31, 1995, Sunday, Late Edition - Final", "body": "AFTER a year of dizzying new heights for the market, investors may despair of finding any good stocks left. Navistar plans to slash costs by $112 million in 1996. Advanced Micro Devices has made a key acquisition. For the bottom-fishing investor, therefore, the big nail-biter is: Will the changes be enough to turn a company around? ", "title": "INVESTING IT;"}
{"date": "December 31, 1995, Sunday, Late Edition - Final", "body": "Few issues stir as much passion in so many communities as the simple act of moving from place to place: from home to work to the mall and home again.  It was an extremely busy and productive year for us, said Frank J. Wilson, the State Commissioner of Transportation. There's a sense of urgency to get things done. ", "title": "ROAD AND RAIL;"}
{"date": "December 31, 1996, Sunday, Late Edition - Final", "body": "Widespread confidence in the state's economy prevailed last January as many businesses celebrated their most robust gains since the recession. And Steven Wynn, the chairman of Mirage Resorts, who left Atlantic City eight years ago because of local and state regulations, is returning to build a $1 billion two-casino complex. ", "title": "NEW JERSEY & CO.;"}

Since I am aiming at counting the number of articles that contain certain words I loop the articles in the following way:

import json
import re
import pandas

for i in range(1995,2017):
    df = pandas.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
# Parse dates and set index
    df.date = pandas.to_datetime(df.date) # is giving me a problem
    df.set_index('date', inplace=True)

I am looking after orientation on how to tackle the problem in the most efficient way. I was thinking of something such "ignore anything that goes after the date of the week" when parsing the date. Is there such thing?

Thanks in advance

1 Answer 1

2

You can split column date by str.split, concanecate first and second column - month, day and year together (December 31 and 1995)and last call to_datetime:

for i in range(1995,2017):
    df = pandas.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    # Parse dates and set index
    #print (df)
    a = df.date.str.split(', ', expand=True)
    df.date = a.iloc[:,0] + ' ' + a.iloc[:,1]
    df.date = pandas.to_datetime(df.date) 
    df.set_index('date', inplace=True)
    print (df)

                                                    body  \
date                                                            
1995-12-31  AFTER a year of dizzying new heights for the m...   
1995-12-31  Few issues stir as much passion in so many com...   
1996-12-31  Widespread confidence in the state's economy p...   

                        title  
date                           
1995-12-31      INVESTING IT;  
1995-12-31     ROAD AND RAIL;  
1996-12-31  NEW JERSEY & CO.;  
Sign up to request clarification or add additional context in comments.

3 Comments

Glad can help you!
Hi @jezrael, I have encountered some situation where the year has no comma at the end: December 31, 1995 instead of December 31, 1995,. the parsing in this scenario seems to get stuck. Any ideas of how to solve it, thanks a lot!
Hmmm, do you think {"date": "December 31, 1995 Sunday, Late Edition - Final", "body": "AFTER a year of dizzying new heights for the market, investors may despair of finding any good stocks left. Navistar plans to slash costs by $112 million in 1996. Advanced Micro Devices has made a key acquisition. For the bottom-fishing investor, therefore, the big nail-biter is: Will the changes be enough to turn a company around? ", "title": "INVESTING IT;"} ?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.