    {"id": 814984317021495298, "date": "2016-12-30", "time": "18:59:37", "timezone": "-0400", "replies_count": 7708, "username": "im_theantitrump"}
    {"id": 814984316195311616, "date": "2016-12-30", "time": "18:59:37", "timezone": "-0400", "replies_count": 25772, "username": "bishyoucray2"}
    

My JSON file looks like the sample above. How can I create a pandas DataFrame with the "date" and "replies_count" columns, without duplicates and in ascending date order? My current code drops one of the header names and mixes up the date ordering: df['date'].value_counts()

  • What does your expected output look like for these two entries? Commented Jul 1, 2021 at 21:18

2 Answers


Use pd.read_json with lines=True then select the desired columns:

df = pd.read_json('test.json', lines=True)[['date', 'replies_count']]

df:

        date  replies_count
0 2016-12-30           7708
1 2016-12-30          25772

test.json:

 {"id": 814984317021495298, "date": "2016-12-30", "time": "18:59:37", "timezone": "-0400", "replies_count": 7708, "username": "im_theantitrump"}
 {"id": 814984316195311616, "date": "2016-12-30", "time": "18:59:37", "timezone": "-0400", "replies_count": 25772, "username": "bishyoucray2"}
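
If "without duplicates" is read as "one row per date", the selection above can be extended with a groupby, as a comment on this answer also suggests. The following is a minimal sketch under that assumption; it reads the two sample records from an in-memory string (the `json_lines` and `daily` names are only for illustration) so that it runs without a file on disk, and `as_index=False` is used so that both column headers are kept.

```python
from io import StringIO

import pandas as pd

# The two records from the question, one JSON object per line.
json_lines = (
    '{"id": 814984317021495298, "date": "2016-12-30", "time": "18:59:37", '
    '"timezone": "-0400", "replies_count": 7708, "username": "im_theantitrump"}\n'
    '{"id": 814984316195311616, "date": "2016-12-30", "time": "18:59:37", '
    '"timezone": "-0400", "replies_count": 25772, "username": "bishyoucray2"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
df = pd.read_json(StringIO(json_lines), lines=True)[["date", "replies_count"]]

# Assumption: duplicates are collapsed by summing replies per date.
# as_index=False keeps "date" as a regular column instead of the index.
daily = (
    df.groupby("date", as_index=False)["replies_count"]
      .sum()
      .sort_values("date")
)
print(daily)
```

Whether summing is the right way to collapse repeated dates depends on the expected output, which the question does not state.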

3 Comments

I had just learned json_normalize, so I forgot about read_json :-) lol +1 (and gg on your 15k)
Thank you... I'm unsure if this is correct since there's the "without duplicates and in ascending date order?" part that this answer doesn't address, but I don't know. Thank you for that as well (likewise congrats on your 5, 6, and 7k)!
Maybe you can add .groupby('date').sum().sort_index(ascending=True)

Use json_normalize:

# records = json.load(open('data.json'))
>>> records
[
  {"id": 814984317021495298, "date": "2016-12-30", "time": "18:59:37", "timezone": 
    "-0400", "replies_count": 7708, "username": "im_theantitrump"},
  {"id": 814984316195311616, "date": "2016-12-30", "time": "18:59:37", "timezone": 
    "-0400", "replies_count": 25772, "username": "bishyoucray2"}
]


# Simple extraction of the 2 columns
>>> pd.json_normalize(records)[['date', 'replies_count']]

         date  replies_count
0  2016-12-30           7708
1  2016-12-30          25772


# Without duplicates, with dates sorted in ascending order
>>> pd.json_normalize(records)[['date', 'replies_count']] \
      .groupby('date').sum().sort_index(ascending=True)

            replies_count
date
2016-12-30          33480
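
The groupby above merges every row that shares a date. If "without duplicates" instead means "drop rows that are exact repeats", a different approach may fit better. The sketch below assumes that reading; it also converts the date strings with pd.to_datetime, since json_normalize leaves them as plain strings. The third and fourth records are hypothetical and exist only to make the effect visible.

```python
import pandas as pd

records = [
    # The two records from the question (trimmed to the relevant keys).
    {"id": 814984317021495298, "date": "2016-12-30", "replies_count": 7708},
    {"id": 814984316195311616, "date": "2016-12-30", "replies_count": 25772},
    # Hypothetical: an exact repeat of the first record.
    {"id": 814984317021495298, "date": "2016-12-30", "replies_count": 7708},
    # Hypothetical: a made-up record with an earlier date.
    {"id": 1, "date": "2016-12-29", "replies_count": 5},
]

df = pd.json_normalize(records)[["date", "replies_count"]]
df["date"] = pd.to_datetime(df["date"])  # strings -> datetime64 for reliable sorting

deduped = (
    df.drop_duplicates()                       # drop rows identical in both columns
      .sort_values("date", kind="mergesort")   # stable sort keeps the order of ties
      .reset_index(drop=True)
)
print(deduped)
```

Note that drop_duplicates here compares only the two selected columns, so two different tweets with the same date and the same reply count would also be merged into one row.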
