
I'm collecting Twitter data (tweets plus metadata) into a MongoDB server, and now I want to do some statistical analysis. To get the data from MongoDB into a Pandas data frame I used the following code:

cursor = collection.find({}, {'id': 1, 'text': 1})
tweet_fields = ['id', 'text']
result = pd.DataFrame(list(cursor), columns=tweet_fields)

This way I successfully loaded the data into Pandas, which is great. Now I want to analyze the users who created the tweets, which is data I also collected. That data sits in a nested part of the JSON (I'm not 100% sure it is true JSON), for instance user.id, the id of the Twitter user account.

I can just add that to the cursor using dot notation:

cursor = collection.find({},{'id': 1, 'text': 1, 'user.id': 1})

But this results in NaN for that column. The problem lies in how the returned documents are structured:

A bit of the cursor output without user.id:

[{'_id': ObjectId('561547ae5371c0637f57769e'),
  'id': 651795711403683840,
  'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s'},
 {'_id': ObjectId('561547bf5371c0637f5776ac'),
  'id': 651795781557583872,
  'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.'},
 {'_id': ObjectId('561547ab5371c0637f57769c'),
  'id': 651795699881889792,
  'text': 'RT @ansichtssache47: Geht gefälligst arbeiten, die #Flüchtlinge haben Hunger! http://t.co/QxUYfFjZB5 #grenzendicht #rente #ZivilerUngehorsa…'}]

A bit of the cursor output with user.id:

[{'_id': ObjectId('561547ae5371c0637f57769e'),
  'id': 651795711403683840,
  'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s',
  'user': {'id': 223528499}},
 {'_id': ObjectId('561547bf5371c0637f5776ac'),
  'id': 651795781557583872,
  'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.',
  'user': {'id': 3544739837}}]

So, in short: how do I get the nested part of my collected data into a separate column of my Pandas data frame?
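To make the failure concrete, here is a minimal sketch using documents shaped like the cursor output above (the user ids are taken from the examples; the tweet texts are shortened placeholders, and _id is omitted):

```python
import pandas as pd

# Sample documents shaped like the cursor output above
docs = [
    {'id': 651795711403683840, 'text': 'tweet one', 'user': {'id': 223528499}},
    {'id': 651795781557583872, 'text': 'tweet two', 'user': {'id': 3544739837}},
]

# The DataFrame constructor only matches top-level keys, so the column
# name 'user.id' finds nothing and is filled with NaN:
df = pd.DataFrame(docs, columns=['id', 'text', 'user.id'])
print(df['user.id'].isna().all())  # True

# One manual workaround: pull the nested value out before building the frame
df = pd.DataFrame(docs, columns=['id', 'text'])
df['user.id'] = [d['user']['id'] for d in docs]
```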

2 Answers


I use a function like this to get nested JSON lines into a dataframe. It uses the handy pandas json_normalize function:

import json

import pandas as pd
from bson import json_util  # ships with pymongo
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

def mongo_to_dataframe(mongo_data):
    # serialize BSON types (ObjectId, dates, ...) and reload as plain dicts
    sanitized = json.loads(json_util.dumps(mongo_data))
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)

    return df

Just pass your mongo data by calling the function with it as an argument.

sanitized = json.loads(json_util.dumps(mongo_data)) converts the BSON returned by pymongo into regular JSON-compatible dicts

normalized = json_normalize(sanitized) un-nests the data, turning fields like user.id into flat columns

df = pd.DataFrame(normalized) wraps the result in a dataframe (json_normalize already returns one, so this line is optional)
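For example, on documents shaped like the cursor output in the question (plain dicts can feed json_normalize directly; the json_util round trip only matters for BSON types such as ObjectId, so it is skipped in this sketch, and the tweet texts are placeholders):

```python
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

docs = [
    {'id': 651795711403683840, 'text': 'tweet one', 'user': {'id': 223528499}},
    {'id': 651795781557583872, 'text': 'tweet two', 'user': {'id': 3544739837}},
]

# Nested dicts are flattened into dot-separated column names
df = json_normalize(docs)
print(sorted(df.columns))  # ['id', 'text', 'user.id']
```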


4 Comments

Following this solution, I would load ALL the data stored in MongoDB into the data frame, right? With the amount of data I already have, that doesn't seem like something I want to do. If I used what you propose, would I need to use the Mongo export-to-JSON function as input for mongo_data?
Yes, the way I use this function, I usually only query mongo for the data I need and load it all into a dataframe. I'm not sure I understand the second part of your question, but say all your mongo data is stored in a variable x. clean_df = mongo_to_dataframe(x) will give you a dataframe in clean_df of unnested mongo data.
You say you only query the data that you need before loading it into a dataframe. Do you mean that in my example I should use it as follows: sanitized = json.loads(json_util.dumps(collection.find({},{"id": 1, "text": 1, "user.id": 1}))) EDIT: This did the trick!! And the text parsing is even better than before!
God bless you!!

Use PyMongoArrow. This is a tool built by MongoDB for exactly this purpose: it efficiently moves data between MongoDB and other formats such as pandas DataFrames, NumPy arrays, and Apache Arrow Tables.

It also supports nested data and lets you optionally define the schema and data types of your data when converting.

