
I'm collecting Twitter data (tweets plus metadata) into a MongoDB server, and now I want to do some statistical analysis. To get the data from MongoDB into a Pandas data frame I used the following code:

cursor = collection.find({}, {'id': 1, 'text': 1})
tweet_fields = ['id', 'text']
result = pd.DataFrame(list(cursor), columns=tweet_fields)

This way I successfully loaded the data into Pandas, which is great. Now I want to analyze the users who created the tweets, which is data I also collected. That data sits in a nested part of the JSON (I'm not 100% sure it is true JSON), for instance user.id, the id of the Twitter user account.

I can just add that to the cursor using dot notation:

cursor = collection.find({},{'id': 1, 'text': 1, 'user.id': 1})

But this results in NaN for that column. The problem lies in how the returned documents are structured:

A bit of the cursor output without user.id:

[{'_id': ObjectId('561547ae5371c0637f57769e'),
  'id': 651795711403683840,
  'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s'},
 {'_id': ObjectId('561547bf5371c0637f5776ac'),
  'id': 651795781557583872,
  'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.'},
 {'_id': ObjectId('561547ab5371c0637f57769c'),
  'id': 651795699881889792,
  'text': 'RT @ansichtssache47: Geht gefälligst arbeiten, die #Flüchtlinge haben Hunger! http://t.co/QxUYfFjZB5 #grenzendicht #rente #ZivilerUngehorsa…'}]

A bit of the cursor output with user.id:

[{'_id': ObjectId('561547ae5371c0637f57769e'),
  'id': 651795711403683840,
  'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s',
  'user': {'id': 223528499}},
 {'_id': ObjectId('561547bf5371c0637f5776ac'),
  'id': 651795781557583872,
  'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.',
  'user': {'id': 3544739837}}]

So, in short: how do I get the nested part of my collected data into a separate column of my Pandas data frame?
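To make the failure concrete, here is a minimal sketch using documents shaped like the cursor output above (the user ids are taken from the examples; the tweet texts are shortened placeholders, and _id is omitted):

```python
import pandas as pd

# Sample documents shaped like the cursor output above
docs = [
    {'id': 651795711403683840, 'text': 'tweet one', 'user': {'id': 223528499}},
    {'id': 651795781557583872, 'text': 'tweet two', 'user': {'id': 3544739837}},
]

# The DataFrame constructor only matches top-level keys, so the column
# name 'user.id' finds nothing and is filled with NaN:
df = pd.DataFrame(docs, columns=['id', 'text', 'user.id'])
print(df['user.id'].isna().all())  # True

# One manual workaround: pull the nested value out before building the frame
df = pd.DataFrame(docs, columns=['id', 'text'])
df['user.id'] = [d['user']['id'] for d in docs]
```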

2 Answers


I use a function like this to get nested JSON lines into a dataframe. It uses the handy pandas json_normalize function:

import json

import pandas as pd
from bson import json_util  # ships with pymongo
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

def mongo_to_dataframe(mongo_data):
    # serialize BSON types (ObjectId, dates, ...) and reload as plain dicts
    sanitized = json.loads(json_util.dumps(mongo_data))
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)

    return df

Just pass your mongo data by calling the function with it as an argument.

sanitized = json.loads(json_util.dumps(mongo_data)) converts the BSON returned by pymongo into regular JSON-compatible dicts

normalized = json_normalize(sanitized) un-nests the data, turning fields like user.id into flat columns

df = pd.DataFrame(normalized) wraps the result in a dataframe (json_normalize already returns one, so this line is optional)
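For example, on documents shaped like the cursor output in the question (plain dicts can feed json_normalize directly; the json_util round trip only matters for BSON types such as ObjectId, so it is skipped in this sketch, and the tweet texts are placeholders):

```python
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

docs = [
    {'id': 651795711403683840, 'text': 'tweet one', 'user': {'id': 223528499}},
    {'id': 651795781557583872, 'text': 'tweet two', 'user': {'id': 3544739837}},
]

# Nested dicts are flattened into dot-separated column names
df = json_normalize(docs)
print(sorted(df.columns))  # ['id', 'text', 'user.id']
```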


4 Comments

Following this solution, I would load ALL the data stored in MongoDB into the data frame, right? With the amount of data I already have, that doesn't seem like something I want to do. If I used what you propose, would I need to use the Mongo export-to-JSON function as input for mongo_data?
Yes, the way I use this function, I usually only query mongo for the data I need and load it all into a dataframe. I'm not sure I understand the second part of your question, but say all your mongo data is stored in a variable x. clean_df = mongo_to_dataframe(x) will give you a dataframe in clean_df of unnested mongo data.
You say you only query the data that you need before loading it into a dataframe. Do you mean that in my example I should use it as follows: sanitized = json.loads(json_util.dumps(collection.find({},{"id": 1, "text": 1, "user.id": 1}))) EDIT: This did the trick!! And the text parsing is even better than before!
God bless you!!

Use PyMongoArrow. This is a tool built by MongoDB for exactly this purpose: it efficiently moves data between MongoDB and other formats such as pandas DataFrames, NumPy arrays, and Apache Arrow Tables.

It also supports nested data and lets you optionally define the schema and data types of your data when converting.

