Convert nested mongoDB document into flat pandas DataFrame (Array of objects within array of objects)

Question

I am trying to convert mongoDB documents into a flat pandas dataframe structure.

An example of my mongoDB collection structure:

data = collection.find_one({'ID':300})
print(data)

{'_id': "ObjectId('5cd932299f6b7d4c9b95af6c')",
 'ID': 300,
 'updated': 23424,
 'data': [
     { 'meta': 8,
       'data': [
           {'value1': 1, 'value2': 2}, 
           {'value1': 3, 'value2': 4}
       ]
     },
     { 'meta': 9,
       'data': [
           {'value1': 5, 'value2': 6}
       ]
     }
  ]
}

When i put this into a pandas dataframe, I get

df = pd.DataFrame(data)
print(df)

| _id                      | ID  | updated | data                                              
|
|--------------------------|-----|---------|------------------------ ---------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 8, 'data': [{'value1': 1, 'value2': 2... |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 9, 'data': [{'value1': 5, 'value2': 6}]} |

When I iterate through the dataframe with pd.concat I get

df.rename(columns={'data':'data1'}, inplace=True)
df2 = pd.concat([df, pd.DataFrame(list(df['data1']))], axis=1).drop('data1', 1)
df3 = pd.concat([df2, pd.DataFrame(list(df2['data']))], axis=1).drop('data', 1)
print(df3)

| _id                      | ID  | updated | meta | 0                          | 1                          |
|--------------------------|-----|---------|------|----------------------------|----------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 8    | {'value1': 1, 'value2': 2} | {'value1': 3, 'value2': 4} |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 9    | {'value1': 5, 'value2': 6} | None                       |

The lowest level objects of the lowest level array has always the same names.

Therefore I want:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

Am I on the wrong track?

What would be the most convenient way to solve this?

Karan Gautam · Accepted Answer · 2019-09-25 12:46:48Z

6

@sinB - You can further improve this by removing the for loop (It will cause issue when dealing with database with many documents). You don't need loop anyway as the result can be converted into pandas dataframe with single command.

Instead of this:

#add each doc as a new row in dataframe
for doc in collection.aggregate(pipeline): 
    df = df.append(doc,ignore_index=True)

You can use this

query_result = collection.aggregate(pipeline)
query_result = list(query_result)
df = pd.io.json.json_normalize(query_result)

answered Sep 25, 2019 at 12:46

Karan Gautam

1351 silver badge6 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

sinB · Accepted Answer · 2019-05-13 17:43:16Z

5

I realized that mongoDB can do all the heavy lifting.

Working code:

import pandas as pd
from pymongo import MongoClient
mongoClient = MongoClient('localhost', 27017)
db = mongoClient.DB_NAME
collection = db.COLLECTION_NAME

pipeline = [
    {'$match':{'ID':300}},
    {"$unwind":{'path': '$data', 'preserveNullAndEmptyArrays': True}},
    {"$unwind":{'path': '$data.data', 'preserveNullAndEmptyArrays': True}},
    {'$project':{
      'ID':'$ID',
      'updated':"$updated",
      'meta':"$data.meta",
      'value1':"$data.data.value1",
      'value2':"$data.data.value2"
    }}
]

#Make empty dataframe
df = pd.DataFrame() 

#add each doc as a new row in dataframe
for doc in collection.aggregate(pipeline): 
    df = df.append(doc,ignore_index=True)

print(df)

Output:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

answered May 13, 2019 at 17:43

sinB

1851 silver badge12 bronze badges

1 Comment

virtuvious Over a year ago

Thanks @sinB .... also the df could be directly created as df = pd.DataFrame(collection.aggregate(pipeline))

sinB · Accepted Answer · 2019-05-13 12:36:55Z

Well I managed to solve it in the most hideous way.

def flatten(data):
    a = {}
    def make_dict(data):
        for i in list(data):
            if isinstance(data[i], list):
                for j in data[i]:
                    make_dict(j)
            else:
                a.update({i:[]})
        return data

    def add_to_dict(data):
        for i in list(data):
            if isinstance(data[i], list):
                for j in data[i]:
                    add_to_dict(j)
            else:
                a[i].append(data[i])
        max = 0
        for i in a:
            if len(a[i]) > max:
                max = len(a[i])
        for i in a:
            if len(a[i]) < max:
                a[i].append(a[i][-1])

    make_dict(data)
    add_to_dict(data)
    return a


pd.DataFrame(flatten(data))

Output:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

I can't imagine this is a good solution, so please feel free to help me with a better one.

Collectives™ on Stack Overflow

Convert nested mongoDB document into flat pandas DataFrame (Array of objects within array of objects)

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related