4

I am trying to convert mongoDB documents into a flat pandas dataframe structure.

An example of my mongoDB collection structure:

data = collection.find_one({'ID':300})
print(data)

{'_id': "ObjectId('5cd932299f6b7d4c9b95af6c')",
 'ID': 300,
 'updated': 23424,
 'data': [
     { 'meta': 8,
       'data': [
           {'value1': 1, 'value2': 2}, 
           {'value1': 3, 'value2': 4}
       ]
     },
     { 'meta': 9,
       'data': [
           {'value1': 5, 'value2': 6}
       ]
     }
  ]
}

When i put this into a pandas dataframe, I get

df = pd.DataFrame(data)
print(df)

| _id                      | ID  | updated | data                                              
|
|--------------------------|-----|---------|------------------------ ---------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 8, 'data': [{'value1': 1, 'value2': 2... |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 9, 'data': [{'value1': 5, 'value2': 6}]} |

When I iterate through the dataframe with pd.concat I get

df.rename(columns={'data':'data1'}, inplace=True)
df2 = pd.concat([df, pd.DataFrame(list(df['data1']))], axis=1).drop('data1', 1)
df3 = pd.concat([df2, pd.DataFrame(list(df2['data']))], axis=1).drop('data', 1)
print(df3)

| _id                      | ID  | updated | meta | 0                          | 1                          |
|--------------------------|-----|---------|------|----------------------------|----------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 8    | {'value1': 1, 'value2': 2} | {'value1': 3, 'value2': 4} |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 9    | {'value1': 5, 'value2': 6} | None                       |

The lowest level objects of the lowest level array has always the same names.

Therefore I want:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

Am I on the wrong track?

What would be the most convenient way to solve this?

3 Answers 3

6

@sinB - You can further improve this by removing the for loop (It will cause issue when dealing with database with many documents). You don't need loop anyway as the result can be converted into pandas dataframe with single command.

Instead of this:

#add each doc as a new row in dataframe
for doc in collection.aggregate(pipeline): 
    df = df.append(doc,ignore_index=True)

You can use this

query_result = collection.aggregate(pipeline)
query_result = list(query_result)
df = pd.io.json.json_normalize(query_result)
Sign up to request clarification or add additional context in comments.

Comments

5

I realized that mongoDB can do all the heavy lifting.

Working code:

import pandas as pd
from pymongo import MongoClient
mongoClient = MongoClient('localhost', 27017)
db = mongoClient.DB_NAME
collection = db.COLLECTION_NAME

pipeline = [
    {'$match':{'ID':300}},
    {"$unwind":{'path': '$data', 'preserveNullAndEmptyArrays': True}},
    {"$unwind":{'path': '$data.data', 'preserveNullAndEmptyArrays': True}},
    {'$project':{
      'ID':'$ID',
      'updated':"$updated",
      'meta':"$data.meta",
      'value1':"$data.data.value1",
      'value2':"$data.data.value2"
    }}
]

#Make empty dataframe
df = pd.DataFrame() 

#add each doc as a new row in dataframe
for doc in collection.aggregate(pipeline): 
    df = df.append(doc,ignore_index=True)

print(df)

Output:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

1 Comment

Thanks @sinB .... also the df could be directly created as df = pd.DataFrame(collection.aggregate(pipeline))
0

Well I managed to solve it in the most hideous way.

def flatten(data):
    a = {}
    def make_dict(data):
        for i in list(data):
            if isinstance(data[i], list):
                for j in data[i]:
                    make_dict(j)
            else:
                a.update({i:[]})
        return data

    def add_to_dict(data):
        for i in list(data):
            if isinstance(data[i], list):
                for j in data[i]:
                    add_to_dict(j)
            else:
                a[i].append(data[i])
        max = 0
        for i in a:
            if len(a[i]) > max:
                max = len(a[i])
        for i in a:
            if len(a[i]) < max:
                a[i].append(a[i][-1])

    make_dict(data)
    add_to_dict(data)
    return a


pd.DataFrame(flatten(data))

Output:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

I can't imagine this is a good solution, so please feel free to help me with a better one.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.