
I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a DataFrame. However, I get an error:

MemoryError

My code looks like this:

from pandas import DataFrame

cursor = tweets.find()  # where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

I've tried the methods in the following answers, which at some point create a list of all the elements of the database before loading it.

However, another answer that discusses list() says it's only suitable for small data sets, because everything is loaded into memory.

In my case, I think that's the source of the error: there's too much data to fit into memory. What other method can I use?
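To make the goal concrete, something along these lines is what I have in mind: keep only the field I need while iterating, instead of materializing every full document with list(). This is just a rough sketch, not code I've actually run:

from pandas import DataFrame

cursor = tweets.find()  # where tweets is my collection

# Walk the cursor one document at a time and keep only the 'id' value,
# so the full tweet documents never all sit in memory at once.
ids = [doc['id'] for doc in cursor]
result = DataFrame(ids, columns=['id'])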

4 Answers


I've modified my code to the following:

cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

By adding the fields parameter to find() I restricted the output, which means I'm no longer loading every field into the DataFrame, only the selected ones. Everything works fine now.
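Note for newer installations: the fields keyword comes from the pymongo 2.x API. In pymongo 3.x and later the same restriction is done with the projection argument, so the equivalent would be roughly the following (a sketch, not tested against every driver version):

# pymongo 3.x+: 'fields' was replaced by 'projection'
cursor = tweets.find({}, projection=['id'])  # or projection={'id': 1, '_id': 0}
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)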


The fastest, and likely most memory-efficient, way to create a DataFrame from a MongoDB query like yours would be to use monary.

This post has a nice and concise explanation.

2 Comments

I've seen this pushed on Stack Overflow several times now, but it has almost always been misused. If you check the API, it seems pretty clear that the only data types allowed are numerical, not text. Given the term "tweet" here, I assume this is not a proper use case for monary.
The monary project, while a good idea, seems to be abandoned: the BitBucket repo is gone, there's a GitHub issue from March 2023 reporting that it doesn't work which has had no activity, and the last commit is nine years old (Aug 2014).

The from_records classmethod is probably the best way to do it:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find()  # or client.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)


An elegant way of doing it would be as follows:

import pandas as pd

def my_transform_logic(x):
    # placeholder for whatever per-value processing you need
    if x:
        return do_something(x)

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)

    # write the processed rows back as a list of dictionaries
    db.collection_name.insert_many(df.to_dict('records'))

    # or upsert each record individually, keyed on _id
    for record in df.to_dict('records'):
        db.collection_name.update_one({'_id': record['_id']}, {'$set': record}, upsert=True)

# make a list of cursors; see the parallel_scan API of pymongo
# (parallel_scan needs an older driver, it was removed in PyMongo 4.0)
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)

I tried the above process on a MongoDB collection with 2.6 million records, using Joblib to run the code in parallel. It didn't throw any memory errors, and the processing finished in 2 hours.
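Note that parallel_scan relies on MongoDB's parallelCollectionScan command, which was removed in MongoDB 4.2, and the method itself was dropped in PyMongo 4.0. On a current stack, the same per-chunk processing can be done over a single cursor, for example with an itertools-based batcher (a sketch; the batch size is arbitrary and process() is the function defined above, which also accepts a list of documents):

from itertools import islice

def batched(cursor, size=100_000):
    # Yield lists of documents so the whole collection never sits in memory at once
    while True:
        batch = list(islice(cursor, size))
        if not batch:
            return
        yield batch

for batch in batched(mongo_collection.find()):
    process(batch)  # build a DataFrame from this batch and write it back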
