
I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a DataFrame. However, I get an error:

MemoryError

My code looks like this:

from pandas import DataFrame

cursor = tweets.find()  # where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

I've tried the methods in the following answers, which at some point create a list of all the elements of the database before loading it.

However, another answer that discusses list() says it's only suitable for small data sets, because everything is loaded into memory.

In my case, I think that's the source of the error: there's too much data to fit into memory. What other method can I use?
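To make the goal concrete, something along these lines is what I have in mind: keep only the field I need while iterating, instead of materializing every full document with list(). This is just a rough sketch, not code I've actually run:

from pandas import DataFrame

cursor = tweets.find()  # where tweets is my collection

# Walk the cursor one document at a time and keep only the 'id' value,
# so the full tweet documents never all sit in memory at once.
ids = [doc['id'] for doc in cursor]
result = DataFrame(ids, columns=['id'])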

4 Answers


I've modified my code to the following:

cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

By adding the fields parameter to find() I restricted the output, which means I'm no longer loading every field into the DataFrame, only the selected ones. Everything works fine now.
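Note for newer installations: the fields keyword comes from the pymongo 2.x API. In pymongo 3.x and later the same restriction is done with the projection argument, so the equivalent would be roughly the following (a sketch, not tested against every driver version):

# pymongo 3.x+: 'fields' was replaced by 'projection'
cursor = tweets.find({}, projection=['id'])  # or projection={'id': 1, '_id': 0}
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)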


The fastest, and likely most memory-efficient, way to create a DataFrame from a MongoDB query like yours would be to use monary.

This post has a nice and concise explanation.

2 Comments

I've seen this pushed on Stack Overflow several times now, but it has almost always been misused. If you check the API, it seems pretty clear that the only data types allowed are numerical, not text. Given the term "tweet" here, I assume this is not a proper use case for monary.
The monary project, while a good idea, seems to be abandoned: the BitBucket repo is gone, there's a GitHub issue from March 2023 reporting that it doesn't work which has had no activity, and the last commit is nine years old (Aug 2014).

The from_records classmethod is probably the best way to do it:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find()  # or client.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)


An elegant way of doing it would be as follows:

import pandas as pd

def my_transform_logic(x):
    # placeholder for whatever per-value processing you need
    if x:
        return do_something(x)

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)

    # write the processed rows back as a list of dictionaries
    db.collection_name.insert_many(df.to_dict('records'))

    # or upsert each record individually, keyed on _id
    for record in df.to_dict('records'):
        db.collection_name.update_one({'_id': record['_id']}, {'$set': record}, upsert=True)

# make a list of cursors; see the parallel_scan API of pymongo
# (parallel_scan needs an older driver, it was removed in PyMongo 4.0)
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)

I tried the above process on a MongoDB collection with 2.6 million records, using Joblib to run the code in parallel. It didn't throw any memory errors, and the processing finished in 2 hours.
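Note that parallel_scan relies on MongoDB's parallelCollectionScan command, which was removed in MongoDB 4.2, and the method itself was dropped in PyMongo 4.0. On a current stack, the same per-chunk processing can be done over a single cursor, for example with an itertools-based batcher (a sketch; the batch size is arbitrary and process() is the function defined above, which also accepts a list of documents):

from itertools import islice

def batched(cursor, size=100_000):
    # Yield lists of documents so the whole collection never sits in memory at once
    while True:
        batch = list(islice(cursor, size))
        if not batch:
            return
        yield batch

for batch in batched(mongo_collection.find()):
    process(batch)  # build a DataFrame from this batch and write it back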
