
I'm trying to analyse a Facebook conversation with 150k messages (~40 MB) that I stored in MongoDB. I noticed that moving the data from MongoDB into pandas is slow (it takes ~25 s), and I found that data = [msg for msg in cursor] is the step slowing the process down.

Is there a faster way to transform the MongoDB cursor to a DataFrame?

Here is some of my code:

from pymongo import MongoClient
import pandas as pd

connection = MongoClient(MONGODB_URI)
database = connection[DBS_NAME]
messages = database['messages']
cursor = messages.find(projection=FIELDS)
data = [msg for msg in cursor]
df = pd.DataFrame(data)

I could also replace this step with df = pd.DataFrame(list(cursor)) or df = pd.DataFrame.from_records(cursor), but it still takes 25 s.

The reason this matters: I want to make graphs of who sent the most messages and put them on a website. I do the analysis in Python with Flask and send a JSON of the processed data to JavaScript. That way, every time someone visits the site, the data processing runs, and I don't want visitors to wait 25 s before the graphs appear.
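Independently of speeding up the cursor itself, the per-visit cost can be avoided by caching the processed result in the Flask process so the expensive MongoDB-to-pandas step runs at most once per interval rather than on every request. A minimal sketch; the loader function and the 600-second TTL are hypothetical stand-ins, not part of the original code:

```python
import json
import time

def cached(ttl_seconds):
    """Memoise a zero-argument loader so the expensive processing
    runs at most once per ttl_seconds instead of on every request."""
    def decorator(loader):
        state = {"value": None, "stamp": 0.0}
        def wrapper():
            now = time.monotonic()
            if state["value"] is None or now - state["stamp"] > ttl_seconds:
                state["value"] = loader()
                state["stamp"] = now
            return state["value"]
        return wrapper
    return decorator

# Hypothetical loader standing in for the ~25 s MongoDB -> pandas step;
# the Flask route would return message_stats_json() directly.
@cached(ttl_seconds=600)
def message_stats_json():
    # ... run the cursor/DataFrame processing here ...
    return json.dumps({"top_senders": []})
```

With this, only the first visitor after each TTL expiry pays the processing cost; everyone else gets the cached JSON.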

1 Answer

Do the aggregation in MongoDB instead of Flask.

You can delegate the heavy lifting to MongoDB. Bringing data out of MongoDB costs I/O, and you can't get the transfer down to sub-second times unless both your client and the MongoDB server can sustain ~40 MB/s.

db.getCollection('COLLECTION').aggregate([{$sortByCount: "$FIELD11"}, {$limit : 10}])

This runs in about 0.6 s on ~300k records.
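From Flask, the same pipeline can be run through pymongo and the (already small) result fed straight into pandas. A minimal sketch, assuming the sender field is called sender_name; the helper just reshapes whatever iterable of aggregation documents it is given, so it can be exercised without a live server:

```python
import pandas as pd

def top_senders_frame(agg_docs):
    # $sortByCount emits documents of the form {"_id": <group key>, "count": <n>},
    # already sorted by count descending; rename _id to something readable.
    return pd.DataFrame(agg_docs).rename(columns={"_id": "sender"})

# With a live server it would be driven like this (URI, database and
# field names are the question's placeholders):
# from pymongo import MongoClient
# messages = MongoClient(MONGODB_URI)[DBS_NAME]["messages"]
# pipeline = [{"$sortByCount": "$sender_name"}, {"$limit": 10}]
# df = top_senders_frame(messages.aggregate(pipeline))
```

Because the aggregation returns at most 10 tiny documents instead of 150k full messages, the Python-side conversion cost becomes negligible.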
