
I'm trying to analyse a Facebook conversation with 150k messages (~40 MB) that I stored in MongoDB. I noticed that moving the data from MongoDB into pandas is slow (it takes ~25 s), and I found that data = [msg for msg in cursor] is the step slowing the process down.

Is there a faster way to transform the MongoDB cursor to a DataFrame?

Here is some of my code:

from pymongo import MongoClient
import pandas as pd

connection = MongoClient(MONGODB_URI)
database = connection[DBS_NAME]
messages = database['messages']
cursor = messages.find(projection=FIELDS)
data = [msg for msg in cursor]
df = pd.DataFrame(data)

I could also replace this step with df = pd.DataFrame(list(cursor)) or df = pd.DataFrame.from_records(cursor), but it still takes 25 s.

The reason this matters: I want to make graphs of who sent the most messages and put them on a website. I do the analysis in Python with Flask and send a JSON of the processed data to JavaScript. That way, every time someone visits the site, the data processing runs, and I don't want visitors to wait 25 s before the graphs appear.
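Independently of speeding up the cursor itself, the per-visit cost can be avoided by caching the processed result in the Flask process so the expensive MongoDB-to-pandas step runs at most once per interval rather than on every request. A minimal sketch; the loader function and the 600-second TTL are hypothetical stand-ins, not part of the original code:

```python
import json
import time

def cached(ttl_seconds):
    """Memoise a zero-argument loader so the expensive processing
    runs at most once per ttl_seconds instead of on every request."""
    def decorator(loader):
        state = {"value": None, "stamp": 0.0}
        def wrapper():
            now = time.monotonic()
            if state["value"] is None or now - state["stamp"] > ttl_seconds:
                state["value"] = loader()
                state["stamp"] = now
            return state["value"]
        return wrapper
    return decorator

# Hypothetical loader standing in for the ~25 s MongoDB -> pandas step;
# the Flask route would return message_stats_json() directly.
@cached(ttl_seconds=600)
def message_stats_json():
    # ... run the cursor/DataFrame processing here ...
    return json.dumps({"top_senders": []})
```

With this, only the first visitor after each TTL expiry pays the processing cost; everyone else gets the cached JSON.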

1 Answer

Do the aggregation in MongoDB instead of Flask.

You can delegate the heavy lifting to MongoDB. Bringing data out of MongoDB costs I/O, and you can't get the transfer down to sub-second times unless both your client and the MongoDB server can sustain ~40 MB/s.

db.getCollection('COLLECTION').aggregate([{$sortByCount: "$FIELD11"}, {$limit : 10}])

This runs in about 0.6 s on ~300k records.
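From Flask, the same pipeline can be run through pymongo and the (already small) result fed straight into pandas. A minimal sketch, assuming the sender field is called sender_name; the helper just reshapes whatever iterable of aggregation documents it is given, so it can be exercised without a live server:

```python
import pandas as pd

def top_senders_frame(agg_docs):
    # $sortByCount emits documents of the form {"_id": <group key>, "count": <n>},
    # already sorted by count descending; rename _id to something readable.
    return pd.DataFrame(agg_docs).rename(columns={"_id": "sender"})

# With a live server it would be driven like this (URI, database and
# field names are the question's placeholders):
# from pymongo import MongoClient
# messages = MongoClient(MONGODB_URI)[DBS_NAME]["messages"]
# pipeline = [{"$sortByCount": "$sender_name"}, {"$limit": 10}]
# df = top_senders_frame(messages.aggregate(pipeline))
```

Because the aggregation returns at most 10 tiny documents instead of 150k full messages, the Python-side conversion cost becomes negligible.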
