I'm trying to analyse a Facebook conversation with 150k messages (~40MB) that I stored in MongoDB. I noticed that getting the data from MongoDB into pandas was slow (~25s), and I found that data = [msg for msg in cursor] is the step slowing the process down.
Is there a faster way to transform the MongoDB cursor to a DataFrame?
Here is some of my code:
from pymongo import MongoClient
import pandas as pd

connection = MongoClient(MONGODB_URI)
database = connection[DBS_NAME]
messages = database['messages']

# only fetch the fields I need
cursor = messages.find(projection=FIELDS)

# this is the slow step (~25s)
data = [msg for msg in cursor]
df = pd.DataFrame(data)
I could also replace this step with df = pd.DataFrame(list(cursor)) or df = pd.DataFrame.from_records(cursor), but it still takes ~25s.
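For reference, from_records accepts any iterable of dicts, so it can consume the cursor directly without building an intermediate list first; here's the call shape, with a stand-in generator in place of the real cursor:

```python
import pandas as pd

# Stand-in for the MongoDB cursor: any iterable of dicts works
fake_cursor = ({"sender": "user%d" % (i % 3), "text": "hi"} for i in range(9))

# from_records consumes the iterable row by row
df = pd.DataFrame.from_records(fake_cursor)
print(df.shape)  # (9, 2)
```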
I'm saying it's slow because I want to make graphs of who sent the most messages and show them on a website. The analysis runs in Python with Flask, and I send a JSON of the processed data to JavaScript. That means every time someone visits the site, the data processing runs again, and I don't want it to take 25s before the graphs appear.
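The "who sent the most messages" aggregation itself is cheap once the DataFrame exists; a minimal sketch, using a hypothetical sender column as a stand-in for my real fields:

```python
import pandas as pd

# Hypothetical stand-in for the real df built from the cursor above
df = pd.DataFrame({"sender": ["alice", "bob", "alice", "carol", "alice"]})

# Count messages per sender; this is the data behind the graphs
counts = df["sender"].value_counts()
print(counts.to_dict())  # {'alice': 3, 'bob': 1, 'carol': 1}
```

counts.to_dict() is the kind of thing I serialize to JSON for the JavaScript side, so the 25s is really all in loading the cursor, not in the analysis.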