
In another question, people insert a Pandas DataFrame into MongoDB by going through Python built-in structures (dict, list): Insert a Pandas Dataframe into mongodb using PyMongo

I wonder whether we can instead insert a NumPy record array (numpy.recarray) into MongoDB using PyMongo.

That should probably be more efficient, because pandas.DataFrame.to_dict uses Python for loops, which take a very long time on large volumes of data; see

see https://github.com/pydata/pandas/blob/c45dc762655d7109362fecea05584c72351fdc83/pandas/core/frame.py#L854
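
The dict-based approach from that question looks roughly like this (a sketch; depending on the pandas and PyMongo versions, the NumPy scalar values may first need casting to native Python types):

import pandas as pd
import pymongo

client = pymongo.MongoClient()
collection = client['db_name']['collection_name']
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

records = df.to_dict('records')   # row-wise conversion in a pure-Python loop
collection.insert_many(records)   # bulk insert of plain dicts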

In [1]: import pandas as pd
In [2]: import pymongo
In [3]: client = pymongo.MongoClient()
In [4]: collection = client['db_name']['collection_name']
In [5]: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
In [6]: df
Out[6]:
   a  b  c
0  1  2  3
1  4  5  6
In [7]: rec = df.to_records()
In [8]: rec
Out[8]:
rec.array([(0, 1, 2, 3), (1, 4, 5, 6)],
          dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
In [9]: type(rec)
Out[9]: numpy.recarray

but I faced errors on insert. This

In [10]: collection.insert(rec)

raised

ValueError: no field of name _id

and this

In [11]: collection.insert_many(rec)

raised

TypeError: documents must be a non-empty list

and this

In [12]: collection.insert_one(rec)

raised

TypeError: document must be an instance of dict, bson.son.SON, or other type that inherits from collections.MutableMapping
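
For completeness, a per-row conversion to plain dicts does insert (a sketch: rec.tolist() is used because PyMongo's BSON encoder rejects NumPy scalar types, and the 'index' field from to_records() comes along too), but that is exactly the loop I am trying to avoid:

docs = [dict(zip(rec.dtype.names, row)) for row in rec.tolist()]
collection.insert_many(docs)   # works, but converts every row in Python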

Any idea?

1 Answer


Odo can do this:

In [1]: import pandas as pd
In [2]: import pymongo
In [3]: client = pymongo.MongoClient()
In [4]: collection = client['db_name']['collection_name']

In [5]: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
In [6]: rec = df.to_records(index=False)

In [7]: from odo import odo
In [8]: odo(rec, collection)  # migrate recarray into collection
Out[8]: Collection(Database(MongoClient('localhost', 27017), 'db_name'), 'collection_name')

In [9]: list(collection.find())
Out[9]: 
[{'_id': ObjectId('56801e0bfb5d1b19ff9b9dd3'), 'a': 1, 'b': 2, 'c': 3},
 {'_id': ObjectId('56801e0bfb5d1b19ff9b9dd4'), 'a': 4, 'b': 5, 'c': 6}]

However, it just goes through an iterator of dictionaries (and so is as inefficient as the other solutions in this regard). If you really want to send binary data over the wire efficiently, you should look at monary.

But for loops aren't necessarily the bottleneck here. I highly recommend doing some simple benchmarking to verify that converting to Python data structures is actually the bottleneck of your application. You may be optimizing prematurely.
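
For example, a rough timing sketch along these lines (sizes are hypothetical) shows how much time goes into the conversion itself, which you can then compare against the wall-clock time of the insert_many call on the same data:

import timeit

setup = ("import numpy as np, pandas as pd; "
         "df = pd.DataFrame(np.random.randn(100000, 3), columns=list('abc'))")

# cost of the pure-Python row-wise conversion to dicts
print(timeit.timeit("df.to_dict('records')", setup=setup, number=10))
# cost of converting to a record array instead
print(timeit.timeit("df.to_records(index=False)", setup=setup, number=10))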


2 Comments

I wonder what path odo uses to achieve this? According to what you are saying, it's not using monary.
Last time I was involved with odo, it would have converted to an iterator of Python dicts.
