I would like to index a bunch of large pandas dataframes (a few million rows and 50 columns each) into Elasticsearch.
When looking for examples of how to do this, most people use elasticsearch-py's bulk helper method, passing it an instance of the Elasticsearch class (which handles the connection) along with a list of dictionaries created with pandas' dataframe.to_dict(orient='records') method. Metadata can be inserted into the dataframe beforehand as new columns, e.g. df['_index'] = 'my_index', etc.
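For reference, a minimal sketch of that common approach might look roughly like this (assuming an elasticsearch-py version contemporary with mapping types and a cluster running on localhost):

import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # defaults to localhost:9200

df = pd.DataFrame.from_records([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
df['_index'] = 'my_index'  # metadata added as extra columns
df['_type'] = 'my_type'

# every row becomes one dict; the bulk helper pulls out the metadata
# keys and serializes the remaining fields as the document source
helpers.bulk(es, df.to_dict(orient='records'))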
However, I have reasons not to use the elasticsearch-py library and would like to talk to the Elasticsearch bulk API directly, e.g. via requests or another convenient HTTP library. Besides, df.to_dict() is unfortunately very slow on large dataframes, and converting a dataframe to a list of dicts that elasticsearch-py then serializes to JSON sounds like unnecessary overhead when there is dataframe.to_json(), which is pretty fast even on large dataframes.
What would be an easy and quick way to get a pandas dataframe into the format required by the bulk API? I think a step in the right direction is using dataframe.to_json() as follows:
>>> import pandas as pd
>>> df = pd.DataFrame.from_records([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}])
>>> df
   a  b
0  1  2
1  3  4
2  5  6
>>> df.to_json(orient='records', lines=True)
'{"a":1,"b":2}\n{"a":3,"b":4}\n{"a":5,"b":6}'
This is now a newline-delimited JSON string; however, it is still lacking the metadata lines. What would be a performant way to get them in there?
edit: For completeness, a metadata JSON document would look like this:
{"index": {"_index": "my_index", "_type": "my_type"}}
Hence, in the end, the whole request body expected by the bulk API would look like this (with an additional newline after the last line):
{"index": {"_index": "my_index", "_type": "my_type"}}
{"a":1,"b":2}
{"index": {"_index": "my_index", "_type": "my_type"}}
{"a":3,"b":4}
{"index": {"_index": "my_index", "_type": "my_type"}}
{"a":5,"b":6}