I need to get a lot of data from Elasticsearch (es), so I'm using the scan command which is a wrap-up for the native es scroll command.
As a result I will get the following generator Object: <generator object scan at 0x000001BF5A25E518>. Farther more, I'd like to insert all the data into a Pandas DataFrame object so I can easily process it.
Code goes as follows:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan as escan
import pandas as pd
es = Elasticsearch(dpl_server, verify_certs=False)
body = {
"size": 1000,
"query": {
"match_all": {}
}
}
response = escan(client=es,
index="index-*,
query=body, request_timeout=30, size=1000)
print(response)
#<generator object scan at 0x000001BF5A25E518>
What I want to do is putting all the results in Pandas DataFrame. If I print each element in the generator as follows:
for res in response:
print(res['_source'])
# { .... }
# { .... }
# { .... }
I will get a lot of dictionaries. A naive solution of mine so far is to add them 1 by 1 like so:
df = None
for res in response:
if (df is None):
df = pd.DataFrame([res['_source']])
else:
df = pd.concat([df, pd.DataFrame([res['_source']])], sort=True)
I wish to know if there's a better way in doing so (first, in terms of speed, second, in terms of clean code). For instance, would it be better to accumulate all the results from the generator into a list and then build a complete DataFrame ?