I am trying to use the Python Elasticsearch library to read Elasticsearch documents and place them in a Spark DataFrame. I am able to connect and query using the `scan` helper function, since the query will generate about 2M documents (rows in my DataFrame). The issue I am running into is getting the query results into a Spark DataFrame.
This code produces a generator:
result = elasticsearch.helpers.scan(es, index=index, doc_type='_doc', query=query)
I was trying to use a for loop to collect the generated data into a dictionary:
data = {}
for item in result:
    data.append((item['_source']['someField'], item['_source']['someField']))
return data
but this raises an error, since a dictionary does not have an `append` method.
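For what it's worth, collecting the tuples into a list instead of a dict does work, and a list of tuples is something `spark.createDataFrame` can consume directly. A minimal sketch of that fix (the generator below is a stand-in for the `scan` result, and `someField` is the placeholder field name from my code above):

```python
# Sketch: collect (value, value) tuples from the scan generator into a list.
# fake_scan() simulates elasticsearch.helpers.scan(...), which yields hit dicts.
def fake_scan():
    for i in range(3):
        yield {"_source": {"someField": i}}

# Use a list (which has append) rather than a dict; a comprehension is idiomatic.
rows = [(hit["_source"]["someField"], hit["_source"]["someField"])
        for hit in fake_scan()]

# rows is now [(0, 0), (1, 1), (2, 2)] for this simulated input.
# On Databricks, a SparkSession named `spark` already exists, so the last step
# would be: df = spark.createDataFrame(rows, ["colA", "colB"])
```

I am aware this still materializes all 2M rows in driver memory, which is part of why I am asking whether there is a better way.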
Is there a better way to collect this generated data into a Spark DataFrame? Note: I am also working on the Databricks platform, if that helps.
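One alternative I have seen mentioned (untested sketch on my side) is the elasticsearch-hadoop Spark connector, which lets Spark read the index in parallel instead of funneling all 2M documents through the driver via the Python client. This assumes the elasticsearch-hadoop jar is attached to the Databricks cluster, and the host, port, and index values below are placeholders:

```python
# Sketch: read an Elasticsearch index directly into a Spark DataFrame using
# the elasticsearch-hadoop connector (jar must be attached to the cluster).
# "my-es-host", "9200", and `index` are placeholders for real values.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "my-es-host")   # placeholder Elasticsearch host
      .option("es.port", "9200")          # placeholder port
      .option("es.query", query_json)     # the query as a JSON string
      .load(index))                       # index name from the question
```

If this is the recommended route, I would also appreciate pointers on how it handles the query I am currently passing to `scan`.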