python create dataframe from elasticsearch result

Question

I have a query result from Elasticsearch in the following format:

[

{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},

]

I am trying to load the _source fields, together with _id, into a DataFrame.

I tried this:

def fetch_records_from_elasticsearch_index(index, filter_json):
    search_param = prepare_es_body(filter_json_dict=filter_json)
    response = settings.ES.search(index=index, body=search_param, size=10)

    if len(response['hits']['hits']) > 0:
        import pandas as pd

        all_hits = response['hits']['hits']
        # return all_hits
        # export es hits to pandas dataframe
        df = pd.concat(map(pd.DataFrame.from_dict, all_hits), axis=1)['_source'].T

        return df
    else:
        return 0

df contains only the _source fields, but I also want to add the _id field to it.

Here's the df output format:

{

"AdminEdit": [
    "False",
    "False",
    "False",
    "False",        
],
"Group": [
    "Grp2",
    "Grp2",
    "Grp2",
    "Grp2"       
],

}

How can I add _id to it?

from pandas.io import json; print (json.json_normalize(response))?
– Henry Yik, Commented May 26, 2020 at 9:19

Bishwo Adhikari · Accepted Answer · 2020-05-26 10:01:35Z

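There are two ways to solve this.

Direct approach (pd.json_normalize, available at the top level in pandas 1.0 and later, flattens each hit so that _id becomes a column and the nested fields become columns named _source.<field>):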

import pandas as pd
df = pd.json_normalize(all_hits)

Improvement to your code (keep your concat expression and add _id as an extra column):

import pandas as pd
df = pd.concat(map(pd.DataFrame.from_dict, all_hits), axis=1)['_source'].T
df["_id"] = [i["_id"] for i in all_hits]

The JSON used is:

all_hits = [

{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdg",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},

]
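
If only _id and the _source fields are wanted, with no `_source.` prefix on the column names, another option is to merge the two into one flat dict per hit before building the DataFrame. The following is a minimal sketch, not part of the code above: the helper name `hits_to_frame` is hypothetical, and the sample is a trimmed copy of the hits shown above.

```python
import pandas as pd

# Trimmed sample shaped like the hits above (only a few _source fields kept).
all_hits = [
    {"_index": "product", "_type": "_doc", "_id": "23234sdg", "_score": 2.2295187,
     "_source": {"pid": "394", "r_gtin": "00838128000547", "title_match": "45.45"}},
    {"_index": "product", "_type": "_doc", "_id": "23234sdf", "_score": 2.2295187,
     "_source": {"pid": "394", "r_gtin": "00838128000547", "title_match": "45.45"}},
]


def hits_to_frame(hits):
    """Hypothetical helper: one row per hit, holding _id plus every _source field."""
    # Note: a key named "_id" inside _source would overwrite the hit's own _id.
    rows = [{"_id": hit["_id"], **hit["_source"]} for hit in hits]
    return pd.DataFrame(rows)


df = hits_to_frame(all_hits)
print(df)
```

With an empty list of hits this returns an empty DataFrame, so a caller can always expect the same return type.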

Anshul · Accepted Answer · 2020-05-26 09:33:03Z

I tried this:

response = '''
[
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
}
]
'''

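import json
import pandas as pd

data = json.loads(response)
# pd.json_normalize is the top-level function in pandas 1.0+;
# importing json_normalize from pandas.io.json is deprecated there.
df = pd.json_normalize(data)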
print(df.columns)

These are the columns in the resulting DataFrame (the column order may vary with the pandas version):

Index(['_id', '_index', '_score', '_source.SERP_KEY',
       '_source.additional_attributes_remarks', '_source.confidence_score',
       '_source.pid', '_source.r_category', '_source.r_gtin',
       '_source.r_variant_info', '_source.s_asin', '_source.s_gtin',
       '_source.title_match', '_type'],
      dtype='object')
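
To end up with just _id and the _source fields, the normalized frame can be narrowed afterwards. The sketch below assumes pandas 1.0 or later (for `pd.json_normalize`); the names `PREFIX`, `strip_prefix` and `result` are illustrative, and the sample is a trimmed version of the response above.

```python
import pandas as pd

# Trimmed sample shaped like the response above.
hits = [
    {"_index": "product", "_type": "_doc", "_id": "23234sdf", "_score": 2.2295187,
     "_source": {"pid": "394", "confidence_score": "2.4545"}},
    {"_index": "product", "_type": "_doc", "_id": "23234sdg", "_score": 2.2295187,
     "_source": {"pid": "394", "confidence_score": "2.4545"}},
]

PREFIX = "_source."


def strip_prefix(name):
    """Drop the leading '_source.' from a column name, if present."""
    return name[len(PREFIX):] if name.startswith(PREFIX) else name


df = pd.json_normalize(hits)

# Keep _id plus every flattened _source column, then shorten the names.
keep = ["_id"] + [col for col in df.columns if col.startswith(PREFIX)]
result = df[keep].rename(columns=strip_prefix)
print(result.columns.tolist())
```

The values stay as strings here, as in the response; numeric fields such as confidence_score would still need an explicit conversion (for example with pd.to_numeric) if numbers are wanted.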

2 Comments

Azima Over a year ago

yes.. the response and output fields may differ. That's why I said format.

Anshul Over a year ago

IIUC, you wanted the _id and _source fields in the output, I guess the above solution is getting them now?
