I have a query result from Elasticsearch in the following format:

[

{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},

]

I am trying to load the _source fields, together with _id, into a pandas DataFrame.

I tried this:

import pandas as pd

def fetch_records_from_elasticsearch_index(index, filter_json):
    search_param = prepare_es_body(filter_json_dict=filter_json)
    response = settings.ES.search(index=index, body=search_param, size=10)

    all_hits = response['hits']['hits']
    if len(all_hits) > 0:
        # export es hits to pandas dataframe (keeps only the _source fields)
        df = pd.concat(map(pd.DataFrame.from_dict, all_hits), axis=1)['_source'].T

        return df
    else:
        return 0

df contains only the _source fields, but I also want the _id field in it.

Here's the df output format:

{

"AdminEdit": [
    "False",
    "False",
    "False",
    "False",        
],
"Group": [
    "Grp2",
    "Grp2",
    "Grp2",
    "Grp2"       
],

}

How can I add _id to it?

from pandas.io import json; print (json.json_normalize(response))? Commented May 26, 2020 at 9:19

2 Answers

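There are two ways to do this:

  1. Flatten the hits directly. _id becomes a column of its own, and the nested _source keys become columns prefixed with _source.:

    import pandas as pd
    df = pd.json_normalize(all_hits)
    
  2. Keep your code and add the _id column afterwards:

    import pandas as pd
    df = pd.concat(map(pd.DataFrame.from_dict, all_hits), axis=1)['_source'].T
    # one _id per hit, in the same order as the rows of df
    df["_id"] = [hit["_id"] for hit in all_hits]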
    

The JSON used is:

all_hits = [

{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdg",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},

]
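
As a possible third variant, you could build the rows yourself and skip the concat/transpose step. This is only a sketch: sample_hits is a trimmed, illustrative stand-in for response['hits']['hits'], and hits_to_frame is a hypothetical helper name, not something from pandas or the Elasticsearch client.

```python
import pandas as pd

# Trimmed sample in the shape of response['hits']['hits'];
# the values are illustrative only.
sample_hits = [
    {"_index": "product", "_id": "23234sdg", "_score": 2.2295187,
     "_source": {"pid": "394", "r_gtin": "00838128000547", "title_match": "45.45"}},
    {"_index": "product", "_id": "23234sdf", "_score": 2.2295187,
     "_source": {"pid": "394", "r_gtin": "00838128000547", "title_match": "45.45"}},
]

def hits_to_frame(hits):
    """Hypothetical helper: one row per hit, holding _id plus every _source key."""
    rows = [{"_id": hit["_id"], **hit["_source"]} for hit in hits]
    return pd.DataFrame(rows)

df = hits_to_frame(sample_hits)
print(df.columns.tolist())
```

The column names stay exactly as they are in _source, with _id as the first column, so nothing has to be renamed afterwards.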

I tried this:

response = '''
[
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
},
{
    "_index": "product",
    "_type": "_doc",
    "_id": "23234sdf",
    "_score": 2.2295187,
    "_source": {
        "SERP_KEY": "",
        "r_variant_info": "",
        "s_asin": "",
        "pid": "394",
        "r_gtin": "00838128000547",        
        "additional_attributes_remarks": "publisher:0|size:0",            
        "s_gtin": "",            
        "r_category": "",
        "confidence_score": "2.4545",      
        "title_match": "45.45"
    }
}
]
'''

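import json
import pandas as pd

data = json.loads(response)
# pd.json_normalize (pandas 1.0+) replaces the deprecated pandas.io.json.json_normalize
df = pd.json_normalize(data)
print(df.columns)

These are the columns in the resulting DataFrame (their order can differ between pandas versions):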

Index(['_id', '_index', '_score', '_source.SERP_KEY',
       '_source.additional_attributes_remarks', '_source.confidence_score',
       '_source.pid', '_source.r_category', '_source.r_gtin',
       '_source.r_variant_info', '_source.s_asin', '_source.s_gtin',
       '_source.title_match', '_type'],
      dtype='object')
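
If you only want _id and the _source fields, without the _source. prefix, something along these lines might work. It is a sketch under assumptions: sample_hits is a trimmed, illustrative stand-in for the real hits, and PREFIX, flat and result are just names picked for this example.

```python
import pandas as pd

# Trimmed sample in the shape of response['hits']['hits'];
# the values are illustrative only.
sample_hits = [
    {"_index": "product", "_type": "_doc", "_id": "23234sdg", "_score": 2.2295187,
     "_source": {"pid": "394", "r_gtin": "00838128000547", "title_match": "45.45"}},
    {"_index": "product", "_type": "_doc", "_id": "23234sdf", "_score": 2.2295187,
     "_source": {"pid": "394", "r_gtin": "00838128000547", "title_match": "45.45"}},
]

PREFIX = "_source."

# Flatten every hit; nested keys come out as "_source.<key>" columns.
flat = pd.json_normalize(sample_hits)

# Keep _id plus the flattened _source columns; drop _index, _type, _score.
source_cols = [col for col in flat.columns if col.startswith(PREFIX)]
result = flat[["_id"] + source_cols]

# Strip the prefix so the columns carry the original field names.
result = result.rename(
    columns=lambda col: col[len(PREFIX):] if col.startswith(PREFIX) else col
)

print(result.columns.tolist())
```

Selecting the columns by name rather than by position keeps this independent of the column order that json_normalize produces.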

2 Comments

Yes, the response and output fields may differ. That's why I said "format".
IIUC, you wanted the _id and _source fields in the output. I guess the above solution is getting them now?
