
I am brand new to using Elasticsearch and I'm having an issue getting all results back when I run an Elasticsearch query through my Python script. My goal is to query an index ("my_index" below), take those results, and put them into a pandas DataFrame which goes through a Django app and eventually ends up in a Word document.

My code is:

from elasticsearch import Elasticsearch

es = Elasticsearch()
logs_index = "my_index"
# my_query is built elsewhere from the user's search input
logs = es.search(index=logs_index, body=my_query)

and it tells me I have 72 hits, but then when I do:

df = logs['hits']['hits']
len(df)

It says the length is only 10. I saw someone had a similar issue on this question but their solution did not work for me.

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
logs_index = "my_index"
search = Search(using=es)
total = search.count()
search = search[0:total]
logs = es.search(index=logs_index, body=my_query)
len(logs['hits']['hits'])

The len function still says I only have 10 results. What am I doing wrong, or what else can I do to get all 72 results back?

ETA: I am aware that I can just add "size": 10000 to my query body to stop it from truncating to 10 results, but since the user will be entering their own search query, I need a way to do this outside the query body itself.

2 Comments

  • Can you please clarify your last edit? I'm not sure what the search query has to do with the size parameter. Are you referring to the problem of not knowing how many results the query will return versus a static size you might define? Commented Dec 11, 2018 at 18:16
  • Since it's your first post, please read this so you know how to react to answers: stackoverflow.com/help/someone-answers Commented Dec 11, 2018 at 18:29

4 Answers


You need to pass a size parameter to your es.search() call.

Please read the API Docs

size – Number of hits to return (default: 10)

An example:

es.search(index=logs_index, body=my_query, size=1000)

Please note that this is not an optimal way to retrieve all documents in an index, or the results of a query that returns many documents. For that you should use a scroll operation, which the client exposes through the scan() helper, also documented in the API docs.

You can also read about it in the Elasticsearch documentation.
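For reference, a minimal sketch of the scan() helper, assuming the same es client, index name, and my_query dict from the question:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# scan() wraps the scroll API and yields every matching hit,
# regardless of the 10-hit default page size.
all_hits = list(helpers.scan(es, index="my_index", query=my_query))
len(all_hits)  # all 72 hits, not just the first 10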


4 Comments

I thought the size could only be within my_query, thank you for clarifying! I know it's not the best practice, but I need it to just work for now and I can look into scroll later. Thank you!
If for some reason you need to implement a basic (and not advisable) client-side scroll, you can also use the from parameter, which defines the offset into the results and effectively lets you paginate.
@AlexandreJuma is it possible to add from? I am trying to add it and Python gives me a SyntaxError, probably because from is a reserved keyword in Python.
@thakurinbox, for compatibility with the Python ecosystem we use from_ instead of from and doc_type instead of type as parameter names
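A minimal sketch of that kind of client-side pagination with from_ and size, assuming the same es client and my_query from the question (not advisable for large result sets, and capped by index.max_result_window, 10,000 by default):

page_size = 10
offset = 0
hits = []
while True:
    # from_ is mapped to the "from" request parameter by the Python client
    page = es.search(index="my_index", body=my_query, from_=offset, size=page_size)
    batch = page['hits']['hits']
    if not batch:
        break  # no more results
    hits.extend(batch)
    offset += page_size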

It is also possible to use the elasticsearch_dsl (link) library:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd

client = Elasticsearch()
s = Search(using=client, index="my_index")

df = pd.DataFrame([hit.to_dict() for hit in s.scan()])

The secret here is s.scan(), which handles pagination and queries the entire index.

Note that the example above will return the entire index since it was not passed any query. To create a query with elasticsearch_dsl check this link.
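As a hedged illustration, attaching a query before scanning might look like this (the message field and the search term are placeholders, not from the question):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd

client = Elasticsearch()

# Hypothetical filter: only scan hits whose "message" field matches "error".
s = Search(using=client, index="my_index").query("match", message="error")
df = pd.DataFrame([hit.to_dict() for hit in s.scan()])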

Comments


Either set the size explicitly (if the number of documents is relatively small), or use the scan function to get a cursor-like iterator over a large number of documents.

Scan

Comments


This Python script shows how to paginate an Elasticsearch query with search_after by issuing repeated queries, and then exports the results to CSV.

from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers
from requests_aws4auth import AWS4Auth
import pandas as pd

access_key = 'xxxxx'
secret_key = 'xxxxx'
region_name = 'xxxx'
AWSSEARCHURI = 'xxxxxx'
awsauth = AWS4Auth(access_key, secret_key, region_name, 'es')

es = Elasticsearch(
    hosts=[{'host': AWSSEARCHURI, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

datas = []
timestamp = None

# Pull up to 161 pages of 10,000 hits each, using search_after to resume
# from the last @timestamp of the previous page.
for page in range(161):
    print('timestamp:', timestamp)
    body = {
        "query": {
            "bool": {
                "filter": [
                    {
                        "terms": {
                            "organization_industries.keyword": [
                                "Crypto",
                                "crypto",
                                "Crypto Industry",
                                "crypto industry"
                            ]
                        }
                    }
                ]
            }
        },
        "sort": [
            {"@timestamp": "desc"}
        ],
        "_source": [
            "contact_id",
            "person_name",
            "person_email",
            "type",
            "@timestamp"
        ]
    }
    # After the first page, continue from where the previous page left off.
    if timestamp is not None:
        body["search_after"] = [timestamp]

    res = es.search(body=body, size=10000, request_timeout=110)
    data = res['hits']['hits']
    if not data:
        break  # no more results
    timestamp = data[-1]['_source']['@timestamp']
    datas.extend(data)

csv_data = [dta['_source'] for dta in datas]

cs = pd.DataFrame(csv_data)
cs.to_csv('data_extractor_crypto_industry_06_10_2021.csv', index=False)

1 Comment

elasticsearchPythonSearcher\main.py:112: DeprecationWarning: The 'body' parameter is deprecated for the 'search' API and will be removed in a future version. Instead use API parameters directly. See github.com/elastic/elasticsearch-py/issues/1698 for more information res = es.search(body=body, size=10000, request_timeout=110)
