5

I am trying to scroll through all documents with Python when I query Elasticsearch, so that I can retrieve more than 10K results:

from elasticsearch import Elasticsearch
es = Elasticsearch(ADDRESS, port=PORT)


result = es.search(
    index="INDEX",
    body=es_query,
    size=10000,
    scroll="3m")


scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]
counter = 0
print('total items = %d' % scroll_size)

while scroll_size > 0:
    counter += len(result['hits']['hits'])

    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']

print('found = %d' % counter)

The problem is that sometimes the counter (the sum of the results at the end of the program) is smaller than result["hits"]["total"]. Why is that? Why does scroll not iterate over all the results?

Elasticsearch version: 5.6
Lucene version: 6.6
8
  • Try changing scroll_size like this: scroll_size = result["hits"]["total"]["value"] Commented Feb 23, 2021 at 10:29
  • @josephthomaa scroll_size = result["hits"]["total"]["value"] raises TypeError: 'int' object is not subscriptable. But I think the problem is not in the total number of items, which is the right number; the problem is in the scroll. Commented Feb 23, 2021 at 12:13
  • What ES version are you running? Commented Feb 25, 2021 at 9:45
  • @JoeSorocin 5.6 Commented Feb 25, 2021 at 9:50
  • I tried your code and couldn't replicate your error. Can you provide some more detail? How large was the difference? Commented Feb 25, 2021 at 11:29
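
The TypeError mentioned in the comments above comes from a version difference: on 5.6, result["hits"]["total"] is a plain integer, while on 7.x and later it is an object with "value" and "relation" keys. A minimal sketch of a helper that accepts either shape follows; the name extract_total is hypothetical and not part of the client, and the sample responses are hand-written for illustration only.

```python
def extract_total(result):
    """Return the total hit count from a search response dict.

    Handles both shapes: a plain int (e.g. Elasticsearch 5.6) and a
    {"value": ..., "relation": ...} dict (7.x and later).
    """
    total = result["hits"]["total"]
    if isinstance(total, dict):
        return total["value"]
    return total


# Hand-written sample responses, only to illustrate the two shapes.
old_style = {"hits": {"total": 1600000, "hits": []}}
new_style = {"hits": {"total": {"value": 1600000, "relation": "eq"}, "hits": []}}

print(extract_total(old_style))  # prints 1600000
print(extract_total(new_style))  # prints 1600000
```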

2 Answers

3

If I'm not mistaken, you're adding the initial result["hits"]["total"] to your counter in the first iteration of the while loop -- but you should be adding just the length of the retrieved hits:

scroll_id = result['_scroll_id']
total = result["hits"]["total"]

print('total = %d' % total)

scroll_size = len(result["hits"]["hits"])  # this is the current 'page' size
counter = 0

while(scroll_size > 0):
    counter += scroll_size

    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']
    scroll_size = len(result['hits']['hits'])

print('counter = %d' % counter)
assert counter == total

As a matter of fact, you don't need to store the scroll size separately -- a more concise while loop would be:

while len(result['hits']['hits']):
    counter += len(result['hits']['hits'])

    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']
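
Putting the pieces together, here is a minimal sketch of the whole loop as a function. It counts the first page, re-reads the page on every iteration, and also sums the "failed" value from each response's "_shards" section, since a page with failed shards can come back with fewer hits than expected. Whether that explains the gap in this particular case is an assumption, not something confirmed in this thread. The names count_scrolled_hits and FakeES are hypothetical; FakeES is only an in-memory stand-in so the sketch runs without a cluster, and the longer "2m" keep-alive is likewise an assumption.

```python
def count_scrolled_hits(es, index, body, page_size=10000, keep_alive="2m"):
    """Scroll through every hit and return (counted, total, failed_shards)."""
    result = es.search(index=index, body=body, size=page_size, scroll=keep_alive)
    total = result["hits"]["total"]  # plain int on Elasticsearch 5.6
    scroll_id = result.get("_scroll_id")
    counted = 0
    failed_shards = 0
    try:
        # An empty page signals that the scroll is exhausted.
        while result["hits"]["hits"]:
            counted += len(result["hits"]["hits"])
            failed_shards += result.get("_shards", {}).get("failed", 0)
            result = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = result["_scroll_id"]
    finally:
        # Release the server-side scroll context once we are done.
        if scroll_id is not None:
            es.clear_scroll(scroll_id=scroll_id)
    return counted, total, failed_shards


class FakeES:
    """Hypothetical in-memory stand-in for the client, for illustration only."""

    def __init__(self, docs):
        self.docs = docs
        self.pos = 0
        self.size = 0
        self.cleared = False

    def _page(self):
        hits = self.docs[self.pos:self.pos + self.size]
        self.pos += len(hits)
        return {
            "_scroll_id": "fake-scroll-id",
            "_shards": {"total": 1, "successful": 1, "failed": 0},
            "hits": {"total": len(self.docs), "hits": hits},
        }

    def search(self, index, body, size, scroll):
        self.pos, self.size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


fake = FakeES([{"_id": str(i)} for i in range(25)])
counted, total, failed = count_scrolled_hits(
    fake, "INDEX", {"query": {"match_all": {}}}, page_size=10
)
print(counted, total, failed)  # prints: 25 25 0
```

With the real client, the es object from the question would be passed in place of fake. A non-zero third value would suggest inspecting the "_shards" section of the individual pages.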

1 Comment

I changed my code and I still get found = 800K while total items = 1.6M.
0

Because the first iteration already returns up to 10K hits (the size requested here), you missed that first result["hits"]["hits"] chunk.

You should try:

counter +=len(result['hits']['hits'])


1 Comment

I changed my code and I still get found = 800K while total items = 1.6M.
