Scroll in python Elasticsearch not working

Question

I am trying to scroll through all documents from Python when querying Elasticsearch, so that I can retrieve more than 10K results:

from elasticsearch import Elasticsearch
es = Elasticsearch(ADDRESS, port=PORT)


result = es.search(
    index="INDEX",
    body=es_query,
    size=10000,
    scroll="3m")


scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]
counter = 0
print('total items= ' + str(scroll_size))

while(scroll_size > 0):
    counter += len(result['hits']['hits'])

    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']

print('found = ' + str(counter))

The problem is that sometimes the counter (the sum of the results at the end of the program) is smaller than result["hits"]["total"]. Why is that? Why does scroll not iterate over all the results?

Elasticsearch version: 5.6
Lucene version: 6.6

Try changing scroll_size like this: scroll_size = result["hits"]["total"]["value"]
– josephthomaa, Commented Feb 23, 2021 at 10:29
@josephthomaa scroll_size = result["hits"]["total"]["value"] raises "TypeError: 'int' object is not subscriptable". But I think the problem is not the total item count, which is the right number; the problem is in the scroll.
– MicrosoctCprog, Commented Feb 23, 2021 at 12:13
I tried your code and couldn't replicate your error. Can you provide some more detail? How large was the difference?
– Jozef - Spatialized.io, Commented Feb 25, 2021 at 11:29
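
The TypeError in the comments above is consistent with how the response shape differs between versions: on Elasticsearch 5.x and 6.x, result["hits"]["total"] is a plain integer, while 7.0 and later return an object with a "value" key. A minimal sketch of a helper that accepts both shapes; total_hits is a hypothetical name, not part of the client library:

```python
def total_hits(result):
    """Return the total hit count from a search response.

    Handles both the integer form (Elasticsearch 5.x/6.x) and the
    {"value": ..., "relation": ...} object form (7.0 and later).
    """
    total = result["hits"]["total"]
    if isinstance(total, dict):
        # Newer response shape: the count lives under "value".
        return total["value"]
    # Older response shape: the total is already an integer.
    return total
```

On the 5.6 cluster from the question this simply returns the integer that result["hits"]["total"] already holds.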

Jozef - Spatialized.io · Accepted Answer · 2021-02-25 16:19:17Z

3

If I'm not mistaken, you're adding the initial result["hits"]["total"] to your counter in the first iteration of the while loop -- but you should be adding just the length of the retrieved hits:

scroll_id = result['_scroll_id']
total = result["hits"]["total"]

print('total = %d' % total)

scroll_size = len(result["hits"]["hits"])  # this is the current 'page' size
counter = 0

while(scroll_size > 0):
    counter += scroll_size

    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']
    scroll_size = len(result['hits']['hits'])

print('counter = %d' % counter)
assert counter == total

As a matter of fact, you don't need to store the scroll size separately -- a more concise while loop would be:

while len(result['hits']['hits']):
    counter += len(result['hits']['hits'])

    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']
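
The thread does not settle why the asker's count still stops at roughly half of the total. One possibility, which is an assumption and not something confirmed here, is that the "1s" keep-alive lets the search context expire on some shards between requests, so later pages come back partial. Below is a minimal sketch of a loop that uses a longer keep-alive, checks the _shards.failed field on every page, and clears the scroll when done. The names scroll_all, page_size and keep_alive are hypothetical, and the calls assume the body=-style elasticsearch-py client used in the question:

```python
def scroll_all(es, index, query, page_size=1000, keep_alive="2m"):
    """Yield every hit of a scrolled search, failing loudly on shard errors.

    `es` is expected to behave like an elasticsearch-py client
    (search, scroll and clear_scroll methods).
    """
    result = es.search(index=index, body=query, size=page_size, scroll=keep_alive)
    scroll_id = result.get("_scroll_id")
    try:
        while True:
            # A non-zero failed-shard count means this page is incomplete.
            failed = result.get("_shards", {}).get("failed", 0)
            if failed:
                raise RuntimeError("%d shard(s) failed; results are partial" % failed)

            hits = result["hits"]["hits"]
            if not hits:
                break  # an empty page marks the end of the scroll
            for hit in hits:
                yield hit

            # Renew the keep-alive on every request and track the latest id.
            result = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = result.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side search context even on errors.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
```

Counting then becomes counter = sum(1 for _ in scroll_all(es, "INDEX", es_query)). If the RuntimeError fires, the shortfall comes from failed shards and not from the counting logic.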

answered Feb 25, 2021 at 16:19

1 Comment

MicrosoctCprog Over a year ago

I changed my code and I still get found = 800K while total items = 1.6M.

DARK_C0D3R · Accepted Answer · 2021-03-02 11:01:08Z

0

Because the first iteration already returns up to 10K hits (generally the default limit), you missed the result["hits"]["hits"] chunk from that first response.

You should try:

counter +=len(result['hits']['hits'])
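
As an alternative to writing the loop by hand, the Python client ships a scan helper in elasticsearch.helpers that wraps the search/scroll calls and, with raise_on_error=True, raises when shards fail instead of returning partial pages. A minimal sketch, assuming that helper is available in the installed client; count_with_scan and its scan_fn parameter are hypothetical (scan_fn exists only so the function can be exercised without a cluster):

```python
def count_with_scan(client, index, query, scan_fn=None):
    """Count all documents matching `query` using the scan helper."""
    if scan_fn is None:
        # Imported lazily so this sketch can be loaded without the library.
        from elasticsearch.helpers import scan as scan_fn

    hits = scan_fn(
        client,
        query=query,
        index=index,          # forwarded to the underlying search call
        scroll="2m",          # keep-alive renewed on every page
        size=1000,            # hits per page
        raise_on_error=True,  # fail if any shard reports an error
    )
    return sum(1 for _ in hits)
```

A call such as count_with_scan(es, "INDEX", es_query) should then either match the reported total or raise, which helps separate a counting bug from a cluster-side failure.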

edited Mar 2, 2021 at 11:01

answered Mar 2, 2021 at 10:54

1 Comment

MicrosoctCprog Over a year ago

I changed my code and I still get found = 800K while total items = 1.6M.
