ElasticSearch 7.10.2
Python 3.8.5
elasticsearch-py 7.12.1
I'm trying to do a bulk insert of 100,000 records to ElasticSearch using elasticsearch-py bulk helper.
Here is the Python code:
import sys
import datetime
import json
import os
import logging
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
# ES Configuration start
es_hosts = [
"http://localhost:9200",]
es_api_user = 'user'
es_api_password = 'pw'
index_name = 'index1'
chunk_size = 10000
errors_before_interrupt = 5
refresh_index_after_insert = False
max_insert_retries = 3
yield_ok = False # if set to False will skip successful documents in the output
# ES Configuration end
# =======================
filename = file.json
logging.info('Importing data from {}'.format(filename))
es = Elasticsearch(
es_hosts,
#http_auth=(es_api_user, es_api_password),
sniff_on_start=True, # sniff before doing anything
sniff_on_connection_fail=True, # refresh nodes after a node fails to respond
sniffer_timeout=60, # and also every 60 seconds
retry_on_timeout=True, # should timeout trigger a retry on different node?
)
def data_generator():
f = open(filename)
for line in f:
yield {**json.loads(line), **{
"_index": index_name,
}}
errors_count = 0
for ok, result in streaming_bulk(es, data_generator(), chunk_size=chunk_size, refresh=refresh_index_after_insert,
max_retries=max_insert_retries, yield_ok=yield_ok):
if ok is not True:
logging.error('Failed to import data')
logging.error(str(result))
errors_count += 1
if errors_count == errors_before_interrupt:
logging.fatal('Too many import errors, exiting with error code')
exit(1)
print("Documents loaded to Elasticsearch")
When the json file contains a small amount of documents (~100), this code runs without issue. But I just tested it with a file of 100k documents, and I got this error:
WARNING:elasticsearch:POST http://127.0.0.1:9200/_bulk?refresh=false [status:N/A request:10.010s]
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Users/me/opt/anaconda3/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
response = self.pool.urlopen(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 386, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=10)
I have to admit this one is a bit over my head. I don't typically like to paste large error messages here, but I'm not sure what about this message is relevant.
I can't help but think that I maybe need to adjust some of the params in the es object? Or the configuration variables? I don't know enough about the params to be able to make an educated decision on my own.
And the last but certainly not least point - it looks like some documents were loaded into the ES index nonetheless. But even stranger, the count shows 110k when the json file only has 100k.