
I am running a benchmark on Elasticsearch using elasticsearch-php. I compare the time taken to index 10,000 documents one by one versus 10,000 documents in bulks of 1,000.

On my VPS (3 cores, 2 GB RAM) the performance is roughly the same with or without bulk indexing.

My PHP code (inspired by a post):

<?php
set_time_limit(0);  //  no timeout
require 'vendor/autoload.php';
$es = new Elasticsearch\Client([
    'hosts'=>['127.0.0.1:9200']
]);
$max = 10000;

// ELASTICSEARCH BULK INDEX
$temps_debut = microtime(true);
for ($i = 0; $i <=  $max; $i++) {
    $params['body'][] = array(
        'index' => array(
            '_index' => 'articles',
            '_type' => 'article',
            '_id' => 'cle' . $i
        )
    );
    $params['body'][] = array(
        'my_field' => 'my_value' . $i
    );
    if ($i % 1000) {   // Every 1000 documents stop and send the bulk request
        $responses = $es->bulk($params);
        $params = array();  // erase the old bulk request    
        unset($responses); // unset  to save memory
    }
}
$temps_fin = microtime(true);
echo 'Elasticsearch bulk: ' . round($i / round($temps_fin - $temps_debut, 4)) . ' per sec <br>';

// ELASTICSEARCH WITHOUT BULK INDEX
$temps_debut = microtime(true);
for ($i = 1; $i <= $max; $i++) {
    $params = array();
    $params['index'] = 'my_index';
    $params['type']  = 'my_type';
    $params['id']    = "key" . $i;
    $params['body']  = array('testField' => 'valeur' . $i);
    $ret = $es->index($params);
}
$temps_fin = microtime(true);
echo 'Elasticsearch One by one: ' . round($i / round($temps_fin - $temps_debut, 4)) . ' per sec <br>';
?>

Elasticsearch bulk: 1209 per sec
Elasticsearch One by one: 1197 per sec

Is there something wrong with my bulk indexing that prevents better performance?

Thanks

  • I think the problem is that you put $es->bulk($params) inside the loop; try putting it outside the loop. Commented Jul 29, 2016 at 12:38
  • The speed difference is really noticeable if you're having to send the data over a network. You're not going to manage anything like 1000 per second if you're using cURL to send them one at a time. Commented Feb 12, 2017 at 16:47

1 Answer


Replace:

if ($i % 1000) {   // Every 1000 documents stop and send the bulk request

with:

if (($i + 1) % 1000 === 0) {   // Every 1000 documents stop and send the bulk request

or you will send a bulk request for every non-zero remainder, i.e. 999 times out of every 1000 iterations. Obviously, this only works if $max is a multiple of 1000.
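To see the difference concretely, here is a small self-contained sketch (no Elasticsearch required) that counts how often each condition fires over 10,000 iterations:

```php
<?php
// Compare how often the original and corrected flush conditions fire.
$buggy = 0;  // original:  if ($i % 1000)        — truthy for every non-multiple of 1000
$fixed = 0;  // corrected: if (($i + 1) % 1000 === 0) — true once per 1000 documents

for ($i = 0; $i < 10000; $i++) {
    if ($i % 1000) {
        $buggy++;
    }
    if (($i + 1) % 1000 === 0) {
        $fixed++;
    }
}

echo "buggy condition fires: $buggy times\n"; // 9990 — nearly one bulk request per document
echo "fixed condition fires: $fixed times\n"; // 10 — one bulk request per 1000 documents
```

With the original condition the "bulk" version was effectively sending one request per document (plus tiny leftover batches), which is why its throughput matched the one-by-one version.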

Also, correct this bug:

for ($i = 0; $i <=  $max; $i++) {

will iterate over $max + 1 items. Replace it with:

for ($i = 0; $i < $max; $i++) {

There might also be a problem with how you initialize $params. Shouldn't you set it up outside the loop and only clear $params['body'] after each ->bulk()? When you reset with $params = array(); you lose all of it.

Also, remember that ES may be distributed over a cluster; bulk operations can then be spread across nodes to even out the workload, so some of the performance gain is not visible on a single physical node.
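Putting the fixes together, here is a dry-run sketch of the corrected loop. Since no cluster is available here, each would-be $es->bulk($params) call is recorded in a $bulkCalls array instead (a stand-in I introduced for illustration); against a real cluster you would send $params to the client at that point:

```php
<?php
// Dry-run of the corrected bulk loop: records each would-be bulk
// request instead of sending it, so the batching can be inspected.
$max       = 10000;
$batchSize = 1000;
$bulkCalls = [];               // one entry per bulk request that would be sent
$params    = ['body' => []];

for ($i = 0; $i < $max; $i++) {            // note: < $max, not <= $max
    $params['body'][] = [
        'index' => ['_index' => 'articles', '_id' => 'cle' . $i],
    ];
    $params['body'][] = ['my_field' => 'my_value' . $i];

    if (($i + 1) % $batchSize === 0) {     // flush once per $batchSize documents
        $bulkCalls[] = count($params['body']) / 2; // documents in this batch
        $params['body'] = [];              // reset only the body, keep $params
    }
}

// If $max were not a multiple of $batchSize, flush the leftovers here.
if (!empty($params['body'])) {
    $bulkCalls[] = count($params['body']) / 2;
}

echo count($bulkCalls) . " bulk requests of " . $bulkCalls[0] . " documents each\n";
```

This issues 10 bulk requests of 1,000 documents each instead of thousands of near-singleton requests, which is where the bulk API's round-trip savings actually come from.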
