
I have a script that parses a CSV file with a million rows into an array.

I want to batch this with a cronjob. For example, every 100,000 rows I want to pause the script and then continue it again, to prevent memory leaks and so on.

My script currently looks like this. What it does is not relevant, but how can I loop through it in batches with a cronjob?

Can I just make a cronjob that calls this script every 5 minutes and remembers where the foreach loop was paused?

$csv = file_get_contents(CSV);
$array = array_map("str_getcsv", explode("\n", $csv));

$headers = $array[0];
$number_of_records = count($array);
$params = ['body' => []];

for ($i = 1; $i < $number_of_records; $i++) {
  $params['body'][] = [
    'index' => [
      '_index' => INDEX,
      '_type' => TYPE,
      '_id' => $i
    ]
  ];

  // Replace the numeric keys with the column names from the header row
  foreach ($array[$i] as $key => $value) {
    $array[$i][$headers[$key]] = $value;
    unset($array[$i][$key]);
  }

  // Add the document fields
  $params['body'][] = [
    'Inrijdtijd' => $array[$i]['Inrijdtijd'],
    'Uitrijdtijd' => $array[$i]['Uitrijdtijd'],
    'Parkeerduur' => $array[$i]['Parkeerduur'],
    'Betaald' => $array[$i]['Betaald'],
    'bedrag' => $array[$i]['bedrag']
  ];

  // Every 100,000 documents, send the bulk request
  if ($i % 100000 == 0) {
    $responses = $client->bulk($params);

    // erase the old bulk request
    $params = ['body' => []];

    // unset the bulk response when you are done to save memory
    unset($responses);
  }
}

// Send the last batch if it exists
if (!empty($params['body'])) {
  $responses = $client->bulk($params);
}
  • Where does $array come from? Is it saved somewhere again afterwards? Whether you can simply run the script repeatedly (assuming it resumes) depends on such factors. Commented Apr 7, 2016 at 10:56
  • To prevent memory leaks, you shouldn't stop every 100,000 rows; you should use streams instead. Also, just don't use $responses: you never use it, you only allocate it and unset it. Commented Apr 7, 2016 at 11:12
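As the second comment suggests, the memory problem can be avoided entirely by streaming the file with fgetcsv() instead of loading it all at once with file_get_contents(). A minimal sketch of that idea (the function name csvRecords is illustrative, not part of the original script):

```php
<?php
// Stream the CSV row by row with fgetcsv() instead of loading the whole
// file via file_get_contents(). A generator keeps memory usage flat no
// matter how many rows the file has.
function csvRecords(string $path): \Generator
{
    $fp = fopen($path, 'r');
    $headers = fgetcsv($fp); // first line holds the column names

    while (($row = fgetcsv($fp)) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv returns [null] for blank lines
        }
        // e.g. ['Inrijdtijd' => '...', 'bedrag' => '2.50']
        yield array_combine($headers, $row);
    }

    fclose($fp);
}
```

Inside `foreach (csvRecords(CSV) as $record)` you would build `$params['body']` and call `$client->bulk()` every N rows exactly as before, but without ever holding the full file in memory.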

1 Answer


As given, the script will always start processing from the beginning, since no pointer of any kind is kept.
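If you do want a cronjob that "remembers where the loop paused", as the question asks, a state file can store how many data rows earlier runs already handled. A minimal sketch (the state-file parameter, batch size, and $handle callback are illustrative; note that each run still has to scan past the already-processed rows, which the file-splitting approach below avoids):

```php
<?php
// Each cron run reads the offset from a state file, skips that many data
// rows, processes up to $batchSize rows, and writes the new offset back.
// Returns the number of rows actually handled in this run.
function processBatch(string $csvPath, string $stateFile, int $batchSize, callable $handle): int
{
    $offset = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

    $fp = fopen($csvPath, 'r');
    $headers = fgetcsv($fp); // header row

    // Skip the rows that earlier runs already consumed
    for ($i = 0; $i < $offset && fgetcsv($fp) !== false; $i++);

    $done = 0;     // rows handed to $handle
    $consumed = 0; // rows read from the file (includes blank lines)
    while ($done < $batchSize && ($row = fgetcsv($fp)) !== false) {
        $consumed++;
        if ($row !== [null]) { // fgetcsv returns [null] for blank lines
            $handle(array_combine($headers, $row)); // e.g. queue for bulk indexing
            $done++;
        }
    }
    fclose($fp);

    // Remember where the next cron run should resume
    file_put_contents($stateFile, $offset + $consumed);

    return $done;
}
```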

My suggestion would be to split the CSV file into pieces and let another script parse the pieces one by one (e.g. every 5 minutes), deleting each piece afterwards.

$fp = fopen(CSV, 'r');

$head  = fgets($fp);
$count = 0;

$output = [$head];
while (($line = fgets($fp)) !== false) {
    $output[] = $line;

    // fgets() keeps the trailing newline, so the lines can be joined as-is
    if (count($output) == 10000) {
        file_put_contents('batches/batch-' . $count . '.csv', implode('', $output));
        $count++;

        $output = [$head];
    }
}
fclose($fp);

// Write the remaining rows (if any) as the last batch
if (count($output) > 1) {
    file_put_contents('batches/batch-' . $count . '.csv', implode('', $output));
}

Now the original script can process a file every time:

// array_values() is needed because array_diff() preserves the original keys,
// so index 0 would otherwise not exist after removing '.' and '..'
$files = array_values(array_diff(scandir('batches/'), ['.', '..']));

if (count($files) > 0) {
    $file = 'batches/' . $files[0];

    // PROCESS FILE

    unlink($file);
}
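To run that processing script every 5 minutes, a crontab entry along these lines would do (the PHP binary, script path, and log path are placeholders):

```shell
# Process one batch file every 5 minutes; adjust paths to your setup.
*/5 * * * * /usr/bin/php /path/to/process-batch.php >> /var/log/csv-import.log 2>&1
```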
