
I have a CSV file with a very large number of items (5000 lines) in this format:

storeId,bookId,nb
124,48361,0
124,48363,6
125,48362,8
125,48363,2
126,28933,4
142,55433,6
142,55434,10
171,55871,7
171,55872,6

I need to count the number of stores in the file, so for example with the lines above the result should be 5. But I need to do it with 5000 lines, so I can't just loop.

How can I achieve that?

I also need to return the max quantity, so 10.

I began by converting the file into an array:

if (file_exists($file)) {
    $csv = array_map('str_getcsv', file($file));

    # Stores
    $storeIds = array_column($csv, 0);
    $eachStoreNb = array_count_values($storeIds);
    $storeCount = count($eachStoreNb);
}

print_r($storeCount);

Is there a better way to do it? Faster? Maybe without using the array?

  • Faster might mean fewer passes, but that would hardly make a significant difference. Commented Feb 21, 2021 at 12:52
  • In AWK this is trivial and can be done with a few lines of code. Just add +1 to the key of the store, like stores[123] += 1. I am not sure if PHP has similar on-demand key lists, but it could be in that direction you need to look. If PHP is not a requirement for you, I would be happy to post an AWK solution. Commented Feb 21, 2021 at 12:59
  • stackoverflow.com/questions/4610959/… Commented Feb 21, 2021 at 13:19
  • Why do you need this to be faster? How many seconds does it take now, and how many seconds should it take? Commented Feb 21, 2021 at 13:34
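For what it's worth, PHP does have exactly the kind of on-demand key list mentioned above: an associative array. A minimal sketch of the AWK-style counting in PHP (the sample rows are hard-coded here purely for illustration; in practice they would come from fgetcsv()):

```php
<?php
// Hypothetical rows; in practice each would come from fgetcsv().
$rows = [
    [124, 48361, 0],
    [124, 48363, 6],
    [125, 48362, 8],
];

$stores = [];
foreach ($rows as $row) {
    // PHP arrays grow keys on demand, just like AWK's stores[123] += 1
    $stores[$row[0]] = ($stores[$row[0]] ?? 0) + 1;
}

echo count($stores); // prints 2 (distinct stores)
```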

5 Answers


Faster here falls into micro-optimization territory; however, you can see a real improvement in memory usage.

  • You could read the file line by line instead of collecting all store IDs in an array and then running array_count_values(), saving you an extra loop and the unnecessary linear storage of duplicate values.
  • Store IDs are simply used as keys of an associative array.
  • For the max nb, keep a variable tracking the maximum value with the max() function. The rest is self-explanatory.

Snippet:

<?php

$file = 'test.csv';
if (file_exists($file)) {
    $fp = fopen($file, 'r');
    $max_nb = 0;
    $store_set = [];
    fgetcsv($fp); // skip the header line
    while (($row = fgetcsv($fp)) !== false) {
        $store_set[$row[0]] = true;              // store ID as key: duplicates collapse
        $max_nb = max($max_nb, (int) end($row)); // track the highest nb seen
    }

    fclose($fp);

    echo "Num Stores : ", count($store_set), "<br/>";
    echo "Max NB : ", $max_nb;
} else {
    echo "No such CSV file found.";
}

Note: For profiling, I suggest trying both scripts with Xdebug.
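Short of a full Xdebug profile, a quick wall-clock comparison can be sketched with microtime(); bench() and the countStores() name in the usage comment are placeholders, not part of either script above:

```php
<?php
// A generic timing helper; $fn and $file stand in for whichever
// approach (array_map-based or streaming fgetcsv) is being measured.
function bench(callable $fn, string $file): array
{
    $start = microtime(true);
    $result = $fn($file);

    return [$result, microtime(true) - $start];
}

// Usage sketch (assuming some countStores() implementation and a test.csv):
// [$count, $seconds] = bench('countStores', 'test.csv');
// printf("%d stores in %.4f s\n", $count, $seconds);
```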


What if you looped through the file line by line? I mean ...

$datas = [];
$handle = fopen("filename.csv", "r");

$flagFirstLine = true;
while (!feof($handle)) {
    $csvLine = fgetcsv($handle);
    if ($csvLine === false) {
        continue;
    }
    // don't read the first (header) line
    if ($flagFirstLine) {
        $flagFirstLine = false;
        continue;
    }

    $datas[] = $csvLine[0];
}

echo "all rows: " . count($datas);
echo "\nnum stores: " . count(array_unique($datas));

2 Comments

Is this faster? Or just a different way to do it? Have you tested both to see if this really makes a difference?
@nice_dev yeah, I see. I just changed my code. How does it look now?

What 'nice_dev' says, but a little more compact.

$fp = fopen('<your_file>', 'r');

fgetcsv($fp); // skip the header line
$stores = [];
while ($row = fgetcsv($fp)) {
    // store ID => highest nb seen for that store
    $stores[$row[0]] = max($stores[$row[0]] ?? 0, $row[2]);
}

Working example.
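With $stores built this way (store ID => highest nb seen for that store), both requested figures fall out directly; a sketch with the array hard-coded to the sample data:

```php
<?php
// Hypothetical end state of the loop above for the sample data:
// store ID => highest nb seen for that store.
$stores = [124 => 6, 125 => 8, 126 => 4, 142 => 10, 171 => 7];

echo "Num stores: ", count($stores), "\n"; // prints 5
echo "Max NB: ", max($stores), "\n";       // prints 10
```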


An answer with awk would be:

awk -F, 'BEGIN {getline}
         { a[$1]++; m=$3>m?$3:m }
         END{ for (i in a){ print i, a[i] };
              print "Number of stores",length(a), "max:",m}' testfile
  1. getline skips the first (header) line.
  2. Increment the element keyed by the first column ($1) in array a by one, and keep the running max value in m.
  3. Loop over the array a and print all the counts (optional).
  4. Print the total "Number of stores" and the max value.

output:

124 2
125 2
126 1
142 2
171 2
Number of stores 5 max: 10


Solution in AWK, for comparison. This includes the count of each store as well. AWK should be able to process millions of lines in under a second. I use the same approach to filter duplicates from a file.

BEGIN{            # set some variables initially
  FS=","          # field separator for input
  mymax=0         # init variable mymax
}
NR>1 {            # skip the header line; this matches line 2 onwards
  mycount[$1]++   # increment the associative array at that key
  if ($3>mymax){  # compare with the current max
    mymax=$3
  }
}
END{              # finally print results
  for (i in mycount){
    if (length(i)>0){
      print "value " i " has " mycount[i]
    }
  }
  print "Maximum value is " mymax
}

2 Comments

I am getting Maximum value is nb because the string nb compares greater than the numbers in the third column. 😉
Ahh, yes, sorry! My test file had the header left out. Just change line 5 to read "NR>1 {" to skip the header line. My mistake.
