
I have a CSV file with a very large number of items (5000 lines) in this format:

storeId,bookId,nb
124,48361,0
124,48363,6
125,48362,8
125,48363,2
126,28933,4
142,55433,6
142,55434,10
171,55871,7
171,55872,6

I need to count the number of stores in the file, so for example with the lines above the result should be 5. But I need to do it with 5000 lines, so I can't just loop.

How can I achieve that?

I also need to return the max quantity, so 10.

I began by converting the file into an array:

if (file_exists($file)) {
    $csv = array_map('str_getcsv', file($file));

    # Stores
    $storeIds = array_column($csv, 0);
    $eachStoreNb = array_count_values($storeIds);
    $storeCount = count($eachStoreNb);
}

print_r($storeCount);

Is there a better way to do it? Faster? Maybe without using the array?

  • Faster might mean fewer passes, but that would hardly make a significant difference. Commented Feb 21, 2021 at 12:52
  • In AWK this is trivial and can be done with a few lines of code. Just add +1 to the key of the store, like stores[123] += 1. I am not sure if PHP has similar on-demand key lists, but it could be in that direction you need to look. If PHP is not a requirement for you, I would be happy to post an AWK solution. Commented Feb 21, 2021 at 12:59
  • stackoverflow.com/questions/4610959/… Commented Feb 21, 2021 at 13:19
  • Why do you need this to be faster? How many seconds does it take now, and how many seconds should it take? Commented Feb 21, 2021 at 13:34
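For what it's worth, PHP does have exactly the kind of on-demand key list mentioned above: an associative array. A minimal sketch of the AWK-style counting in PHP (the sample rows are hard-coded here purely for illustration; in practice they would come from fgetcsv()):

```php
<?php
// Hypothetical rows; in practice each would come from fgetcsv().
$rows = [
    [124, 48361, 0],
    [124, 48363, 6],
    [125, 48362, 8],
];

$stores = [];
foreach ($rows as $row) {
    // PHP arrays grow keys on demand, just like AWK's stores[123] += 1
    $stores[$row[0]] = ($stores[$row[0]] ?? 0) + 1;
}

echo count($stores); // prints 2 (distinct stores)
```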

5 Answers


Faster here falls into micro-optimization territory; however, you can see a real improvement in memory usage.

  • You could read the file line by line instead of collecting all store IDs in an array and then running array_count_values(), saving you an extra loop and the unnecessary linear storage of duplicate values.
  • Store IDs are simply used as keys of an associative array.
  • For the max nb, keep a variable tracking the maximum value with the max() function. The rest is self-explanatory.

Snippet:

<?php

$file = 'test.csv';
if (file_exists($file)) {
    $fp = fopen($file, 'r');
    $max_nb = 0;
    $store_set = [];
    fgetcsv($fp); // skip the header line
    while (($row = fgetcsv($fp)) !== false) {
        $store_set[$row[0]] = true;              // store ID as key: duplicates collapse
        $max_nb = max($max_nb, (int) end($row)); // track the highest nb seen
    }

    fclose($fp);

    echo "Num Stores : ", count($store_set), "<br/>";
    echo "Max NB : ", $max_nb;
} else {
    echo "No such CSV file found.";
}

Note: For profiling, I suggest trying both scripts with Xdebug.
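Short of a full Xdebug profile, a quick wall-clock comparison can be sketched with microtime(); bench() and the countStores() name in the usage comment are placeholders, not part of either script above:

```php
<?php
// A generic timing helper; $fn and $file stand in for whichever
// approach (array_map-based or streaming fgetcsv) is being measured.
function bench(callable $fn, string $file): array
{
    $start = microtime(true);
    $result = $fn($file);

    return [$result, microtime(true) - $start];
}

// Usage sketch (assuming some countStores() implementation and a test.csv):
// [$count, $seconds] = bench('countStores', 'test.csv');
// printf("%d stores in %.4f s\n", $count, $seconds);
```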


What if you looped through the file line by line? I mean ...

$datas = [];
$handle = fopen("filename.csv", "r");

$flagFirstLine = true;
while (!feof($handle)) {
    $csvLine = fgetcsv($handle);
    if ($csvLine === false) {
        continue;
    }
    // don't read the first (header) line
    if ($flagFirstLine) {
        $flagFirstLine = false;
        continue;
    }

    $datas[] = $csvLine[0];
}

echo "all rows: " . count($datas);
echo "\nnum stores: " . count(array_unique($datas));

2 Comments

Is this faster? Or just a different way to do it? Have you tested both to see if this really makes a difference?
@nice_dev yeah, I see. I just changed my code. How does it look now?

What 'nice_dev' says, but a little more compact.

$fp = fopen('<your_file>', 'r');

fgetcsv($fp); // skip the header line
$stores = [];
while ($row = fgetcsv($fp)) {
    // store ID => highest nb seen for that store
    $stores[$row[0]] = max($stores[$row[0]] ?? 0, $row[2]);
}

Working example.
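With $stores built this way (store ID => highest nb seen for that store), both requested figures fall out directly; a sketch with the array hard-coded to the sample data:

```php
<?php
// Hypothetical end state of the loop above for the sample data:
// store ID => highest nb seen for that store.
$stores = [124 => 6, 125 => 8, 126 => 4, 142 => 10, 171 => 7];

echo "Num stores: ", count($stores), "\n"; // prints 5
echo "Max NB: ", max($stores), "\n";       // prints 10
```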


An answer with awk would be:

awk -F, 'BEGIN {getline}
         { a[$1]++; m=$3>m?$3:m }
         END{ for (i in a){ print i, a[i] };
              print "Number of stores",length(a), "max:",m}' testfile
  1. getline skips the first (header) line.
  2. Increment the element keyed by the first column ($1) in array a by one, and keep the running max value in m.
  3. Loop over the array a and print all the counts (optional).
  4. Print the total "Number of stores" and the max value.

output:

124 2
125 2
126 1
142 2
171 2
Number of stores 5 max: 10


Solution in AWK, for comparison. This includes the count of each store as well. AWK should be able to process millions of lines in under a second. I use the same approach to filter duplicates from a file.

BEGIN{            # set some variables initially
  FS=","          # field separator for input
  mymax=0         # init variable mymax
}
NR>1 {            # skip the header line; this matches line 2 onwards
  mycount[$1]++   # increment the associative array at that key
  if ($3>mymax){  # compare with the current max
    mymax=$3
  }
}
END{              # finally print results
  for (i in mycount){
    if (length(i)>0){
      print "value " i " has " mycount[i]
    }
  }
  print "Maximum value is " mymax
}

2 Comments

I am getting Maximum value is nb because the string nb compares greater than the numbers in the third column. 😉
Ahh, yes, sorry! My test file had the header left out. Just change line 5 to read "NR>1 {" to skip the header line. My mistake.
