I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers; the one by user @anubhava was the one I found most straightforward and elegant. For the sake of clarity, I am going to repeat some information from the original question.
I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file for each category containing only the rows that match it.
My input file dataset.csv is something like this:
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
I am using the following to get the results I want (again, this is thanks to @anubhava).
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' dataset.csv
This works as expected, but I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not even at 20% of the whole process.
I am running this on Windows 10 with an AMD Ryzen PRO 3500 200MHz, 4 Cores, 8 Logical Processors, with 16GB of memory and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and memory at 51%. I am running awk inside a Cygwin64 Terminal.
I would love to hear some suggestions on how to improve the speed. As far as I can see, it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.
As always, I appreciate any advice.
gawk has a limit on the number of open file descriptors. Try removing the close(fn). Or sort the file on the 2nd field and call close(fn) only when the value changes...

sort -t, -nk2 dataset.csv > sorted_dataset.csv and then awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' sorted_dataset.csv and it was lightning fast (20 seconds to sort and around 10 seconds for awk). Then I tried without sorting and it was all around 25 seconds! What an improvement! Thank you. Please post it as an answer so I can mark it as correct.

awk -F, 'NR > 1{fn = $2 "_dataset.csv"; if (fn != prev) {close(prev); prev = fn} print >> fn}' <(sort -t, -nk2 dataset.csv)
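
For reference, the first suggestion in the comments (dropping close(fn)) can be sketched as follows. This is a minimal example rather than a definitive fix: it assumes a throwaway scratch directory (the `workdir` name is hypothetical) and uses only the sample rows from the question. Without close(fn), awk keeps each output file open until it exits instead of reopening and closing a file for every input row, which is the per-row cost the original command was paying.

```shell
# Hypothetical scratch directory so the example does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input copied from the question (header line plus three rows).
printf '%s\n' \
  'action,action_type, Result' \
  'up,1,stringA' \
  'down,1,strinB' \
  'left,2,stringC' > dataset.csv

# Skip the header (NR > 1) and append each row to a file named after
# column 2. There is no close(fn), so every output file is opened once
# and stays open until awk finishes.
awk -F, 'NR > 1 { fn = $2 "_dataset.csv"; print >> fn }' dataset.csv
```

Because `>>` appends, output files left over from an earlier run should be deleted before rerunning, otherwise rows are duplicated. With only 30 categories the number of simultaneously open files stays small; if there were many more distinct values, the sorted variant that closes the previous file when the key changes would be the safer choice.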