
I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers; the one by user @anubhava was the one I found most straightforward and elegant. For the sake of clarity I am going to repeat some information from the original question.

I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file containing only the rows matching each category.

My input file dataset.csv is something like this:

action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
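
For example, given the sample above and the $2 "_dataset.csv" naming used below, the goal is one file per action_type value containing only that category's rows (a sketch of the expected output; the header line is not copied):

1_dataset.csv:
up,1,stringA
down,1,strinB

2_dataset.csv:
left,2,stringC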

I am using the following to get the results I want (again, this is thanks to @anubhava).

awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file

This works as expected, but I have found it quite slow. It has been running for 14 hours now and, judging by the size of the output files compared to the original file, it is not even at 20% of the whole process.

I am running this on Windows 10 with an AMD Ryzen PRO 3500 200MHz, 4 cores, 8 logical processors, 16GB of memory and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and memory at 51%. I am running awk inside a Cygwin64 Terminal.

I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.

As always, I appreciate any advice.

5 Comments
  • I don't think gawk has a limit on the number of open fds. Try removing the close(fn). Or sort the file based on the 2nd field and close(fn) only when the value changes... Commented Oct 21, 2020 at 13:16
  • Thanks @vgersh99, I am trying it. I am sorting the file first. You are right, that will probably make it faster. Also, if the file is being closed after each match, it probably has to go through the whole thing every time. Thanks, trying it now. Commented Oct 21, 2020 at 13:34
  • You should try Ed Morton's solution from your other question... Commented Oct 21, 2020 at 13:40
  • I can actually report results now. I did as @vgersh99 suggested, sorted the file with sort -t, -nk2 dataset.csv > sorted_dataset.csv and then ran awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' sorted_dataset.csv, and it was lightning fast (20 seconds to sort and around 10 seconds for awk). Then I tried it without sorting and it was around 25 seconds altogether! What an improvement! Thank you. Please post it as an answer so I can mark it as correct. Commented Oct 21, 2020 at 13:42
  • @Wilmar, you're still closing the file for each line/record. awk -F, 'NR > 1{fn = $2 "_dataset.csv"; if(!seen[$2]++) close(fn); print >> fn}' <(sort -t, -nk2 dataset.csv) Commented Oct 21, 2020 at 13:53

2 Answers


with sorting:

awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)

or with gawk (unlimited number of open fds):

gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv
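
Two practical notes. First, the <(...) process substitution in the first variant needs bash or a similar shell (the Cygwin64 Terminal provides bash); the same pipeline can also be written as a plain pipe, for example:

sort -t, -nk2 dataset.csv |
    awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}'

Second, because both variants append with >>, any *_dataset.csv files left over from an earlier partial run keep their old rows. Assuming the category values are plain numbers as in the sample, something like this clears them before re-running:

rm -f [0-9]*_dataset.csv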

1 Comment

Very good suggestion to use a pre-sorted file for non-GNU awk versions ++

This is the right way to do it using any awk:

$ tail -n +2 file | sort -t, -k2,2n |
    awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'

The reason I say this is the right approach is that it:

  • doesn't rely on the 2nd field of the header line coming before the data values when sorted,
  • doesn't require awk to test NR > 1 for every line of input,
  • doesn't need an array to store $2s or any other values,
  • only keeps 1 output file open at a time (the more files open at once, the slower any awk will run, especially gawk once you get past the limit of open files supported by other awks, as gawk then has to start opening/closing the files in the background as needed),
  • doesn't require you to empty existing output files before you run it, since with > rather than >> it will do that automatically, and
  • only does the string concatenation to create the output file name once per output file, not once per line.

Just like the currently accepted answer, the sort above could reorder the input lines that have the same $2 value. Add -s if that's undesirable and you have GNU sort; with other sorts you need to replace the tail with a different awk command and add another sort arg.
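
For example, a sketch of that variant for a non-GNU sort (one possible way, assuming no commas inside fields): decorate each data line with its original line number, sort on the category first and on that number second, then strip the number off again:

awk -v OFS=',' 'NR > 1{print NR, $0}' file |
    sort -t, -k3,3n -k1,1n |
    cut -d, -f2- |
    awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'

The extra numeric key on the prepended line number keeps rows with the same action_type in their original relative order, which is what -s does for GNU sort.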

2 Comments

completely agree with all the points! But... 1) out fd is out, therefore we need to close(out), not close($2) 2) we need to check if out is not "null" (first record), before closing this fd. My $.02
@vgersh99 absolutely right, and fixed so it's close(out), but we don't need to check it as close(null) simply does nothing.
