awk output to file based on filter

Question

I have a big CSV file that I need to cut into different pieces based on the value in one of the columns. My input file dataset.csv is something like this:

NOTE: edited to clarify that data is ,data, no spaces.

action,action_type,Result
up,1,stringA
down,1,stringB
left,2,stringC

So, to split by action_type I simply do (I need the whole matching line in the resulting file):

awk -F, '$2 ~ /^1$/ {print}' dataset.csv >> 1_dataset.csv
awk -F, '$2 ~ /^2$/ {print}' dataset.csv >> 2_dataset.csv

This works as expected, but I am basically traversing my original dataset twice. My original dataset is about 5GB and I have 30 action_type categories. I need to do this every day, so I need to script it to run on its own efficiently.

I tried the following but it does not work:

# This is a file called myFilter.awk

{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

Then I run it as:

awk -f myFilter.awk dataset.csv

But I get nothing. Literally nothing, not even errors. That suggests that either my code is simply not matching anything or my print / pipe statement is wrong.

Is your field separator one , or one , followed by a space?
– Cyrus, Commented Oct 20, 2020 at 21:10
if you have a space after the comma and before the data (eg, 1 or 2), and your awk input delimiter is just a comma, then your tests become <space>1 == 1, which is 'false'; see this and this for ideas on trimming leading/trailing whitespace
– markp-fuso, Commented Oct 20, 2020 at 21:13
@markp-fuso, it does not have spaces. I just edited the question to make it clear. Thanks!
– Wilmar, Commented Oct 20, 2020 at 21:17
Do you want the header line included in every output file? Do you have GNU awk (awk --version)?
– Ed Morton, Commented Oct 20, 2020 at 22:06

anubhava · Accepted Answer · 2020-10-20 21:16:02Z

5

You may try this awk to do this in a single command:

awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
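To see what this one-liner produces, here is a minimal sketch run against a tiny stand-in for the real data. The file name sample.csv and its rows are made up for illustration. The stale output files are removed first, because >> appends to whatever already exists.

```shell
# Run this in a scratch directory: it creates and deletes files in the current directory.
# sample.csv is a hypothetical stand-in for the real dataset.csv.
printf '%s\n' 'action,action_type,Result' \
              'up,1,stringA' \
              'down,1,stringB' \
              'left,2,stringC' > sample.csv

# Remove output left over from an earlier run; >> would otherwise append to it.
rm -f 1_dataset.csv 2_dataset.csv

# NR > 1 skips the header line; each remaining line goes to <action_type>_dataset.csv.
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' sample.csv

cat 1_dataset.csv   # the two rows with action_type 1
cat 2_dataset.csv   # the single row with action_type 2
```

Note that close(fn) runs once per input line, so every line reopens its output file. That keeps the number of open files at one, at the cost of extra open/close work on a large input.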

6 Comments

Wilmar Over a year ago

Holy cow! This is brilliant @anubhava! It totally worked! Thank you!

George Vasiliou Over a year ago

That was an awesome way of extracting the contents. The trick works because of the print >> fn part, which keeps appending to the existing fn files as the original file is processed.

dawg Over a year ago

Just make sure that any existing output files are deleted before running or >> will append to that existing file...

Wilmar Over a year ago

Hi @dawg. Yes, thanks for the reminder. I am also going to test other methods to find out which one is more efficient. On my 5GB file, this has been running for about 2 hours and it does not look even close to done (judging by the size of the output files, whose sum should eventually match the size of the original file).

Ed Morton Over a year ago

@Wilmar if you're looking for the fastest solution then update your question to say that. So far no-one has posted the fastest solution and chances are no-one will since you already accepted an answer. Also see my comment under your question and update your question to show the expected output given your posted sample input.

Ed Morton · Answer · 2020-10-20 22:12:49Z

4

With GNU awk to handle many concurrently open files and without replicating the header line in each output file:

awk -F',' '{print > ($2 "_dataset.csv")}' dataset.csv

or if you also want the header line to show up in each output file then with GNU awk:

awk -F',' '
    NR==1 { hdr = $0; next }
    !seen[$2]++ { print hdr > ($2 "_dataset.csv") }
    { print > ($2 "_dataset.csv") }
' dataset.csv

or the same with any awk:

awk -F',' '
    NR==1 { hdr = $0; next }
    { out = $2 "_dataset.csv" }
    !seen[$2]++ { print hdr > out }
    { print >> out; close(out) }
' dataset.csv
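A possible middle ground between the two approaches above, offered only as a sketch: close the current output file when the target file changes, rather than after every line. This should still work with any awk and keeps a single file open at a time. It only saves work when rows with the same action_type tend to be adjacent, which is an assumption about the data, and it has not been benchmarked. The file name sample.csv and its rows are hypothetical.

```shell
# Run this in a scratch directory: it creates files in the current directory.
# sample.csv is a hypothetical stand-in for the real dataset.csv.
printf '%s\n' 'action,action_type,Result' \
              'up,1,stringA' \
              'down,1,stringB' \
              'left,2,stringC' \
              'right,1,stringD' > sample.csv

awk -F',' '
    NR==1 { hdr = $0; next }
    { fn = $2 "_dataset.csv" }
    fn != out {                          # target file changed (or this is the first data row)
        if (out != "") close(out)        # keep only one output file open at a time
        out = fn
        if (!seen[out]++) {              # first time this file is used in this run:
            print hdr > out              # truncate any stale copy and write the header
            close(out)
        }
    }
    { print >> out }                     # append the row; the file stays open until fn changes
' sample.csv

cat 1_dataset.csv
cat 2_dataset.csv
```

Because the header is written with > the first time each file is used, output left over from a previous run is overwritten instead of appended to.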

markp-fuso · Answer · 2020-10-20 21:23:30Z

1

As currently coded the input field separator has not been defined.

Current:

$ cat myfilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

Invocation:

$ awk -f myfilter.awk dataset.csv

There are a couple ways to address this:

$ awk -v FS="," -f myfilter.awk dataset.csv

or

$ cat myfilter.awk
BEGIN {FS=","}
{
action_type=$2
# output file names must be quoted string literals
if (action_type=="1") print $0 >> "1_dataset.csv";
else if (action_type=="2") print $0 >> "2_dataset.csv";
}

$ awk -f myfilter.awk dataset.csv
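The effect of the missing field separator can be checked on a single line. This is a small sketch, assuming the data contains no whitespace, as stated in the question. The variable names default_fs and comma_fs are only for illustration.

```shell
# With the default FS (whitespace), the whole line is one field, so $2 is empty.
default_fs=$(printf 'up,1,stringA\n' | awk '{ print "[" $2 "]" }')

# With FS set to a comma, $2 is the action_type column.
comma_fs=$(printf 'up,1,stringA\n' | awk -F, '{ print "[" $2 "]" }')

echo "default FS: $default_fs"   # prints: default FS: []
echo "comma FS:   $comma_fs"     # prints: comma FS:   [1]
```

With an empty $2, neither action_type=="1" nor action_type=="2" can match, so no print statement is ever reached.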

3 Comments

Wilmar Over a year ago

Thank you very much @markp-fuso. This indeed works. I marked anubhava's answer as correct because it was beautifully concise and straightforward.

markp-fuso Over a year ago

@Wilmar not a problem, anubhava is just too quick on that keyboard of his! :-) and I was more interested in pointing out the issue with the original code

Wilmar Over a year ago

@markp-fuso. Absolutely, that helps a lot! Thanks again.
