3

I have a big CSV file that I need to cut into different pieces based on the value in one of the columns. My input file dataset.csv is something like this:

NOTE: edited to clarify that data is ,data, no spaces.

action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC

So, to split by action_type I simply do (I need the whole matching line in the resulting file):

awk -F, '$2 ~ /^1$/ {print}' dataset.csv >> 1_dataset.csv
awk -F, '$2 ~ /^2$/ {print}' dataset.csv >> 2_dataset.csv

This works as expected but I am basicaly travesing my original dataset twice. My original dataset is about 5GB and I have 30 action_type categories. I need to do this everyday, so, I need to script the thing to run on its own efficiently.

I tried the following but it does not work:

# This is a file called myFilter.awk

{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

Then I run it as:

awk -f myFilter.awk dataset.csv

But I get nothing. Literally nothing, no even errors. Which sort of tell me that my code is simply not matching anything or my print / pipe statement is wrong.

6
  • 2
    Is your field separator one , or one , followed by a space? Commented Oct 20, 2020 at 21:10
  • 2
    if you have a space after the comma and before the data (eg, 1 or 2), and your awk input delimiter is just a comma, then your tests become <space>1 == 1, which is 'false'; see this and this for ideas on trimming leading/trailing whitespace Commented Oct 20, 2020 at 21:13
  • @Cyrus, it is a ,. Thanks! Commented Oct 20, 2020 at 21:14
  • @markp-fuso, it does not have spaces. I just edited the question to make it clear. Thanks! Commented Oct 20, 2020 at 21:17
  • 1
    Do you want the header line included in every output file? Do you have GNU awk (awk --version)? Commented Oct 20, 2020 at 22:06

3 Answers 3

5

You may try this awk to do this in a single command:

awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
Sign up to request clarification or add additional context in comments.

6 Comments

Holy cow! This is brilliant @anubhava! It totally worked! Thank you!
That was an awesome way of contents extraction. The trick happens because of print >> fn part, which continuously appends the existed fn files as the original file progress.
Just make sure that any existing output files are deleted before running or >> will append to that existing file...
Hi @dawg. Yes, thanks for the reminder. I am also going to test other methods to find out which one is more efficient. In my 5GB file, this has been running for about 2 hours and I do not see it even close to be done (based on the size of the output files than at then their sum should match the size of the original file).
@Wilmar if you're looking for the fastest solution then update your question to say that. So far no-one has posted the fastest solution and chances are no-one will since you already accepted an answer. Also see my comment under your question and update your question to show the expected output given your posted sample input.
|
4

With GNU awk to handle many concurrently open files and without replicating the header line in each output file:

awk -F',' '{print > ($2 "_dataset.csv")}' dataset.csv

or if you also want the header line to show up in each output file then with GNU awk:

awk -F',' '
    NR==1 { hdr = $0; next }
    !seen[$2]++ { print hdr > ($2 "_dataset.csv") }
    { print > ($2 "_dataset.csv") }
' dataset.csv

or the same with any awk:

awk -F',' '
    NR==1 { hdr = $0; next }
    { out = $2 "_dataset.csv" }
    !seen[$2]++ { print hdr > out }
    { print >> out; close(out) }
' dataset.csv

Comments

1

As currently coded the input field separator has not been defined.

Current:

$ cat myfilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

Invocation:

$ awk -f myfilter.awk dataset.csv

There are a couple ways to address this:

$ awk -v FS="," -f myfilter.awk dataset.csv

or

$ cat myfilter.awk
BEGIN {FS=","}
{
action_type=$2
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

$ awk -f myfilter.awk dataset.csv

3 Comments

Thank you very much @markp-fuso. This indeed works. I anubhava 'anubhava' answer as correct for it was beautifully concise and straightforward.
@Wilmar not a problem, anubhava is just too quick on that keyboard of his! :-) and I was more interested in pointing out the issue with the original code
@mark-fuse. Absolutely, that helps a lot! Thanks again.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.