Split CSV file in bash into multiple files based on condition

Question

My csv file has multiple rows of data and I want to split it into multiple files based on one attribute.

beeline -u jdbc:hive2:<MYHOST> -n <USER> -p <PASSWORD> --silent=true --outputformat=csv2 -f <SQL FILE> > result_+%Y%m%d_%H%M%S.csv

The SQL, which includes ORDER BY ID, is run from beeline and produces a single CSV.

cat sql.csv
"attr;attr;ID;attr"
"data;data;XXXX;date"
"data;data;XXXX;date"
"data;data;YYYYY;date"
"data;data;YYYYY;date"
"data;data;BBBBB;date"
"data;data;BBBBB;date"

The desired result is to start a new file whenever a new ID appears and to use that ID in the filename.

file_1_ID_XXXX_+%Y%m%d_%H%M%S:

attr   attr    ID  attr
data    data    XXXX    date
data    data    XXXX    date

file_2_ID_YYYYY_+%Y%m%d_%H%M%S:

attr   attr    ID  attr
data    data    YYYYY   date
data    data    YYYYY   date

David C. Rankin · Accepted Answer · 2019-08-05 08:42:25Z

If I understand your question, you can take the csv file produced by the SQL and split it into the 3 files you show using a few awk variables, string concatenation to build each filename, and output redirection, e.g.

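awk -v n=1 -v dt="$(date '+%Y%m%d_%H%M%S')" '
    FNR == 1 {hdg=$0; next}    # save the heading line
    a != $3 {                  # new ID in field 3: start a new output file
        if (name != "") close(name)
        a = $3; name="file_"n"_ID_"a"_"dt; n++; print hdg > name
    }
    {print $0 > name}          # write the record to the current file
' sqldata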

Example Input File

Where your sqldata file contains:

$ cat sqldata
attr    attr    ID  attr
data    data    XXXX    date
data    data    XXXX    date
data    data    YYYYY   date
data    data    YYYYY   date
data    data    BBBBB   date
data    data    BBBBB   date

Example Use/Output Files

Copying the awk script and pasting it into the terminal (with the correct input filename) produces the following three output files:

$ cat file_1_ID_XXXX_20190805_033514
attr    attr    ID  attr
data    data    XXXX    date
data    data    XXXX    date

$ cat file_2_ID_YYYYY_20190805_033514
attr    attr    ID  attr
data    data    YYYYY   date
data    data    YYYYY   date

$ cat file_3_ID_BBBBB_20190805_033514
attr    attr    ID  attr
data    data    BBBBB   date
data    data    BBBBB   date
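
The script above relies on awk's default whitespace field splitting. If the real file is delimited by a single character instead (the sample in the question appears to use `;`), the same approach should work once the field separator is set with `-F`. The following is a minimal sketch under that assumption; the file name `sql_sample.csv` and its contents are made up for illustration, and the quoting shown in the question is left out.

```shell
# Hypothetical sample data, semicolon-delimited (an assumption based on the question).
cat > sql_sample.csv <<'EOF'
attr;attr;ID;attr
data;data;XXXX;date
data;data;XXXX;date
data;data;YYYYY;date
EOF

dt=$(date '+%Y%m%d_%H%M%S')

# -F';' splits each record on semicolons; use -F',' for a comma-delimited file.
awk -F';' -v n=1 -v dt="$dt" '
    FNR == 1 { hdg = $0; next }          # remember the header row
    $3 != id {                           # ID in field 3 changed: start a new file
        if (name != "") close(name)      # close the previous output file
        id = $3
        name = "file_" n "_ID_" id "_" dt
        n++
        print hdg > name
    }
    { print > name }                     # write the row to the current file
' sql_sample.csv
```

Closing each file before opening the next keeps the number of open file descriptors small when there are many distinct IDs. Because the input is sorted by ID, each file is opened only once.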

Look things over and let me know if this is what you intended. If not, let me know and I'm happy to help further.

5 Comments

marcin2x4 Over a year ago

I changed a != $3 {a = $3 to a != $1 {a = $1 in case ID was the first column, but the data gets split by a different attribute.

David C. Rankin Over a year ago

Unless you have different data, your column 1 just contains the heading attr, and the remaining records all contain data, so splitting on it will not do what you ask.

marcin2x4 Over a year ago

I forgot to mention that the CSV file is comma delimited, not in tabular form. Currently I see odd behaviour: the data gets split, but by a completely different attribute, which is the space-separated date. Probably because of the comma delimiter.

marcin2x4 Over a year ago

When I have only one attribute, ID, in my CSV file, everything works OK. When I add more attributes, the script doesn't parse the data into columns but sees each row as if it were a single concatenated field.

David C. Rankin Over a year ago

Awesome, good job getting that sorted. Each little victory is one more aspect of awk learned. Once you get over the initial cryptic look, awk really isn't bad at all to work with.
