Split a CSV file unevenly into multiple files in bash scripting

Question

I have a folder with a few big CSV files, and I want to split them into a variable number of almost equally sized CSV files.

At the moment, this is my implementation, which divides the data evenly:

#!/bin/bash

# copy the header to every resulting file part
head -n 1 "$1"_2021.csv | awk -v NPROC="$(nproc)" '{ for (i = 0; i < NPROC; ++i) print $0 > ("file_" i ".csv") }'

# copy the data, without the header, into the file parts
tail --silent -n +2 "$1"* | awk -v NPROC="$(nproc)" '{ part = NR % NPROC; print $0 >> ("file_" part ".csv") }'

Here $1 is the version of the files, passed as a parameter to the bash script, for instance v1 or v2. The output filenames are not relevant; currently both awk commands produce the same filenames, file_N.csv, where N (i in the first command, part in the second) ranges from 0 to NPROC-1.

Some samples of the file v1_2020.csv (semicolon delimited)

DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294

As a table, it looks like this:

DATE	COLOUR	CLOSING	CHANGE
2020-01-02	r	n	4
2020-01-02	y	n	56
2020-01-03	y	n	3
2020-01-03	r	n	46
2020-01-03	b	n	20
2020-01-03	w	n	1252
2020-01-05	w	n	453
2020-01-06	b	y	1
2020-01-06	b	n	945

I want to improve this division so that rows with the same date are never split across different files. The split should therefore take the DATE column of the CSV file into account.

Current output with `NPROC=2`:

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-03;y;n;3;153
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253
2020-01-06;b;n;945;294

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;y;n;56;130
2020-01-03;r;n;46;192
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279

New output with `NPROC=2`:

Any kind of uneven split into NPROC files is fine, as long as no date is spread across different files. Each date should appear in just one file, but a file may contain multiple dates.

For instance (any other split into NPROC files is fine if it respects the conditions above):

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294

Could you give me any hint regarding a possible solution without using Python but just bash scripting?

Edit your question to replace those graphical tables with the raw CSV that you generated them from, so that we can copy/paste those files to test a potential solution against them and help you.
– Ed Morton, Commented Jul 13, 2021 at 14:50

dawg · Accepted Answer · 2021-07-13 14:11:38Z

3

If you just want to split a csv and add a header to each split, you can do:

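awk -v cnt=6 -F ';' 'FNR==1{header=$0; fn=1}
# every cnt-th input line, close the current part and start a new one with the header
!(FNR%cnt){
    close("file_" fn ".csv")
    fn++
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv")}' file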

If you want to split contextually based on the date column (assuming already sorted):

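awk -v sp=6 -v fn=1 -F ';' 'FNR==1{header=$0}
# once more than sp lines have been counted, switch parts at the next change of date
cnt++>sp && l1!=$1 {
    close("file_" fn ".csv")
    fn++
    cnt=0
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv"); l1=$1}' file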

Result of the second command:

cat *.csv
DATE;COLOUR;CLOSING;CHANGE
2020-01-02;r;n;4
2020-01-02;y;n;56
2020-01-03;y;n;3
2020-01-03;r;n;46
2020-01-03;b;n;20
2020-01-03;w;n;1252
DATE;COLOUR;CLOSING;CHANGE
2020-01-05;w;n;453
2020-01-06;b;y;1
2020-01-06;b;n;945
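
A quick way to confirm that a split keeps every date in a single file is to scan the parts and remember which file each date was first seen in. The following is only a sketch: check_dates is a hypothetical helper name, and it assumes semicolon-delimited parts that each begin with one header row.

```shell
#!/bin/bash
# check_dates FILE...  (hypothetical helper)
# Exits 0 if no date occurs in more than one file, 1 otherwise.
check_dates() {
    awk -F';' '
        FNR == 1 { next }                        # skip the header of each file
        ($1 in seen) && seen[$1] != FILENAME {   # date already seen in another file
            printf "%s is in both %s and %s\n", $1, seen[$1], FILENAME
            bad = 1
        }
        { seen[$1] = FILENAME }
        END { exit bad + 0 }
    ' "$@"
}

# Usage: check_dates file_*.csv && echo "no date is split across files"
```

Any line it prints names a date that ended up in two parts, so an empty output with exit status 0 means the constraint holds.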

edited Jul 13, 2021 at 14:11

answered Jul 13, 2021 at 14:02

dawg

3 Comments

Mark Reed Over a year ago

Unless I'm misreading, that won't work. First, it resets fn to 1 at the start of every file and thus repeatedly overwrites the same file. But also, the print > will overwrite the whole file with each row... that should be >>.

Guido Over a year ago

I want to split it in such a way that 1) each date is contained in just one file part, and 2) a file part may contain multiple dates. I need this consistency to process the files in parallel afterwards.

dawg Over a year ago

@Guido: It would be helpful to show in your question what you WANT as output.

glenn jackman · Accepted Answer · 2021-07-13 14:38:42Z

2

awk -F';' -v NPROC=2 '
    NR == 1 {head = $0; next}
    !($1 in dates) {
        n = (n + 1) % NPROC
        file = "out_" n ".csv"
        if (!(file in created)) {
            print head > file
            created[file]
        }
        dates[$1] = file
    }
    { print > dates[$1] }
' v1_2020.csv

Since NPROC = 2, two output files are created:

$ cat out_0.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253

$ cat out_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
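
Round-robin assignment keeps each date together but ignores how many rows a date has. If the parts should also end up close in size, one option is a two-pass variant that first counts the rows per date and then hands each date to the part holding the fewest rows so far. This is a sketch and not part of the answer above: split_by_date, NPARTS and PREFIX are hypothetical names, and it assumes the input is a regular file (it is read twice) with a single header row.

```shell
#!/bin/bash
# split_by_date FILE NPARTS [PREFIX]  (hypothetical helper)
split_by_date() {
    local input=$1 nparts=$2 prefix=${3:-part_}
    awk -F';' -v NPARTS="$nparts" -v PREFIX="$prefix" '
        # Pass 1: remember the header and count the rows of each date.
        NR == FNR {
            if (FNR == 1) { head = $0; next }
            if (!($1 in rows)) order[++nd] = $1
            rows[$1]++
            next
        }
        # Start of pass 2: open the parts, then give each date (in order of
        # first appearance) to whichever part holds the fewest rows so far.
        FNR == 1 {
            for (p = 0; p < NPARTS; p++) {
                out[p] = PREFIX p ".csv"
                load[p] = 0
                print head > out[p]
            }
            for (d = 1; d <= nd; d++) {
                best = 0
                for (p = 1; p < NPARTS; p++)
                    if (load[p] < load[best]) best = p
                part[order[d]] = best
                load[best] += rows[order[d]]
            }
            next
        }
        # Pass 2: every row follows its date.
        { print > out[part[$1]] }
    ' "$input" "$input"
}

# Usage: split_by_date v1_2020.csv "$(nproc)"
```

On the nine sample rows from the question with two parts, this assignment puts 4 data rows in part_0.csv and 5 in part_1.csv. The greedy choice is not guaranteed to be optimal, and several input files would have to be concatenated (headers removed) into one file first.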

answered Jul 13, 2021 at 14:38

glenn jackman

Mark Reed · Accepted Answer · 2021-07-13 14:10:33Z

2

First, processing CSV/TSV files with command-line tools can be tricky. The awk command is the go-to here, but it doesn't have built-in support for quoting; if you have a row like column 1; "column 2 has a ';' in it";column 3, then awk -F';' will see it as four fields: $1="column 1", $2=" \"column 2 has a '", $3="' in it\"", $4="column 3".
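
To see that splitting for yourself, you can feed such a row to awk and print each field between brackets. This is just an illustration: the row is made up rather than taken from the question's data, and show_fields is a hypothetical helper name.

```shell
#!/bin/bash
# show_fields (hypothetical helper): print the field count of each input line,
# then every field in brackets so stray spaces and quotes are visible.
show_fields() {
    awk -F';' '{
        print NF
        for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i
    }'
}

# A made-up row with a quoted field that contains the delimiter.
printf '%s\n' "column 1; \"column 2 has a ';' in it\";column 3" | show_fields
# 4
# $1=[column 1]
# $2=[ "column 2 has a ']
# $3=[' in it"]
# $4=[column 3]
```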

If your data doesn't have anything like that, then it's pretty straightforward. First, you want to write each date to its own file:

 awk -F';' 'FNR > 1 { print >> ($1 ".csv") }' v1_2020.csv

That will get you files named after the date, like 2020-01-02.csv.

Now you can merge those into NPROC files, and as long as you only merge whole files, you won't split data from a given date into multiple files. Here's one simple (and not necessarily elegant!) way to do that:

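NPROC=$(nproc)   # number of parts; nothing else in this snippet sets it
declare -i lines=$(cat *-*-*.csv | wc -l) chunk cur
(( chunk = lines / NPROC, cur = 1 ))
for f in *-*-*.csv; do
  cat "$f" >>"file_$cur.csv"
  # start the next part once this one is full, but never go past NPROC parts
  if (( cur < NPROC && $(wc -l <"file_$cur.csv") >= chunk )); then
     (( cur += 1 ))
  fi
done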
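
The parts produced by this loop contain data rows only, while the question's script writes the header to every part. One way to add it afterwards is sketched below; add_header is a hypothetical helper name, and it assumes each part is small enough to be rewritten through a temporary file.

```shell
#!/bin/bash
# add_header HEADER FILE...  (hypothetical helper)
# Rewrites each file so that HEADER becomes its first line.
add_header() {
    local header=$1 f tmp
    shift
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # write header + old contents to the temp file, then swap it into place
        { printf '%s\n' "$header"; cat -- "$f"; } > "$tmp" && mv -- "$tmp" "$f" || return 1
    done
}

# Usage: add_header "$(head -n 1 v1_2020.csv)" file_*.csv
```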

edited Jul 13, 2021 at 14:10

answered Jul 13, 2021 at 13:52

Mark Reed

1 Comment

Guido Over a year ago

Thanks for your input, Mark, and yes, I should not have the issue you mentioned about the column names. Regarding your proposed solution: I wrote that one date should be in only one file, but I also meant that one file may contain multiple dates, since the number of final files I want to output is given by NPROC. I believe your solution fits my use case anyway, because it could be the first step, and as a second step I can merge the files it creates into NPROC files. Does that sound correct to you?
