I have a folder with a few big CSV files, and I want to split them into a variable number of roughly equally sized CSV files.
At the moment, this is my implementation of an even-sized split:
#!/bin/bash
# copy the header line to every output part
head -n 1 "$1"_2021.csv | awk -v NPROC="$(nproc)" '{ for (i = 0; i < NPROC; ++i) print $0 > ("file_" i ".csv") }'
# distribute the data rows (everything but the headers) round-robin across the parts
tail --silent -n+2 "$1"* | awk -v NPROC="$(nproc)" '{ part = NR % NPROC; print $0 >> ("file_" part ".csv") }'
where $1 is the version of the files, passed as a parameter to the bash script, for instance v1 or v2.
The output filenames are not relevant; currently file_"i".csv and file_"part".csv produce the same filenames, where part and i lie in the range [0, NPROC), i.e. 0 to NPROC-1.
Some sample rows from the file v1_2020.csv (semicolon-delimited):
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
As a table, it looks like this:
| DATE | COLOUR | CLOSING | CHANGE | Y |
|---|---|---|---|---|
| 2020-01-02 | r | n | 4 | 119 |
| 2020-01-02 | y | n | 56 | 130 |
| 2020-01-03 | y | n | 3 | 153 |
| 2020-01-03 | r | n | 46 | 192 |
| 2020-01-03 | b | n | 20 | 241 |
| 2020-01-04 | w | n | 1252 | 252 |
| 2020-01-05 | w | n | 453 | 253 |
| 2020-01-06 | b | y | 1 | 279 |
| 2020-01-06 | b | n | 945 | 294 |
I want to improve this split so that rows with the same date are never spread across different files. In other words, it should take the DATE column of the CSV file into account.
Current output with NPROC=2:
file_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-03;y;n;3;153
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253
2020-01-06;b;n;945;294
file_2.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;y;n;56;130
2020-01-03;r;n;46;192
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
Desired output with NPROC=2:
Any uneven split into NPROC files is acceptable, as long as rows with the same date never end up in different files.
Each date must appear in exactly one file, but a file may contain multiple dates.
For instance (any other split into NPROC files is fine if it respects the conditions above):
file_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
file_2.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
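To make the condition above checkable, here is a small sketch that lists every date found in more than one part file. The function name `dates_in_multiple_files` is made up for illustration; it assumes each file starts with a header line and that DATE is the first semicolon-separated field, as in the samples.

```bash
#!/bin/bash
# Hypothetical helper: print each DATE that occurs in more than one of the given files.
# Assumes a header on line 1 of every file and DATE as the first ';'-separated field.
dates_in_multiple_files() {
    awk -F';' '
        FNR == 1 { next }                      # skip the header of each file
        !(($1, FILENAME) in seen) {            # first time this date appears in this file
            seen[$1, FILENAME] = 1
            if (++files[$1] == 2) print $1     # report the date once, when a second file contains it
        }
    ' "$@"
}
```

Called as `dates_in_multiple_files file_*.csv`, it should print nothing for a split that respects the condition, and the offending dates for my current round-robin output.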
Could you give me a hint towards a possible solution that uses only bash scripting, without Python?
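For reference, this is the direction I have been considering, as a rough sketch rather than a finished solution: remember in an awk array which part each date was assigned to, and send every newly seen date to the part that currently holds the fewest rows. The function name `split_by_date` is made up, and the sketch assumes DATE is the first semicolon-separated field, that every input file starts with the same header, and that the number of distinct dates fits in memory.

```bash
#!/bin/bash
# Sketch: split CSV rows into NPROC files so that all rows sharing a DATE
# land in the same file. split_by_date is a hypothetical helper name;
# the file_<n>.csv output names follow the script in the question.
split_by_date() {
    local nproc=$1; shift
    awk -F';' -v NPROC="$nproc" '
        FNR == 1 {                             # header line of each input file
            if (NR == 1)                       # copy the very first header to every part
                for (i = 0; i < NPROC; ++i) { out = "file_" i ".csv"; print > out }
            next                               # never treat a header as data
        }
        !($1 in part) {                        # first time this date is seen
            best = 0
            for (i = 1; i < NPROC; ++i)        # pick the part with the fewest rows so far
                if (rows[i] + 0 < rows[best] + 0) best = i
            part[$1] = best
        }
        {
            p = part[$1]                       # every row follows its date
            rows[p]++
            out = "file_" p ".csv"
            print > out
        }
    ' "$@"
}
```

It would be called as `split_by_date "$(nproc)" "$1"*` from the script. The resulting parts are only roughly balanced, since a date is assigned when it is first seen and is never moved afterwards, but the input does not have to be sorted by date.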