
I have a folder with a few big CSV files, and I want to split them into a variable number of roughly equal-sized CSV files.

At the moment, this is my implementation, which divides the rows evenly:

#!/bin/bash

# Copy the header to every output part.
head -n 1 "${1}_2021.csv" | awk -v NPROC="$(nproc)" '{ for (i = 0; i < NPROC; ++i) print $0 > ("file_" i ".csv") }'

# Append the data rows (everything but the header) to the parts, round-robin.
tail --silent -n +2 "$1"* | awk -v NPROC="$(nproc)" '{ part = NR % NPROC; print $0 >> ("file_" part ".csv") }'

where $1 is the version of the files, passed as a parameter to the bash script, for instance v1 or v2. The output filenames are not relevant; currently file_"i".csv and file_"part".csv produce the same filenames, where i and part range from 0 to NPROC-1.

A sample of the file v1_2020.csv (semicolon-delimited):

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-02;y;n;56;130  
2020-01-03;y;n;3;153  
2020-01-03;r;n;46;192  
2020-01-03;b;n;20;241  
2020-01-04;w;n;1252;252  
2020-01-05;w;n;453;253  
2020-01-06;b;y;1;279  
2020-01-06;b;n;945;294  

As a table, it looks like this:

DATE COLOUR CLOSING CHANGE
2020-01-02 r n 4
2020-01-02 y n 56
2020-01-03 y n 3
2020-01-03 r n 46
2020-01-03 b n 20
2020-01-03 w n 1252
2020-01-05 w n 453
2020-01-06 b y 1
2020-01-06 b n 945

I want to improve this split so that rows with the same date never end up in different files. It should therefore take the DATE column of the CSV file into account.

Current output with NPROC=2:

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-03;y;n;3;153  
2020-01-03;b;n;20;241  
2020-01-05;w;n;453;253  
2020-01-06;b;n;945;294

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;y;n;56;130  
2020-01-03;r;n;46;192  
2020-01-04;w;n;1252;252  
2020-01-06;b;y;1;279 

New output with NPROC=2:

Any uneven split into NPROC files is fine, as long as no date is spread across different files. Each date should appear in exactly one file, but a file may contain multiple dates.

For instance (any other split into NPROC files is fine if it respects the conditions above):

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-02;y;n;56;130  
2020-01-03;y;n;3;153  
2020-01-03;r;n;46;192  
2020-01-03;b;n;20;241  

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-04;w;n;1252;252  
2020-01-05;w;n;453;253  
2020-01-06;b;y;1;279  
2020-01-06;b;n;945;294

Could you give me a hint towards a possible solution that uses only bash scripting, without Python?

  • They are semicolon-delimited, @TedLyngmo. Commented Jul 13, 2021 at 13:44
  • Edit your question to replace those graphical tables with the raw CSV that you generated them from, so that we can copy and paste those files to test a potential solution against. Commented Jul 13, 2021 at 14:50

3 Answers


If you just want to split a CSV file and add the header to each part, you can do:

awk -v cnt=6 -F ';' 'FNR==1{header=$0; fn=1}
!(FNR%cnt){
    fn++
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv")}' file

If you want to split contextually based on the date column (assuming the input is already sorted by date):

awk -v sp=6 -v fn=1 -F ';' 'FNR==1{header=$0}
cnt++>sp && l1!=$1 {
    fn++
    cnt=0
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv"); l1=$1}' file

Result of the second command:

cat *.csv
DATE;COLOUR;CLOSING;CHANGE
2020-01-02;r;n;4
2020-01-02;y;n;56
2020-01-03;y;n;3
2020-01-03;r;n;46
2020-01-03;b;n;20
2020-01-03;w;n;1252
DATE;COLOUR;CLOSING;CHANGE
2020-01-05;w;n;453
2020-01-06;b;y;1
2020-01-06;b;n;945
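The threshold sp above is a fixed row count, whereas the question asks for a given number of parts. One possible variation, sketched here under the assumption that the input is sorted by date, derives a per-part target from the total row count and NPROC. The names target and part_N.csv are made up for this illustration, and the sample rows are the ones from the question.

```shell
#!/bin/bash
# Sketch: contiguous, date-aware split of a date-sorted CSV into at most NPROC parts.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data copied from the question (v1_2020.csv).
cat > v1_2020.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

NPROC=2                                    # assumption; could also be NPROC=$(nproc)
rows=$(( $(wc -l < v1_2020.csv) - 1 ))     # data rows, header excluded
target=$(( (rows + NPROC - 1) / NPROC ))   # rows per part, rounded up

awk -F';' -v target="$target" '
FNR == 1 { header = $0; next }             # remember the header, do not count it
fn == 0 || (cnt >= target && $1 != last) { # first row, or the part is full and the date changes
    if (out != "") close(out)
    out = "part_" (++fn) ".csv"            # hypothetical output name
    print header > out
    cnt = 0
}
{ print > out; cnt++; last = $1 }
' v1_2020.csv
```

With the sample above and NPROC=2, the rows for 2020-01-02 and 2020-01-03 land in part_1.csv and the remaining dates in part_2.csv. Because a part is only closed once it holds at least target rows, no more than NPROC parts are created, although a single very large date can lead to fewer.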

3 Comments

Unless I'm misreading, that won't work. First, it resets fn to 1 at the start of every file and thus repeatedly overwrites the same file. But also, the print > will overwrite the whole file with each row... that should be >>.
I want to split it in such a way that: 1) each date is contained in only one file part, and 2) a file part may contain multiple dates. I need this consistency in order to process the files in parallel afterwards.
@Guido: It would be helpful to show in your question what you WANT as output.
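awk -F';' -v NPROC=2 '
    NR == 1 {head = $0; next}          # remember the header line
    !($1 in dates) {                   # first row of a date not seen before
        n = (n + 1) % NPROC            # pick the next output file, round-robin
        file = "out_" n ".csv"
        if (!(file in created)) {      # start each output file with the header
            print head > file
            created[file]
        }
        dates[$1] = file               # all rows of this date go to the same file
    }
    { print > dates[$1] }
' v1_2020.csv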

Since NPROC = 2, two output files are created:

$ cat out_0.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253

$ cat out_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
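To confirm that a split respects the condition from the question (no date spread over several files), the output files can be scanned once more with awk. The following is only a sketch: the helper name dates_in_several_files is invented for this example, and the two files are recreated from the output shown above so that the snippet runs on its own.

```shell
#!/bin/bash
# Hypothetical helper: prints every date found in more than one of the given files.
dates_in_several_files() {
    awk -F';' '
        FNR == 1 { next }                                  # skip the header of each file
        ($1 in seen) && seen[$1] != FILENAME { print $1 }  # date already seen in another file
        { seen[$1] = FILENAME }
    ' "$@" | sort -u
}

workdir=$(mktemp -d) && cd "$workdir" || exit 1

# The two output files shown above.
cat > out_0.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253
EOF
cat > out_1.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

offenders=$(dates_in_several_files out_0.csv out_1.csv)
if [ -z "$offenders" ]; then
    echo "OK: each date is confined to a single file"
else
    printf 'Dates spread over several files:\n%s\n' "$offenders"
fi
```

The same function could be pointed at out_*.csv after a real run. It only checks the date constraint, not how evenly the rows are distributed.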


First, processing CSV/TSV files with command-line tools can be tricky. The awk command is the go-to here, but it doesn't have built-in support for quoting; if you have a row like column 1;"column 2 has a ';' in it";column 3, then awk -F';' will see it as $1="column 1", $2="\"column 2 has a '", $3="' in it\"", $4="column 3".
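The effect is easy to reproduce. The snippet below is a small illustration with a made-up row (a simplified variant of the one above, without the inner single quotes, to keep the shell quoting readable); it counts the fields awk sees and prints each one.

```shell
#!/bin/bash
# A row whose second field is quoted and contains the delimiter.
row='column 1;"column 2 has a ; in it";column 3'

# awk -F";" splits on every semicolon, so the quoted field is cut in two.
nf=$(printf '%s\n' "$row" | awk -F';' '{ print NF }')
echo "awk sees $nf fields"

# Show each field exactly as awk sees it.
printf '%s\n' "$row" | awk -F';' '{ for (i = 1; i <= NF; i++) printf "$%d = [%s]\n", i, $i }'
```

The row has three logical columns, but awk reports four fields, which is why the approach below is only safe when no field contains a semicolon.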

If your data doesn't have anything like that, then it's pretty straightforward. First, you want to write each date to its own file:

 awk -F';' 'FNR > 1 { print >> ($1 ".csv") }' v1_2020.csv

That will get you files named after the date, like 2020-01-02.csv.

Now you can merge those into NPROC files, and as long as you only merge whole files, you won't split data from a given date into multiple files. Here's one simple (and not necessarily elegant!) way to do that:

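NPROC=$(nproc)    # number of output files, as in the question
declare -i lines chunk cur=1
lines=$(cat ./*-*-*.csv | wc -l)
(( chunk = (lines + NPROC - 1) / NPROC ))   # round up so that at most NPROC files are created
for f in ./*-*-*.csv; do
  cat "$f" >>"file_$cur.csv"
  if (( $(wc -l <"file_$cur.csv") >= chunk )); then
     (( cur += 1 ))
  fi
done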
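Putting the two steps together, and adding the header line that the question wants in every part, might look like the sketch below. It assumes the sample file from the question, a working directory that contains no other files matching *-*-*.csv, and NPROC=2 for the demonstration. Rounding the chunk size up is a choice made here so that no more than NPROC files are produced.

```shell
#!/bin/bash
# Sketch: split per date, then merge whole date files into at most NPROC parts.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data copied from the question (v1_2020.csv).
cat > v1_2020.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

NPROC=2                                   # assumption; could also be NPROC=$(nproc)
header=$(head -n 1 v1_2020.csv)

# Step 1: one file per date (header skipped); close() avoids too many open files.
awk -F';' 'FNR > 1 { f = $1 ".csv"; print >> f; close(f) }' v1_2020.csv

# Step 2: merge whole date files, starting each part with the header.
lines=$(cat ./*-*-*.csv | wc -l)
chunk=$(( (lines + NPROC - 1) / NPROC ))  # rows per part, rounded up
cur=1
for f in ./*-*-*.csv; do
    [ -e "file_$cur.csv" ] || printf '%s\n' "$header" > "file_$cur.csv"
    cat "$f" >> "file_$cur.csv"
    # the header line is not counted against the chunk size
    if (( $(wc -l < "file_$cur.csv") - 1 >= chunk )); then
        (( cur += 1 ))
    fi
    rm -- "$f"                            # remove the per-date intermediate file
done
```

Since the glob expands in sorted order and the dates are in ISO format, consecutive dates end up in the same part. With the sample data this yields file_1.csv with the 2020-01-02 and 2020-01-03 rows and file_2.csv with the rest.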

1 Comment

Thanks for your input, Mark, and yes, I should not have the issue you mentioned about the column names. Regarding your proposed solution: I wrote that one date should be in only one file, but I also meant that one file may contain multiple dates, since the number of final files I want to output is given by NPROC. I believe your solution fits my use case anyway, because it could be the first step, and as a second step I can merge the files it creates into NPROC files. Does that sound correct to you?
