
I have >100 CSV files, each containing >1000 measurements, structured like the following two example files:

MR44825_radiomics_MCA.csv

Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565

MR47987_radiomics_MCA.csv

Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026

The label is always a string of varying length; the delimiter between label and measurement is always the first ":". Every file contains the same set of labels. The measurement values themselves may contain commas, but in that case the related values are enclosed in {}.

Now I want to merge these files, preferably using bash. The output CSV should be structured like the following:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026
  • We encourage questioners to show what they have tried so far to solve the problem themselves. Commented Jan 1, 2021 at 15:18
  • There is no CSV input file in your question. Please edit your question to contain a minimal reproducible example with concise, testable sample input (at least 2 CSVs since you're asking for help to merge multiple files) and the expected output given that input. Commented Jan 1, 2021 at 15:40
  • As you have commas in one of your data fields (JSON), this will not work at the end. Commented Jan 1, 2021 at 15:58

2 Answers


Assumptions:

  • OP has a reason for using a non-JSON format (and OP is 'ok' with having commas (,) as both delimiter and data)
  • all source files have the same number of lines
  • there are no blank lines in any of the source files
  • all source files have the same labels preceding the first :
  • all source files have their labels in the same order
  • the number, and spelling, of labels is not known up front (ie, we'll need to dynamically parse, store and print the labels)
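Before merging, the label-related assumptions can be checked cheaply. A minimal sketch, using the two sample files from the question (recreated here in a temp directory so the snippet is self-contained; your real files already exist, so only the final loop applies):

```shell
# Sketch: verify every MR*_radiomics_MCA.csv has the same labels
# (the text before the first ":"), in the same order, as the first file.
cd "$(mktemp -d)"

cat > MR44825_radiomics_MCA.csv <<'EOF'
Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565
EOF

cat > MR47987_radiomics_MCA.csv <<'EOF'
Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026
EOF

set -- MR*_radiomics_MCA.csv
ref=$(cut -d: -f1 "$1")          # labels of the first file, one per line
status=ok
for f in "$@"; do
  [ "$(cut -d: -f1 "$f")" = "$ref" ] || { echo "label mismatch in $f" >&2; status=mismatch; }
done
echo "$status"
```

Note that cut -d: -f1 takes everything up to the first colon, so the colons inside the {} values never interfere.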

One awk idea:

NOTE: a bit lengthy due to need to dynamically process labels

awk '
BEGIN  { split("",hdr)                        # declare hdr as an array
         split("",data)                       # declare data as an array
         ndx=1                                # init array index
       }

function print_row() {                        # function to print a row

pfx=""                                        # first column will have a NULL prefix

if ( length(hdr) > 0  )                       # print the header row?
   { for ( i in hdr )
       { printf "%s%s", pfx, hdr[i]
         pfx=","                              # 2nd-nth columns will have a "," prefix
       }
     printf "\n"
     split("",hdr)                            # clear hdr[] array so we do not print it again
   }

pfx=""                                        # reset prefix for printing data row

if ( length(data) > 0 )                       # print a data row?
   { for ( i in data )
         { printf "%s%s", pfx, data[i]
           pfx=","                            # 2nd-nth columns will have a "," prefix
         }
     printf "\n"
     split("",data)                           # clear the data[] array for the next file
     ndx=1                                    # reset our array index for the next file
   }
}

FNR==1 { print_row() }                        # if this is a new file then print contents of last file

       { if ( FNR==NR )                       # if this is the first file then make sure to populate the hdr[] array
            hdr[ndx]=gensub(/:$/,"","g",$1)   # strip trailing ":" from field #1; store in hdr[] array
         $1=""                                # clear field #1
         data[ndx]=gensub(/^ /,"","g",$0)     # strip leading " " from the line; store in data[] array
         ndx++                                # increment array index
         next
       }

END    { print_row() }                        # flush last set of data[] to stdout

' MR*MCA.csv

When run against the two sample data files this generates:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

12 Comments

First of all thanks a lot for the time and effort you invested here! Your assumptions are all correct, besides the last one. I do know the number and spelling of all the labels and can provide them if helpful. Since I'm a novice in coding in general and in awk in particular, I might need a bit of time to implement this before I can give you feedback. Simply executing the code you provided in the folder of a few test csv files produces the following error for me:
awk: line 7: illegal reference to variable hdr awk: line 8: illegal reference to variable hdr awk: line 12: illegal reference to variable hdr awk: line 18: illegal reference to variable data awk: line 19: illegal reference to variable data
hmmmm ... what's the output from running awk --version on your system? if those are the only error messages, they're all from array references inside the function; I'm wondering if your awk version requires the pre-declaration steps at the top of the script ?? I've edited the answer to move the BEGIN block to the top so the array pre-declarations now occur prior to the function definition.
awk --version didn't work, resulted in awk: not an option: --version. awk -W version gave a result, mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan. So updating mawk might bring the solution?
Running the edited code including BEGIN results in the following error: awk: 2: unexpected character 0xc2 awk: 2: unexpected character 0xc2 awk: 3: unexpected character 0xc2 awk: 4: unexpected character 0xc2 awk: 5: unexpected character 0xc2
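Two things worth noting from this thread: gensub() is a GNU awk extension that mawk does not provide, and the "unexpected character 0xc2" errors are the classic symptom of non-breaking spaces (UTF-8 bytes C2 A0) introduced by copy-pasting the script from a web page. A minimal sketch of the same label/value split using POSIX sub(), which mawk does support, on one illustrative line:

```shell
# Portable variant of the answer's two gensub() calls, using sub() instead.
out=$(printf '%s\n' 'Case-1_Image: test.nii.gz' | awk '
  { label=$1
    sub(/:$/, "", label)   # strip trailing ":" from the label
    $1=""                  # clear field 1; $0 is rebuilt with a leading space
    line=$0
    sub(/^ /, "", line)    # strip that leading space from the value
    print label "->" line
  }')
echo "$out"
```

Unlike gensub(), sub() modifies its target variable in place and returns a count, which is why the result is captured via the label and line variables rather than as a return value.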

CSV doesn't really make sense for the data you presented, and the pseudo-CSV you say you want as output makes little sense and will be hard to process further. Perhaps converting each input file to JSON would make more sense, and allow for processing using standard tools.

awk -F ': ' 'FNR==1 { name=$2 }
    FNR==2 { j = substr($0, length($1)+3); gsub(/\047/, "\042", j) }
    FNR==3 { sub(/^{/, "{\042name\042: \042" name "\042,", j);
        sub(/}$/, ",\042mean\042: " $2 "}", j);
        print j }' *.csv >output.jsonl

The output should be something like

{"name":"MR20584_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean": -917.2822725885565}
{"name":"MR30211_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean":-1024.287275914652}

This format is JSON lines, i.e. each line is valid JSON, but the file itself isn't proper JSON.
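One way to sanity-check such a file is to parse each line independently. A sketch assuming python3 is on PATH; the one-line sample file written here is illustrative, standing in for the real output.jsonl:

```shell
# Sketch: verify every line of a JSON Lines file is valid standalone JSON.
printf '%s\n' '{"name": "MR44825_head.nii.gz", "mean": -917.28}' > output.jsonl
result=$(python3 -c '
import json
for line in open("output.jsonl"):
    json.loads(line)      # raises on the first invalid line
print("all lines valid")
')
echo "$result"
```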

Demo: https://ideone.com/ue0Crd

Of course, if you can fix the tool which generated this useless format in the first place, even better.

4 Comments

Not sure why it's better to use \042 than \".
I guess it's not better per se; do you think it's worse?
For me personally it just makes the code slightly harder to read and it takes me a few secs to remind/convince myself that \042 is " and then I spend time trying to figure out what's causing the code to require \042s instead of just the character they represent before eventually concluding there's no practical reason for it so for me \042 is worse than " or \" as needed but I suppose YMMV. Down that path though - what other characters could we replace with ASCII escape sequences... could make for some very interesting scripts :-).
Yeah, I could totally appreciate an Obfuscated Awk Contest ... To boldly take Awk to the place where sed has supremely reigned ever since people stopped writing TECO macros.
