
I have >100 CSV files, each containing >1000 measurements, structured like the following two example files:

MR44825_radiomics_MCA.csv

Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565

MR47987_radiomics_MCA.csv

Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026

The label is always a string of varying length; the delimiter between label and measurement is always the first ":". Every file contains the same set of labels. The measurement values themselves may contain commas, but in that case the related values are enclosed in {}.

Now I want to merge these files, preferably using bash. The output CSV should be structured like the following:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026
  • We encourage questioners to show what they have tried so far to solve the problem themselves. Commented Jan 1, 2021 at 15:18
  • There is no CSV input file in your question. Please edit your question to contain a minimal reproducible example with concise, testable sample input (at least 2 CSVs since you're asking for help to merge multiple files) and the expected output given that input. Commented Jan 1, 2021 at 15:40
  • As you have commas in one of your data fields (JSON), this will not work at the end. Commented Jan 1, 2021 at 15:58

2 Answers


Assumptions:

  • OP has a reason for using a non-JSON format (and OP is 'ok' with having commas (,) as both delimiter and data)
  • all source files have the same number of lines
  • there are no blank lines in any of the source files
  • all source files have the same labels preceding the first :
  • all source files have their labels in the same order
  • the number, and spelling, of labels is not known up front (ie, we'll need to dynamically parse, store and print the labels)
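Before merging, the label-related assumptions can be checked cheaply. A minimal sketch, using the two sample files from the question (recreated here in a temp directory so the snippet is self-contained; your real files already exist, so only the final loop applies):

```shell
# Sketch: verify every MR*_radiomics_MCA.csv has the same labels
# (the text before the first ":"), in the same order, as the first file.
cd "$(mktemp -d)"

cat > MR44825_radiomics_MCA.csv <<'EOF'
Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565
EOF

cat > MR47987_radiomics_MCA.csv <<'EOF'
Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026
EOF

set -- MR*_radiomics_MCA.csv
ref=$(cut -d: -f1 "$1")          # labels of the first file, one per line
status=ok
for f in "$@"; do
  [ "$(cut -d: -f1 "$f")" = "$ref" ] || { echo "label mismatch in $f" >&2; status=mismatch; }
done
echo "$status"
```

Note that cut -d: -f1 takes everything up to the first colon, so the colons inside the {} values never interfere.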

One awk idea:

NOTE: a bit lengthy due to need to dynamically process labels

awk '
BEGIN  { split("",hdr)                        # declare hdr as an array
         split("",data)                       # declare data as an array
         ndx=1                                # init array index
       }

function print_row() {                        # function to print a row

pfx=""                                        # first column will have a NULL prefix

if ( length(hdr) > 0  )                       # print the header row?
   { for ( i in hdr )
       { printf "%s%s", pfx, hdr[i]
         pfx=","                              # 2nd-nth columns will have a "," prefix
       }
     printf "\n"
     split("",hdr)                            # clear hdr[] array so we do not print it again
   }

pfx=""                                        # reset prefix for printing data row

if ( length(data) > 0 )                       # print a data row?
   { for ( i in data )
         { printf "%s%s", pfx, data[i]
           pfx=","                            # 2nd-nth columns will have a "," prefix
         }
     printf "\n"
     split("",data)                           # clear the data[] array for the next file
     ndx=1                                    # reset our array index for the next file
   }
}

FNR==1 { print_row() }                        # if this is a new file then print contents of last file

       { if ( FNR==NR )                       # if this is the first file then make sure to populate the hdr[] array
            hdr[ndx]=gensub(/:$/,"","g",$1)   # strip trailing ":" from field #1; store in hdr[] array
         $1=""                                # clear field #1
         data[ndx]=gensub(/^ /,"","g",$0)     # strip leading " " from the line; store in data[] array
         ndx++                                # increment array index
         next
       }

END    { print_row() }                        # flush last set of data[] to stdout

' MR*MCA.csv

When run against the two sample data files this generates:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

12 Comments

First of all thanks a lot for the time and effort you invested here! Your assumptions are all correct, besides the last one. I do know the number and spelling of all the labels and can provide them if helpful. Since I'm a novice in coding in general and in awk in particular, I might need a bit of time to implement this before I can give you feedback. Simply executing the code you provided in the folder of a few test csv files produces the following error for me:
awk: line 7: illegal reference to variable hdr awk: line 8: illegal reference to variable hdr awk: line 12: illegal reference to variable hdr awk: line 18: illegal reference to variable data awk: line 19: illegal reference to variable data
hmmmm ... what's the output from running awk --version on your system? if those are the only error messages, they're all from array references inside the function; I'm wondering if your awk version requires the pre-declaration steps at the top of the script ?? I've edited the answer to move the BEGIN block to the top so the array pre-declarations now occur prior to the function definition.
awk --version didn't work, resulted in awk: not an option: --version. awk -W version gave a result, mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan. So updating mawk might bring the solution?
Running the edited code including BEGIN results in the following error: awk: 2: unexpected character 0xc2 awk: 2: unexpected character 0xc2 awk: 3: unexpected character 0xc2 awk: 4: unexpected character 0xc2 awk: 5: unexpected character 0xc2
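Two things worth noting from this thread: gensub() is a GNU awk extension that mawk does not provide, and the "unexpected character 0xc2" errors are the classic symptom of non-breaking spaces (UTF-8 bytes C2 A0) introduced by copy-pasting the script from a web page. A minimal sketch of the same label/value split using POSIX sub(), which mawk does support, on one illustrative line:

```shell
# Portable variant of the answer's two gensub() calls, using sub() instead.
out=$(printf '%s\n' 'Case-1_Image: test.nii.gz' | awk '
  { label=$1
    sub(/:$/, "", label)   # strip trailing ":" from the label
    $1=""                  # clear field 1; $0 is rebuilt with a leading space
    line=$0
    sub(/^ /, "", line)    # strip that leading space from the value
    print label "->" line
  }')
echo "$out"
```

Unlike gensub(), sub() modifies its target variable in place and returns a count, which is why the result is captured via the label and line variables rather than as a return value.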

CSV doesn't really make sense for the data you presented, and the pseudo-CSV you say you want as output makes little sense and will be hard to process further. Perhaps converting each input file to JSON would make more sense, and allow for processing using standard tools.

awk -F ': ' 'FNR==1 { name=$2 }
    FNR==2 { j = substr($0, length($1)+3); gsub(/\047/, "\042", j) }
    FNR==3 { sub(/^{/, "{\042name\042: \042" name "\042,", j);
        sub(/}$/, ",\042mean\042: " $2 "}", j);
        print j }' *.csv >output.jsonl

The output should be something like

{"name":"MR20584_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean": -917.2822725885565}
{"name":"MR30211_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean":-1024.287275914652}

This format is JSON lines, i.e. each line is valid JSON, but the file itself isn't proper JSON.
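One way to sanity-check such a file is to parse each line independently. A sketch assuming python3 is on PATH; the one-line sample file written here is illustrative, standing in for the real output.jsonl:

```shell
# Sketch: verify every line of a JSON Lines file is valid standalone JSON.
printf '%s\n' '{"name": "MR44825_head.nii.gz", "mean": -917.28}' > output.jsonl
result=$(python3 -c '
import json
for line in open("output.jsonl"):
    json.loads(line)      # raises on the first invalid line
print("all lines valid")
')
echo "$result"
```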

Demo: https://ideone.com/ue0Crd

Of course, if you can fix the tool which generated this useless format in the first place, even better.

4 Comments

Not sure why it's better to use \042 than \".
I guess it's not better per se; do you think it's worse?
For me personally it just makes the code slightly harder to read and it takes me a few secs to remind/convince myself that \042 is " and then I spend time trying to figure out what's causing the code to require \042s instead of just the character they represent before eventually concluding there's no practical reason for it so for me \042 is worse than " or \" as needed but I suppose YMMV. Down that path though - what other characters could we replace with ASCII escape sequences... could make for some very interesting scripts :-).
Yeah, I could totally appreciate an Obfuscated Awk Contest ... To boldly take Awk to the place where sed has supremely reigned ever since people stopped writing TECO macros.
