Add missing columns in CSV by headers

Question

I have two files:

adjective,adverb,participle,verb 
0,2,3,5, 
1,2,5,6

and

adjective,adjunct,adverbial,participle,verb
0,2,3,5,4
1,2,5,6,5
1,2,5,6,5

I want to get output like this:

adjective,adjunct,adverb,adverbial,participle,verb
    0,2,0,3,5,4
    1,2,0,5,6,5
    1,2,0,5,6,5

So that the columns were merged based on the headers and sorted in the alphabetic order. I do not care about preserving the numbers from the second files in added columns, they can be filled with 0. The important part is to add the columns that are missing and sort them in the alphabetic order. Join does not help as it joins only by one column. Any ideas?

Because I want to fill the column that is added from the first file to the second with 0 as I do not want to keep the values. — Mar
– Mar, Commented Feb 12, 2015 at 14:33
Have you tried anything? I doubt that bash or sed will be able to do anything. Awk might, but you may be better off with Perl or Python that can handle more complex data structures. Are you familiar with either of those? — David W.
– David W., Commented Feb 12, 2015 at 14:38

glenn jackman · Accepted Answer · 2015-02-12 19:46:10Z

5

I don't understand why join isn't an option:

join -t, -a 1 -o 0,2.2,1.2,2.3,1.3,2.5 file1 file2

adjective,adjunct,adverb,adverbial,participle,verb
0,2,2,3,3,4
1,2,2,5,5,5
1,2,2,5,5,5

-a specified the join field for each file, and -o specifies the output format (which fields from which file)

I may come back to this later. In the meantime, you can extract the merged column headers like this:

paste -d , file1 file2 | sed 1q | tr , '\n' | sed 's/  *$//' | sort -u | paste -d, -s

adjective,adjunct,adverb,adverbial,participle,verb

OK, a GNU awk-only answer:

this reads the headers of file1 and the headers of file2 to get a unique set of headers.
uses the PROCINFO["sorted_in"] feature of gawk to traverse an associative array by lexically sorted order of the indices

gawk -F, '
    NR == 1 {
        n = split($0, f1cols, /,/)
        for (i=1; i<=n; i++) 
            allcols[f1cols[i]] = 1 
    }
    NR == FNR {next} # because you do not care about the values
    FNR == 1 {
        n = split($0, f2cols, /,/)
        for (i=1; i<=n; i++) {
            allcols[f2cols[i]] = 1
            f2colidx[f2cols[i]] = i
        }
        PROCINFO["sorted_in"] = "@ind_str_asc"
        sep = ""
        for (head in allcols) {
            printf "%s%s", sep, head
            sep = FS
        }
        print ""
        next
    }
    {
        sep = ""
        for (col in allcols) {
            val = (col in f2colidx) ? $(f2colidx[col]) : 0
            printf "%s%s", sep, val
            sep = FS
        }
        print ""
    }
' file1 file2

adjective,adjunct,adverb,adverbial,participle,verb
0,2,0,3,5,4
1,2,0,5,6,5
1,2,0,5,6,5

edited Feb 12, 2015 at 19:46

answered Feb 12, 2015 at 14:40

glenn jackman

249k42 gold badges233 silver badges362 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Mar Over a year ago

because: 1. I have files with different amount of columns (up to 200), join is lots of manual typing. 2. I need to join columns by headers not by index, ignoring the duplicating columns. 3. I need to fill the new column with 0 and not transfer the values for the column of another file.

Mar · Accepted Answer · 2015-02-13 11:53:36Z

0

I used a solution similar to this. Somehow I managed to apply awk to it and it seems to do what I want.

head -1 -q annotation1.csv  annotation2.csv | tr , "\n" | sort | uniq > header.txt
header="header.txt"

awk  -F, -v colsFile="$header" -v OFS=','  'BEGIN {
        j=1
        while ((getline < colsFile) > 0) col[j++] = $1
        n=j-1;
        close(colsFile)
        for (i=1; i<=n; i++) {
           s[col[i]]=i 
           printf(col[i])"," 
        }
        print""
     }
     NR==1 {
        for (f=1; f<=NF; f++) c[s[$f]]=f
        next
     }
     { 
       for (f=1; f<=n; f++) 
           if (c[f] == "") {printf 0","} else printf $(c[f])","
       print ""
   }' $1

edited Feb 13, 2015 at 11:53

answered Feb 12, 2015 at 16:10

Mar

615 bronze badges

2 Comments

glenn jackman Over a year ago

Fix your indentation: this is unreadable. It looks like you have not properly closed the BEGIN block.

Mar Over a year ago

Done. Thanks for the remark, I am still new to this:-)

Collectives™ on Stack Overflow

Add missing columns in CSV by headers

2 Answers 2

1 Comment

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related