Add a column to a csv file using a bash script

Question

I would like to append a column to a csv file using a bash script given a condition. The condition is that the column in file1.csv must have more than one unique value to be added to newfile.csv. These are not the real files. The original file has a lot more columns/rows.

Something like this:

file1.csv

1, ah, th, ab, a
2, ah, jk, ab, b
3, ah, lk, ab, c
4, ah, hh, ab, d

newfile.csv should be:

1, th, a
2, jk, b
3, lk, c
4, hh, d

This is the script I tried. However, it does not append the new columns. The output is just a csv with the last column of file1.csv that had more than one unique value.

#!/bin/bash
cut -d, -f1 file1.csv > newfile.csv
limit=1
for i in $(seq 2 5); do
   value=$(cat file1.csv | cut -d, -f$i | uniq -u | wc -l)
   if [ $value -gt $limit ]; then
        paste -d, newfile.csv <(cut -d, -f$i file1.csv) > newfile.csv
   else echo "Column $i not appended."
   fi
done

I suspect it may have something to do with the fact that newfile.csv appears twice on one line. I tried creating a new file newfile2.csv for each iteration, but that did not work. I am new to Bash.

Does each line have the same number of columns?

– M. Nejat Aydin, Commented Mar 6, 2021 at 11:11
How big are the files, can they fit into memory?

– James Brown, Commented Mar 6, 2021 at 11:33

anubhava · Accepted Answer · 2021-03-06 11:35:39Z

1

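You may use this two-pass awk solution, which reads the file once to count the distinct values in each column and a second time to print the columns that have more than one: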

awk 'BEGIN {FS=OFS=", "} FNR==NR {for (i=1; i<=NF; ++i) if (!seen[i,$i]++) ++fq[i]; next} {s=""; for (i=1; i<=NF; ++i) if (fq[i] > 1) s = (s == "" ? "" : s OFS ) $i; print s}' file{,}

1, th, a
2, jk, b
3, lk, c
4, hh, d

Expanded form:

awk 'BEGIN {
   FS = OFS = ", "
}
FNR == NR {
   for (i=1; i<=NF; ++i)
      if (!seen[i,$i]++)
         ++fq[i]
   next
}
{
   s = ""
   for (i=1; i<=NF; ++i)
      if (fq[i] > 1)
         s = (s == "" ? "" : s OFS ) $i
   print s
}' file{,}

edited Mar 6, 2021 at 11:35

answered Mar 6, 2021 at 11:23

anubhava

5 Comments

Trex Over a year ago

Can you expand on how to save this as a new csv file? When I tried it on the real csv it did not remove the columns with only one unique value...

anubhava Over a year ago

Just add > outfile at the end of the awk command to redirect the output to a new file. I have shown the generated output in my answer.

Trex Over a year ago

That's what I did. The new file still has all the columns of the original file though.

anubhava Over a year ago

That means your actual input is not the same as the one shown in the question. If you provide your actual input and show your expected output, I can trace the problem.

Trex Over a year ago

At the end of the command, should I use file.csv{,}? It seems like I need to specify the file extension for it to work.
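
The real input was never posted, so the following is only a guess at what went wrong. If the real file separates fields with a bare "," rather than ", ", then FS=", " never matches, every line becomes a single field that differs from row to row, and the whole line is printed unchanged. A minimal sketch under that assumption, using a hypothetical real.csv in a temporary directory and the same two-pass logic as the answer, with the output redirected to a new file:

```shell
#!/bin/bash
tmpdir=$(mktemp -d)

# Hypothetical input: fields separated by "," with no following space.
printf '1,ah,th\n2,ah,jk\n3,ah,lk\n' > "$tmpdir/real.csv"

# The two-pass logic from the answer; FS and OFS are passed in with -v.
prog='FNR == NR { for (i = 1; i <= NF; ++i) if (!seen[i, $i]++) ++fq[i]; next }
      { s = ""; for (i = 1; i <= NF; ++i) if (fq[i] > 1) s = (s == "" ? "" : s OFS) $i; print s }'

# FS=", " finds no separator here, so each line is one field and is kept whole.
unchanged=$(awk -v FS=', ' -v OFS=', ' "$prog" "$tmpdir/real.csv" "$tmpdir/real.csv")

# FS=", *" (a comma followed by optional spaces) splits both layouts.
awk -v FS=', *' -v OFS=', ' "$prog" "$tmpdir/real.csv" "$tmpdir/real.csv" > "$tmpdir/newfile.csv"
filtered=$(cat "$tmpdir/newfile.csv")

printf 'with FS=", ":\n%s\n' "$unchanged"
printf 'with FS=", *":\n%s\n' "$filtered"
```

The file name is given twice with its full name, including the extension, because awk has to read it once per pass.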

Ed Morton · Accepted Answer · 2021-03-07 00:22:21Z

1

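Using any awk in any shell on every Unix box, this will work efficiently and use minimal memory. Note that it keeps only the columns in which no value is repeated, which produces the expected output for the sample input: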

$ cat tst.awk
BEGIN { FS=OFS=", " }
NR==FNR {
    if ( NR == 1 ) {
        split($0,uniq)
    }
    for (inFldNr in uniq) {
        if ( seen[inFldNr,$inFldNr]++ ) {
            delete seen[inFldNr,$inFldNr]
            delete uniq[inFldNr]
        }
    }
    next
}
FNR==1 {
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if (inFldNr in uniq) {
            out2inFldNr[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNr[outFldNr]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}

$ awk -f tst.awk file1.csv file1.csv
1, th, a
2, jk, b
3, lk, c
4, hh, d

edited Mar 7, 2021 at 0:22

answered Mar 7, 2021 at 0:14

Ed Morton

karakfa · Accepted Answer · 2021-03-06 15:53:28Z

0

Another similar awk solution that scans the file twice:

$ awk -F', ' 'NR==FNR {for(i=1;i<=NF;i++) if(c[i]<2 && !f[i,$i]++) c[i]++; next}
                FNR==1  {for(i=1;i<=NF;i++) if(c[i]>1) a[++k]=i}
                        {for(i=1;i<=k;i++) printf "%s%s",$(a[i]),i==k?ORS:FS}' file{,}

1, th, a
2, jk, b
3, lk, c
4, hh, d

It short-circuits for columns that already have more than one unique value, and when printing it loops only over the columns that were selected.

The file{,} notation means file file, to provide the input file twice due to the double scanning algorithm.
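
As a quick check of that expansion, here is a small sketch. Brace expansion is a bash feature rather than a POSIX sh one, so the example calls bash explicitly; file1.csv is the name from the question and does not need to exist for the expansion to happen.

```shell
#!/bin/bash
# The empty alternatives on either side of the comma repeat the prefix twice.
expanded=$(bash -c 'echo file1.csv{,}')
echo "$expanded"    # file1.csv file1.csv
```

So with a file named file1.csv, the argument would be written file1.csv{,}, or simply file1.csv file1.csv.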

answered Mar 6, 2021 at 15:53

karakfa

Trex · Accepted Answer · 2021-03-09 07:59:13Z

0

I solved the problem by writing the paste output to a second file inside the script and then copying it back:

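#!/bin/bash
# Start the output with the first column.
cut -d, -f1 file1.csv > newfile.csv
limit=1
for i in $(seq 2 5); do
   # Count the distinct values in column i (sort -u also catches non-adjacent repeats).
   value=$(cut -d, -f"$i" file1.csv | sort -u | wc -l)
   if [ "$value" -gt "$limit" ]; then
        cut -d, -f"$i" file1.csv > column.csv
        # Write to a second file: redirecting into newfile.csv would truncate it before paste reads it.
        paste -d, newfile.csv column.csv > newfile2.csv
        cp newfile2.csv newfile.csv
   else echo "Column $i not appended."
   fi
done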
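
The first attempt lost data because the shell sets up the > redirection, which empties the target file, before paste starts reading it. A minimal sketch of that behaviour and of the temporary-file workaround, using hypothetical scratch files (acc.csv, col.csv, acc.tmp) in a temporary directory:

```shell
#!/bin/bash
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf '1\n2\n' > acc.csv
printf 'a\nb\n' > col.csv

# Broken: acc.csv is truncated by the redirection before paste opens it,
# so the original "1" and "2" are gone by the time paste runs.
paste -d, acc.csv col.csv > acc.csv
broken=$(cat acc.csv)

# Workaround: write to a different file, then replace the original.
printf '1\n2\n' > acc.csv
paste -d, acc.csv col.csv > acc.tmp && mv acc.tmp acc.csv
fixed=$(cat acc.csv)

printf 'broken:\n%s\nfixed:\n%s\n' "$broken" "$fixed"
```

Using mv instead of cp avoids leaving the intermediate file behind, but either works.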

answered Mar 9, 2021 at 7:59

Trex