0

I would like to append a column to a csv file using a bash script given a condition. The condition is that the column in file1.csv must have more than one unique value to be added to newfile.csv. These are not the real files. The original file has a lot more columns/rows.

Something like this:

file1.csv

1, ah, th, ab, a
2, ah, jk, ab, b
3, ah, lk, ab, c
4, ah, hh, ab, d

newfile.csv should be:

1, th, a
2, jk, b
3, lk, c
4, hh, d

This is the script I tried. However, it does not append the new columns. The output is just a csv with the last column of file1.csv that had more than one unique value.

#!/bin/bash
cut -d, -f1 file1.csv > newfile.csv
limit=1
for i in $(seq 2 5); do
   value=$(cat file1.csv | cut -d, -f$i | uniq -u | wc -l)
   if [ $value -gt $limit ]; then
        paste -d, newfile.csv <(cut -d, -f$i file1.csv) > newfile.csv
   else echo "Column $i not appended."
   fi
done

I suspect it may have something to do with the fact that newfile.csv appears twice on one line. I tried creating a new file newfile2.csv for each iteration, but that did not work. I am new to Bash.

  • Does each line have the same number of columns? Commented Mar 6, 2021 at 11:11
  • How big are the files, can they fit into memory? Commented Mar 6, 2021 at 11:33

4 Answers

1

You may use this two-pass awk solution:

awk 'BEGIN {FS=OFS=", "} FNR==NR {for (i=1; i<=NF; ++i) if (!seen[i,$i]++) ++fq[i]; next} {s=""; for (i=1; i<=NF; ++i) if (fq[i] > 1) s = (s == "" ? "" : s OFS ) $i; print s}' file{,}

1, th, a
2, jk, b
3, lk, c
4, hh, d

Expanded form:

awk 'BEGIN {
   FS = OFS = ", "
}
FNR == NR {
   for (i=1; i<=NF; ++i)
      if (!seen[i,$i]++)
         ++fq[i]
   next
}
{
   s = ""
   for (i=1; i<=NF; ++i)
      if (fq[i] > 1)
         s = (s == "" ? "" : s OFS ) $i
   print s
}' file{,}

5 Comments

Can you expand on how to save this as a new csv file? When I tried it on the real csv it did not remove the columns with only one unique value...
Just > outfile at the end of awk command to redirect output to a new file. I have shown generated output in my answer.
That's what I did. The new file still has all the columns of the original file though.
That means your actual input is not same as shown in question. If you provide your actual input and show your expected output then I can trace.
There at the bottom of the code: should I run file.csv{,}? It seems like I need to specify the file extension for it to work.
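If the output still contains every column, one possible cause (an assumption, since the real file is not shown) is that the real file separates fields with a bare comma rather than the comma-plus-space used in the sample. With FS set to ", ", awk then treats each whole line as a single field, and because the lines differ, that one field is always kept. A minimal sketch that accepts either form by making FS a regular expression is below; filter_columns is a hypothetical helper name, and quoted fields with embedded commas are not handled.

```shell
#!/bin/bash
# Hypothetical helper: print only the columns of CSV file "$1" that contain
# more than one distinct value. Assumes no quoted fields with embedded commas.
filter_columns() {
    awk '
    BEGIN { FS = " *, *"; OFS = ", " }   # accept "," with or without spaces
    FNR == NR {                          # first pass: count distinct values per column
        for (i = 1; i <= NF; ++i)
            if (!seen[i, $i]++) ++fq[i]
        next
    }
    {                                    # second pass: print the varying columns
        s = ""; n = 0
        for (i = 1; i <= NF; ++i)
            if (fq[i] > 1) s = (n++ ? s OFS : "") $i
        print s
    }' "$1" "$1"
}
```

It could be called as filter_columns file1.csv > newfile.csv. The file name is passed twice inside the function, so no brace expansion is needed.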
1

Using any awk in any shell on every Unix box, this will work efficiently and use minimal memory:

$ cat tst.awk
BEGIN { FS=OFS=", " }
NR==FNR {
    if ( NR == 1 ) {
        split($0,uniq)
    }
    for (inFldNr in uniq) {
        if ( seen[inFldNr,$inFldNr]++ ) {
            delete seen[inFldNr,$inFldNr]
            delete uniq[inFldNr]
        }
    }
    next
}
FNR==1 {
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if (inFldNr in uniq) {
            out2inFldNr[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNr[outFldNr]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}

$ awk -f tst.awk file1.csv file1.csv
1, th, a
2, jk, b
3, lk, c
4, hh, d


0

Another similar awk approach that scans the file twice:

$ awk -F', ' 'NR==FNR {for(i=1;i<=NF;i++) if(c[i]<2 && !f[i,$i]++) c[i]++; next}
              FNR==1  {for(i=1;i<=NF;i++) if(c[i]>1) a[++k]=i}
                      {for(i=1;i<=k;i++) printf "%s%s",$(a[i]),i==k?ORS:FS}' file{,}

1, th, a
2, jk, b
3, lk, c
4, hh, d

The first pass short-circuits for any column already known to have more than one distinct value, and the second pass visits only the columns that are kept.

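The file{,} notation is bash brace expansion for file file; it supplies the input file twice because the algorithm scans it twice. For the data in the question that would be file1.csv{,}, or simply file1.csv file1.csv.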


0

I solved the problem by writing the pasted output to a second file inside the script:

#!/bin/bash
cut -d, -f1 file1.csv > newfile.csv
limit=1
for i in $(seq 2 5); do
   # Count the distinct values in column i.
   value=$(cut -d, -f"$i" file1.csv | sort -u | wc -l)
   if [ "$value" -gt "$limit" ]; then
        cut -d, -f"$i" file1.csv > column.csv
        # Redirecting straight to newfile.csv would truncate it before paste reads it.
        paste -d, newfile.csv column.csv > newfile2.csv
        mv newfile2.csv newfile.csv
   else echo "Column $i not appended."
   fi
done
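
The loop above rebuilds newfile.csv once for every column it keeps. A possible alternative, sketched here under assumptions, first collects the numbers of the columns to keep and then calls cut a single time, so nothing is ever read from and written to the same file. The name keep_varying_columns is hypothetical, the column count is taken from the first line, and fields are assumed to be plain comma-separated text without quoting. Unlike the loop above, it tests column 1 like every other column instead of always keeping it.

```shell
#!/bin/bash
# Hypothetical helper: print the columns of CSV file "$1" that hold more than
# one distinct value. Assumes plain comma-separated fields without quoting.
keep_varying_columns() {
    local infile=$1 keep="" ncols i distinct
    # Take the column count from the first line.
    ncols=$(head -n 1 "$infile" | awk -F, '{ print NF }')
    i=1
    while [ "$i" -le "$ncols" ]; do
        # sort -u leaves one line per distinct value in column i.
        distinct=$(cut -d, -f"$i" "$infile" | sort -u | wc -l)
        if [ "$distinct" -gt 1 ]; then
            keep="${keep:+$keep,}$i"    # build a list such as 1,3,5
        fi
        i=$((i + 1))
    done
    # One cut call emits all kept columns, so no intermediate files are needed.
    if [ -n "$keep" ]; then
        cut -d, -f"$keep" "$infile"
    fi
}
```

It could be used as keep_varying_columns file1.csv > newfile.csv. Writing to a file other than the input matters here for the same reason as in the loop: the shell truncates the redirect target before the command starts reading.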

