
Currently we receive multiple large CSV files, which we need to insert/update into our database. The schema of our database does not change. We only need specific columns in a specific order, which are stated in a header database. These could change at any given point, and the column order of the CSV files we receive can also change at any time.

So what I did is pipe the required columns from the header DB into this script ($TEMP_FILE) and extract the required columns from the received CSV ($REC_CSV).

This is working fine thus far:

awk 'NR==FNR{
        Clm=Clm (Clm?"|":"")$1
        next
    }

    FNR==1{
        for (i=1;i<=NF;i++) 
        {
            if (match($i,Clm)) 
            {
                Ar[++n]=i
            }
        }
    }

    FNR>=2{
        for (i=1; i<=n; i++)
        {
            printf (i<n)? $(Ar[i]) FS : $(Ar[i])
        }
        printf "\n"
    }' FS="|" ${TEMP_FILE} $REC_CSV >> $MY_NEW_CSV_WITH_HEADER

TEMP_FILE (From Header DB):

id|anotherId|hahahaIdontExist|timestamp

Input-CSV:

id|timestamp|anotherId|thisId|andThisId|andAnotherId
1|2:00|34|44|44|41
2|2:00|34|45|44|41
3|3:00|35|46|44|41

Output:

id|anotherId|timestamp
1|34|2:00
2|34|2:00
3|35|3:00

But here is the problem: hahahaIdontExist is not in the output.

As vaguely implied above, this variable was not supposed to exist in the first place, but it needs to appear in the output as well, as an empty field.

Desired Output:

id|anotherId|hahahaIdontExist|timestamp
1|34||2:00
2|34||2:00
3|35||3:00

As I believe it is easier (and safer) to keep the first script (and I have tried countless drafts), do you have a suggestion on how to fill the non-existent columns into the output?

Best regards

2 Answers


With your shown samples, could you please try the following. Written and tested in GNU awk.

awk '
BEGIN{
  FS=OFS="|"
}
FNR==NR{
  for(i=1;i<=NF;i++){
    arr[$i]=i
  }
  print
  next
}
FNR==1{
  PROCINFO["sorted_in"] = "@val_num_asc"
  num=split($0,currVal,"|")
  for(k=1;k<=num;k++){
    currVal1[currVal[k]]=k
  }
  for(u in arr){
    if(u in currVal1){
       realArr[++count]=currVal1[u]
       delete arr[u]
    }
    else{
       realArr[++count]="NA"
    }
  }
  next
}
{
  for(k=1;k<=count;k++){
     printf("%s%s",(realArr[k]!="NA"?$realArr[k]:OFS),(k==count?ORS:realArr[k]!="NA"?OFS:""))
  }
}
' temp_file  input.csv

Sample output will be as follows.

id|anotherId|hahahaIdontExist|timestamp
1|34||2:00
2|34||2:00
3|35||3:00


I would do it this way:

awk -F\| -v outHeader="$(< "$TEMP_FILE")" '
NR == 1 {
  for (i = 1; i <= NF; ++i)
    inTitleToIdx[$i] = i
  idxEmptyField = NF+1000
  maxOutIdx = split(outHeader, outIdxToTitle)
  for (i = 1; i <= maxOutIdx; ++i) {
    inIdx = inTitleToIdx[outIdxToTitle[i]]
    outIdxToInIdx[i] = inIdx == "" ? idxEmptyField : inIdx 
  }
  print outHeader
}
NR > 1 {
  sep=""
  for (i = 1; i <= maxOutIdx; ++i) {
    printf "%s%s", sep, $outIdxToInIdx[i]
    sep = FS
  }
  print ""
}
' inputFile

Note: You don't need the temporary file $TEMP_FILE. You could also write -v outHeader="id|anotherId|hahahaIdontExist|timestamp" or -v outHeader="$(commandThatReadsTheHeaderFromTheDB)".
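To see the whole thing end to end, here is a self-contained run of this answer against the sample data from the question. The file names temp_file and input.csv are placeholders, and $(cat temp_file) stands in for the bash-only $(< "$TEMP_FILE") so the snippet also works in plain sh:

```shell
# Recreate the sample header file and input CSV from the question.
cat > temp_file <<'EOF'
id|anotherId|hahahaIdontExist|timestamp
EOF

cat > input.csv <<'EOF'
id|timestamp|anotherId|thisId|andThisId|andAnotherId
1|2:00|34|44|44|41
2|2:00|34|45|44|41
3|3:00|35|46|44|41
EOF

# Run the answer's awk script and capture the result.
result=$(awk -F\| -v outHeader="$(cat temp_file)" '
NR == 1 {
  for (i = 1; i <= NF; ++i)
    inTitleToIdx[$i] = i
  idxEmptyField = NF + 1000          # index of a field that is always empty
  maxOutIdx = split(outHeader, outIdxToTitle)
  for (i = 1; i <= maxOutIdx; ++i) {
    inIdx = inTitleToIdx[outIdxToTitle[i]]
    outIdxToInIdx[i] = inIdx == "" ? idxEmptyField : inIdx
  }
  print outHeader
}
NR > 1 {
  sep = ""
  for (i = 1; i <= maxOutIdx; ++i) {
    printf "%s%s", sep, $outIdxToInIdx[i]
    sep = FS
  }
  print ""
}
' input.csv)

printf '%s\n' "$result"

rm -f temp_file input.csv
```

Running this prints the desired output from the question: the missing hahahaIdontExist column appears as an empty field because $idxEmptyField refers to a field index beyond NF, which awk expands to the empty string.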

2 Comments

Thanks a lot! Almost perfect solution :) (Had to anonymize many components, but the general approach is perfect!)
@AtroCty Glad to hear that. Although I noticed that it was a bit inefficient. In the old version all fields from the input file were stored in an array even though some of them were not printed at all. I changed this answer to a new version which should be more efficient. (You still can see the old version by checking the version history of this answer)
