
Currently we receive multiple large CSV files, which we need to insert/update into our database. The schema of our database does not change. We only need specific columns in a specific order, which are stated in a header database. These could change at any given point, and the column order of the CSV files we receive can also change at any time.

So what I did is pipe the required columns from the header DB into this script ($TEMP_FILE) and extract the required columns from the received CSV ($REC_CSV).

This is working fine thus far:

awk 'NR==FNR{
        Clm=Clm (Clm?"|":"")$1
        next
    }

    FNR==1{
        for (i=1;i<=NF;i++) 
        {
            if (match($i,Clm)) 
            {
                Ar[++n]=i
            }
        }
    }

    FNR>=2{
        for (i=1; i<=n; i++)
        {
            printf (i<n)? $(Ar[i]) FS : $(Ar[i])
        }
        printf "\n"
    }' FS="|" ${TEMP_FILE} $REC_CSV >> $MY_NEW_CSV_WITH_HEADER

TEMP_FILE (From Header DB):

id|anotherId|hahahaIdontExist|timestamp

Input-CSV:

id|timestamp|anotherId|thisId|andThisId|andAnotherId
1|2:00|34|44|44|41
2|2:00|34|45|44|41
3|3:00|35|46|44|41

Output:

id|anotherId|timestamp
1|34|2:00
2|34|2:00
3|35|3:00

But here is the problem: hahahaIdontExist is not in the output.

As vaguely implied above, this variable was not supposed to exist in the first place, but it needs to appear in the output as well, as an empty field.

Desired Output:

id|anotherId|hahahaIdontExist|timestamp
1|34||2:00
2|34||2:00
3|35||3:00

As I believe it is easier (and safer) to keep the first script (and I have tried countless drafts), do you have a suggestion on how to fill the non-existent columns into the output?

Best regards

2 Answers


With your shown samples, could you please try the following. Written and tested in GNU awk.

awk '
BEGIN{
  FS=OFS="|"
}
FNR==NR{
  for(i=1;i<=NF;i++){
    arr[$i]=i
  }
  print
  next
}
FNR==1{
  PROCINFO["sorted_in"] = "@val_num_asc"
  num=split($0,currVal,"|")
  for(k=1;k<=num;k++){
    currVal1[currVal[k]]=k
  }
  for(u in arr){
    if(u in currVal1){
       realArr[++count]=currVal1[u]
       delete arr[u]
    }
    else{
       realArr[++count]="NA"
    }
  }
  next
}
{
  for(k=1;k<=count;k++){
     printf("%s%s",(realArr[k]!="NA"?$realArr[k]:OFS),(k==count?ORS:realArr[k]!="NA"?OFS:""))
  }
}
' temp_file  input.csv

Sample output will be as follows.

id|anotherId|hahahaIdontExist|timestamp
1|34||2:00
2|34||2:00
3|35||3:00


I would do it this way:

awk -F\| -v outHeader="$(< "$TEMP_FILE")" '
NR == 1 {
  for (i = 1; i <= NF; ++i)
    inTitleToIdx[$i] = i
  idxEmptyField = NF+1000
  maxOutIdx = split(outHeader, outIdxToTitle)
  for (i = 1; i <= maxOutIdx; ++i) {
    inIdx = inTitleToIdx[outIdxToTitle[i]]
    outIdxToInIdx[i] = inIdx == "" ? idxEmptyField : inIdx 
  }
  print outHeader
}
NR > 1 {
  sep=""
  for (i = 1; i <= maxOutIdx; ++i) {
    printf "%s%s", sep, $outIdxToInIdx[i]
    sep = FS
  }
  print ""
}
' inputFile

Note: You don't need the temporary file $TEMP_FILE. You could also write -v outHeader="id|anotherId|hahahaIdontExist|timestamp" or -v outHeader="$(commandThatReadsTheHeaderFromTheDB)".
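To see the whole thing end to end, here is a self-contained run of this answer against the sample data from the question. The file names temp_file and input.csv are placeholders, and $(cat temp_file) stands in for the bash-only $(< "$TEMP_FILE") so the snippet also works in plain sh:

```shell
# Recreate the sample header file and input CSV from the question.
cat > temp_file <<'EOF'
id|anotherId|hahahaIdontExist|timestamp
EOF

cat > input.csv <<'EOF'
id|timestamp|anotherId|thisId|andThisId|andAnotherId
1|2:00|34|44|44|41
2|2:00|34|45|44|41
3|3:00|35|46|44|41
EOF

# Run the answer's awk script and capture the result.
result=$(awk -F\| -v outHeader="$(cat temp_file)" '
NR == 1 {
  for (i = 1; i <= NF; ++i)
    inTitleToIdx[$i] = i
  idxEmptyField = NF + 1000          # index of a field that is always empty
  maxOutIdx = split(outHeader, outIdxToTitle)
  for (i = 1; i <= maxOutIdx; ++i) {
    inIdx = inTitleToIdx[outIdxToTitle[i]]
    outIdxToInIdx[i] = inIdx == "" ? idxEmptyField : inIdx
  }
  print outHeader
}
NR > 1 {
  sep = ""
  for (i = 1; i <= maxOutIdx; ++i) {
    printf "%s%s", sep, $outIdxToInIdx[i]
    sep = FS
  }
  print ""
}
' input.csv)

printf '%s\n' "$result"

rm -f temp_file input.csv
```

Running this prints the desired output from the question: the missing hahahaIdontExist column appears as an empty field because $idxEmptyField refers to a field index beyond NF, which awk expands to the empty string.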

2 Comments

Thanks a lot! Almost perfect solution :) (Had to anonymize many components, but the general approach is perfect!)
@AtroCty Glad to hear that. Although I noticed that it was a bit inefficient. In the old version all fields from the input file were stored in an array even though some of them were not printed at all. I changed this answer to a new version which should be more efficient. (You still can see the old version by checking the version history of this answer)
