Currently we receive multiple large CSV files whose contents we need to insert/update into our database. The schema of our database does not change. We only need specific columns in a specific order, which are stated in a header-DB. These required columns could change at any given point. The column order of the CSV files we receive can also change at any time.
So what I did is pipe the required columns from the header-DB into this script ($TEMP_FILE) and extract those columns from the received CSV ($REC_CSV).
This is working fine thus far:
awk 'NR==FNR{
    Clm=Clm (Clm?"|":"")$1
    next
}
FNR==1{
    for (i=1;i<=NF;i++)
    {
        if (match($i,Clm))
        {
            Ar[++n]=i
        }
    }
}
FNR>=2{
    for (i=1; i<=n; i++)
    {
        printf (i<n)? $(Ar[i]) FS : $(Ar[i])
    }
    printf "\n"
}' FS="|" ${TEMP_FILE} $REC_CSV >> $MY_NEW_CSV_WITH_HEADER
TEMP_FILE (From Header DB):
id|anotherId|hahahaIdontExist|timestamp
Input-CSV:
id|timestamp|anotherId|thisId|andThisId|andAnotherId
1|2:00|34|44|44|41
2|2:00|34|45|44|41
3|3:00|35|46|44|41
Output:
id|anotherId|timestamp
1|34|2:00
2|34|2:00
3|35|3:00
But here is the problem: hahahaIdontExist is not in the output.
As the name implies, this column does not exist in the received CSV, but it still needs to appear in the output as an empty field (just the FS).
Desired Output:
id|anotherId|hahahaIdontExist|timestamp
1|34||2:00
2|34||2:00
3|35||3:00
As I believe it is easier (and safer) to keep the first script (and I have tried thousands of drafts), do you have a suggestion on how to fill the non-existent columns into the output?
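One direction that might work is sketched below. It is only a minimal sketch under some assumptions, not a tested replacement for the script above. It assumes the wanted names in $TEMP_FILE are separated by "|", one per line, or both, and it compares header names exactly instead of using match(). The array names want and pos are made up for illustration. It also prints the header line itself, which may not be wanted if the header is written to the output file elsewhere.

```shell
# Sample data copied from the question; the temp files stand in for the real ones.
TEMP_FILE=$(mktemp)
REC_CSV=$(mktemp)
printf '%s\n' 'id|anotherId|hahahaIdontExist|timestamp' > "$TEMP_FILE"
printf '%s\n' 'id|timestamp|anotherId|thisId|andThisId|andAnotherId' \
              '1|2:00|34|44|44|41' \
              '2|2:00|34|45|44|41' \
              '3|3:00|35|46|44|41' > "$REC_CSV"

result=$(awk -F'|' '
NR == FNR {
    # First file: remember the wanted column names in the wanted order.
    for (i = 1; i <= NF; i++) want[++n] = $i
    next
}
FNR == 1 {
    # CSV header: map each column name to its position in this file.
    for (i = 1; i <= NF; i++) pos[$i] = i
    # Print the wanted header, including names the CSV does not have.
    for (i = 1; i <= n; i++) printf "%s%s", want[i], (i < n ? FS : "\n")
    next
}
{
    for (i = 1; i <= n; i++) {
        # A wanted name that is missing from the CSV header becomes an empty field.
        val = (want[i] in pos) ? $(pos[want[i]]) : ""
        printf "%s%s", val, (i < n ? FS : "\n")
    }
}' "$TEMP_FILE" "$REC_CSV")

printf '%s\n' "$result"
rm -f "$TEMP_FILE" "$REC_CSV"
```

The idea is to loop over the wanted names rather than over the columns found in the CSV, so a missing name still occupies a slot in every output row. Using a fixed format string with printf also avoids problems if a field ever contains a % character.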
Best regards