The following uses Miller (mlr) to read the data as CSV, count the number of times each value of the [email protected] field occurs (which adds a temporary field called count), modify the [email protected] field where needed (where count is greater than 1), and then delete the temporary count field.
mlr --from input.csv --csv \
    count-similar -g '[email protected]' then \
    put '$count > 1 {
        a = splita($["[email protected]"], "@");
        $["[email protected]"] = a[1] . $location_id . "@" . a[2];
    }' then \
    cut -x -f count
The modification of the [email protected] field is triggered by the $count > 1 test. It is carried out by splitting the field on the @ character and then joining the parts together again with the value of the location_id field inserted before the @.
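For readers more used to shell parameter expansion than to Miller's DSL, the same split-and-splice idea can be sketched on a single string as below. This is only an illustration: the function name insert_loc and the sample address are made up for the example and are not part of the Miller command or of the input data.

```bash
# Hypothetical helper: insert a location ID just before the "@" of an address.
# $1 = address, $2 = location ID. Assumes the address contains exactly one "@".
insert_loc() {
    # ${1%%@*} is the text before the first "@",
    # ${1#*@}  is the text after the first "@".
    printf '%s%s@%s\n' "${1%%@*}" "$2" "${1#*@}"
}

insert_loc 'jane@example.com' 2    # prints: jane2@example.com
```

The Miller expression a[1] . $location_id . "@" . a[2] does the equivalent job on the two array elements returned by splita().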
Instead of a split+join operation, you could do this with a sub() call, similar to what Ed Morton shows in his awk code:
mlr --from input.csv --csv \
    count-similar -g '[email protected]' then \
    put '$count > 1 {
        $["[email protected]"] = sub($["[email protected]"], "@", $location_id . "@");
    }' then \
    cut -x -f count
The result:
id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],
Using "only bash" (no external utilities):
declare -A seen
while IFS= read -r line; do
    # Extract the penultimate field (the address).
    addr=${line%,*}
    addr=${addr##*,}
    if [ "${seen[$addr]}" = 1 ]; then
        # Address seen before: extract the 2nd field (location_id)
        # and insert it before the "@".
        loc=${line#*,}
        loc=${loc%%,*}
        if [[ $line =~ (.*)@(.*) ]]; then
            line=${BASH_REMATCH[1]}$loc@${BASH_REMATCH[2]}
        fi
    fi
    seen[$addr]=1
    printf '%s\n' "$line"
done <input.csv
This makes a number of assumptions about the input that the CSV-aware code at the start of this answer (using Miller) would handle without issue:
- The fields are fixed and do not move around between runs.
- The 1st, 2nd, penultimate, and last fields never contain embedded commas.
- There is no value in the [email protected] field that is [email protected].
- No field in the entire file contains an embedded newline.
- The @ character occurs only once on each input line, and it's in the [email protected] field.
- The 2nd field is never quoted.
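To see why these assumptions matter, the two field extractions used in the loop can be tried in isolation. The sketch below wraps them in two helper functions; the names second_field and penultimate_field, and the sample lines with example.org addresses, are invented for this illustration.

```bash
# Hypothetical helpers mirroring the parameter expansions used in the loop.

# Print the 2nd comma-separated field of the line given as $1.
second_field() {
    local rest
    rest=${1#*,}                # drop the 1st field and its comma
    printf '%s\n' "${rest%%,*}" # keep everything up to the next comma
}

# Print the penultimate comma-separated field of the line given as $1.
penultimate_field() {
    local rest
    rest=${1%,*}                # drop the last field and its comma
    printf '%s\n' "${rest##*,}" # keep everything after the last remaining comma
}

# A line that satisfies the assumptions (quoted comma only in the 4th field):
good='3,2,Brenda brown,"Director, Second Career Services",b.brown@example.org,'
second_field "$good"            # prints: 2
penultimate_field "$good"       # prints: b.brown@example.org

# A line that breaks them (embedded comma in the last field):
bad='1,1,Susan houston,Director,s.houston@example.org,"Sales, East"'
penultimate_field "$bad"        # prints: "Sales
```

The second call to penultimate_field returns part of the quoted last field instead of the address, which is the kind of failure a CSV-aware tool avoids.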
Output:
id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],
By counting how many times each address occurs in a separate initial pass over the data, we can modify the script to add the location ID to every occurrence of a duplicated address, including the first:
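declare -A seen
# First pass: count the occurrences of each address (penultimate field).
while IFS= read -r line; do
    addr=${line%,*}
    addr=${addr##*,}
    seen[$addr]=$(( seen[$addr] + 1 ))
done <input.csv
# Second pass: modify every line whose address occurs more than once.
while IFS= read -r line; do
    addr=${line%,*}
    addr=${addr##*,}
    if [ "${seen[$addr]}" -gt 1 ]; then
        # Extract the 2nd field (location_id) and insert it before the "@".
        loc=${line#*,}
        loc=${loc%%,*}
        if [[ $line =~ (.*)@(.*) ]]; then
            line=${BASH_REMATCH[1]}$loc@${BASH_REMATCH[2]}
        fi
    fi
    printf '%s\n' "$line"
done <input.csv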
This code obviously has the same restrictions as the previous bash script snippet.
Output:
id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],