Add a character to duplicate emails using bash only

Question

Input data:

id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],

I need to add a character to duplicate emails after the email part, i.e. [email protected] would become [email protected] (the digit after the email part needs to be taken from the second column). How can I do that in Bash for all entries?

This seems like it might be a follow-up to your recent question on SO, stackoverflow.com/questions/79541962/…. Not sure why you'd switch sites for a follow-up, but you should at least reference that previous question to provide some context for this one. FWIW, my answer to this question is not the same as the answer I would have given if you wanted to add this functionality while populating that email field as in your previous question.
– Ed Morton, Commented Mar 29 at 13:23
Duplicate with: How to edit next line after pattern using sed?
– Luuk, Commented Mar 29 at 13:47
It’s unclear what you mean by “using bash only”. Do you mean to implement a CSV parser in plain bash?
– Kusalananda ♦, Commented Mar 29 at 17:18
@NetRanger for future questions you might want to say "Using bash plus mandatory POSIX tools (and GNU equivalents?)" or similar to clarify your restrictions on available tools.
– Ed Morton, Commented Mar 30 at 11:08

Ed Morton (edited by terdon) · Accepted Answer · 2025-03-30 13:06:24Z

Assuming that when you say:

[email protected] would become [email protected]

you mean that only the repeats of an email address (its 2nd and any later occurrences) should change, not the first one, then using any awk:

$ awk 'BEGIN{FS=OFS=","} {f=NF-1} NR>1 && seen[$f]++{sub(/@/,$2"&",$f)} 1' file
id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],

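The core of this is the sub() call, which replaces the @ in the email field ($f, the second-to-last field) with the value of the second field ($2) followed by the @ itself (in the replacement string, & stands for whatever the regexp matched). The seen[$f]++ test is false the first time an address appears and true on every later occurrence, so only the repeats are modified, and NR>1 keeps the header line untouched.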
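
As a minimal sketch of that sub() behaviour in isolation (the address jdoe@example.com, the digit 7, and the function name insert_before_at are made up for illustration and are not taken from the question):

```shell
# insert_before_at is a hypothetical helper name used only in this sketch.
# $1 = an address, $2 = the text to insert in front of its first "@".
insert_before_at() {
    printf '%s\n' "$1" |
        awk -v ins="$2" '{ sub(/@/, ins "&"); print }'   # "&" re-inserts the matched "@"
}

insert_before_at 'jdoe@example.com' 7    # prints jdoe7@example.com
```

Note that this assumes the inserted text contains no & or backslash characters, since both are special in a sub() replacement string.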

For more information on processing CSV with awk, see "What's the most robust way to efficiently parse CSV using awk?".

edited Mar 30 at 13:06

terdon♦

answered Mar 29 at 13:19

Ed Morton

What if I need to replace every occurrence (so not just the 2nd and later ones)? My task description changes on the fly because the environment where it gets tested is not the same as mine.
– NetRanger, Commented Apr 4 at 15:06
@NetRanger post a new question as Chameleon Questions are strongly discouraged and it's hard to answer any question without sample input/output that demonstrates the requirements and would let you test a potential solution.
– Ed Morton, Commented Apr 4 at 15:23
Well, I can't create a new one because users complain that I shouldn't do it. I suspect some of them are just trolls whereas I actually need some help with coding. All I need to know is what in your code modifies every 2nd occurrence (which I can change to every).
– NetRanger, Commented Apr 4 at 17:52
No one's going to complain about you creating a new question referencing this one and saying it's a follow-up. Just make sure to use the answer you got here as the starting point for the new question rather than asking it again from the perspective of your original question, and also show your attempts to modify this answer to solve your new problem.
– Ed Morton, Commented Apr 4 at 17:55
I did, and it was heavily downvoted, no idea why. I guess they thought it was exactly the same as this one.
– NetRanger, Commented Apr 4 at 17:58

Kusalananda · Answer · 2025-04-04 18:14:15Z

Using Miller (mlr), we can read the data as CSV, count the number of times each value of the [email protected] field occurs (this adds a temporary field called count), modify the [email protected] field if needed (if count is greater than 1), and then delete the temporary count field:

mlr --from input.csv --csv \
    count-similar -g '[email protected]' then \
    put '$count > 1 {
        a = splita($["[email protected]"], "@");
        $["[email protected]"] = a[1] . $location_id . "@" . a[2];
    }' then \
    cut -x -f count

The modification of the [email protected] field is triggered by the $count > 1 test and is carried out by splitting the field on the @ character and then splicing the parts together again with the value of the location_id field inserted.

Instead of a split+join operation, you could do this with a sub() call, similar to what Ed Morton shows in his awk code:

mlr --from input.csv --csv \
    count-similar -g '[email protected]' then \
    put '$count > 1 {
        $["[email protected]"] = sub($["[email protected]"], "@", $location_id . "@");
    }' then \
    cut -x -f count

The result:

id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],

Using "only bash" (no external utilities):

declare -A seen
while IFS= read -r line; do
        addr=${line%,*}
        addr=${addr##*,}

        if [ "${seen[$addr]}" = 1 ]; then
                loc=${line#*,}
                loc=${loc%%,*}

                if [[ $line =~ (.*)@(.*) ]]; then
                        line=${BASH_REMATCH[1]}$loc@${BASH_REMATCH[2]}
                fi
        fi
        seen[$addr]=1

        printf '%s\n' "$line"
done <input.csv

This makes a number of assumptions about the input that the CSV-aware code at the start of this answer (using Miller) would handle without issue:

The fields are fixed and do not move around between runs.
The 1st, 2nd, penultimate, and last fields never contain embedded commas.
There is no value in the [email protected] field that is [email protected].
No field in the entire file contains an embedded newline.
The @ character occurs only once on each input line, and it's in the [email protected] field.
The 2nd field is never quoted.

Output:

id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],
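
The parameter expansions that the script relies on can be tried on a single line. This is only a sketch: the sample line uses an invented address, and the extraction works only under the assumptions listed above.

```shell
# Sample record with a made-up address (not taken from the question's data).
line='9,7,Bart Charlow,Executive Director,bcharlow@example.com,'

addr=${line%,*}     # remove the shortest trailing ",*": drops the empty last field
addr=${addr##*,}    # remove the longest leading "*,": leaves the penultimate field

loc=${line#*,}      # remove the shortest leading "*,": drops the 1st field
loc=${loc%%,*}      # remove the longest trailing ",*": leaves the 2nd field

printf 'addr=%s loc=%s\n' "$addr" "$loc"    # prints addr=bcharlow@example.com loc=7
```

A quoted comma in the 3rd or 4th field does not disturb this, because each value is reached from the nearest end of the line.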

By counting how many times each address is found, using a separate initial pass over the data, we can modify the script to add the number to each duplicate address:

declare -A seen

while IFS= read -r line; do
        addr=${line%,*}
        addr=${addr##*,}

        seen[$addr]=$(( seen[$addr] + 1 ))
done <input.csv

while IFS= read -r line; do
        addr=${line%,*}
        addr=${addr##*,}

        if [ "${seen[$addr]}" -gt 1 ]; then
                loc=${line#*,}
                loc=${loc%%,*}

                if [[ $line =~ (.*)@(.*) ]]; then
                        line=${BASH_REMATCH[1]}$loc@${BASH_REMATCH[2]}
                fi
        fi

        printf '%s\n' "$line"
done <input.csv

This code obviously has the same restrictions as the previous bash script snippet.

Output:

id,location_id,name,title,[email protected],department
1,1,Susan houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
8,6,Bart charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],
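
The same two-pass idea can also be sketched in awk by reading the file twice. This is an illustrative variant rather than one of the scripts shown above: the function name mark_all_duplicates is hypothetical, and the plain comma splitting assumes that the 2nd and the penultimate fields never contain embedded commas.

```shell
# mark_all_duplicates is a hypothetical helper name; $1 is the CSV file.
mark_all_duplicates() {
    awk '
        BEGIN { FS = OFS = "," }
        # First pass over the file: count each value of the penultimate field.
        NR == FNR { count[$(NF-1)]++; next }
        # Second pass: skip the header and edit every row whose address
        # occurs more than once, including its first occurrence.
        FNR > 1 && count[$(NF-1)] > 1 { sub(/@/, $2 "&", $(NF-1)) }
        { print }
    ' "$1" "$1"
}
```

It would be called as mark_all_duplicates input.csv, with the file named twice internally so that awk reads it once for counting and once for printing.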

@NetRanger Hmm, and still you have accepted an answer which doesn’t use bash at all? Please explain.
– Kusalananda ♦, Commented Mar 29 at 17:14
By Bash only, I meant using only packages that come pre-installed with an OS.
– NetRanger, Commented Mar 29 at 18:46
@NetRanger On my OSes (Alpine Linux and OpenBSD), bash is not pre-installed. Installing it is as easy as installing mlr though.
– Kusalananda ♦, Commented Mar 29 at 20:26
