I have a script:

#!/bin/bash

sed -E 's/([^,]*,([^,]*),) ?(([[:alpha:]])[^ ]* +)(([^,]*),[^,]*,)[^,]*/\1\u\3\u\5\L\4\6\2@google.com/' file.csv > output.csv

And a file.csv:

id,location_id,name,title,email,directorate
1,1, Amy Lee,Singer,,, 
2,2,brad Pitt,Actor,[email protected],Production 
3,5,steven spielberg,Producer,Screenwriter, [email protected],Production
4,8,Andy lee,Comedian,,Radio

There are a few problems that I need to resolve:

  • The title field can hold more than one title, for example: Steven Spielberg - Producer,Screenwriter. The script currently cuts the value off after the comma, but I need to keep all titles.
  • The script concatenates the first letter of the first name and the last name, plus location_id and @google.com, but I need to add location_id only when two rows would otherwise get the same email.

In the end it should be:

id,location_id,name,title,email,directorate
1,1, Amy Lee,Singer,[email protected],, 
2,2,Brad Pitt,Actor,[email protected],Production 
3,5,Steven Spielberg,Producer,Screenwriter,[email protected],Production
4,8,Andy Lee,Comedian,[email protected],Radio
  • If you can have commas in the field of a CSV file, the field should be delimited with quotes. Unfortunately, regular expressions are generally not powerful enough to parse formats like that. Commented Nov 9, 2022 at 17:22
  • Generally not solvable imo. The floating entries could be associated to any column. Even if you check the @ in email and deduce its location you're out of luck if the email is missing. Commented Nov 9, 2022 at 17:42
  • @AndreWildberg maybe not with sed? Is there any other way to solve this problem from the start? Commented Nov 9, 2022 at 17:45
  • With a lot of assumptions (e.g. names have spaces etc) it could be solved for a special case/file. But there is no stable general approach other than a correct user input with quotes around the field. Imagine this entry ,,,,,,,,,,. Which one is the entry with comma inside a field? Only correct user input will tell you e.g. ,,,",,,,,",, Commented Nov 9, 2022 at 18:06
  • Making up random email addresses is a bad idea; you end up having spammers pick them up and send unwanted messages to real people if those addresses happen to be taken. Use @example.com for examples; it's guaranteed to not have this problem. Commented Nov 9, 2022 at 18:08
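
The quoting advice in the comments can be sketched as a preprocessing step. This is only an illustration under a strong assumption that the question does not guarantee: id, location_id and name are always the first three fields, email and directorate are always the last two, and anything in between belongs to title. The function name quote_titles is hypothetical. Rows with stray extra commas elsewhere (such as row 1 of file.csv) cannot be disambiguated this way.

```shell
#!/bin/bash
# Sketch: quote the title field so embedded commas stop being ambiguous.
# ASSUMPTION: the first three fields are id, location_id, name and the last
# two are email, directorate; everything in between is treated as the title.
quote_titles() {
  awk -F, -v OFS=, '
    NR == 1 || NF < 6 { print; next }   # header and short rows pass through unchanged
    {
      title = $4
      for (i = 5; i <= NF - 2; i++) title = title "," $i   # rejoin the split title parts
      if (NF > 6) title = "\"" title "\""                  # quote only when a comma was inside
      print $1, $2, $3, title, $(NF - 1), $NF
    }
  ' "$@"
}
```

It could be used as `quote_titles file.csv > quoted.csv`; a CSV-aware tool can then read the title as a single field.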

1 Answer

Assuming that directorate doesn't contain commas:

sed -E 's/([^,]*,([0-9]+), *([[:alpha:]])[^ ]* *([[:alpha:]]*).*,)(,[^,]*$)/\1\3\4\[email protected]\5/'

This always appends the location_id. It is possible to append it only when the generated email would otherwise be duplicated, by adding a second filter. Also consider using the id (first column) instead, in case two rows share both location and name.
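
One way such a second filter might look is a two-pass awk script rather than a second sed expression: the first pass counts how often each candidate address occurs, the second fills in the empty email fields. This is a sketch, not a tested replacement for the sed command above. It assumes email and directorate are the last two fields and contain no commas, fills only empty email fields, and leaves name capitalization out. The names fill_emails and local_part are hypothetical, and the example.com default follows the advice in the comments.

```shell
#!/bin/bash
# Sketch of a two-pass filter: count candidate addresses, then fill them in.
# ASSUMPTIONS: field 2 is location_id, field 3 is name, and the last two
# fields are email and directorate; the domain defaults to a placeholder.
fill_emails() {
  local file=$1 domain=${2:-example.com}
  awk -F, -v OFS=, -v domain="$domain" '
    # Build "first initial + last word of the name", lowercased.
    function local_part(name,    w, n) {
      gsub(/^ +| +$/, "", name)
      n = split(name, w, / +/)
      return tolower(substr(w[1], 1, 1) w[n])
    }
    FNR == 1 { if (NR != FNR) print; next }      # header: skipped in pass 1, printed in pass 2
    NR == FNR { seen[local_part($3)]++; next }   # pass 1: count each candidate
    $(NF - 1) == "" {                            # pass 2: fill only empty email fields
      lp = local_part($3)
      $(NF - 1) = lp (seen[lp] > 1 ? $2 : "") "@" domain
    }
    { print }
  ' "$file" "$file"
}
```

A call such as `fill_emails file.csv > output.csv` reads the file twice, so it needs a real file rather than a pipe. Candidates are only compared with each other, not with addresses already present in the email column.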

1 Comment

And how can I add a new filter?
