bash Replace column in csv with a substring of that column

Question

I have a CSV and in one of the columns I have fields like

1. ABD_1&SC;1233;5665;123445
2. 120585_AOP9_AIDMS3&SC;0947;64820;0173

I need to replace this column with

1. ABD_1
2. AOP9_AIDMS3

Essentially, I want everything from the first alphabetical character (the substring will never start with a numeric value) up to, but not including, the &. I thought I could use the regex

[a-zA-Z].+?(?=\&)

and awk to extract the column and replace it, but this is proving beyond my beginner skillset. Iterating over the string in a loop and writing some bash to parse it out is impractical, as the file has some 20 million+ entries.

Can anyone help?

If your real data has multiple columns then show multiple columns in your sample input/output. It matters when considering and testing potential solutions.
– Ed Morton, Commented Nov 16, 2019 at 18:04

Amessihel · Accepted Answer · 2019-11-16 18:01:30Z

As a first step, assume your CSV has only one column; this makes the complete solution for multiple columns easier to understand.

One column

You can use this sed command:

sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' test.csv

Explanation:

-r: use extended regular expressions, so that parentheses and the + symbol need no backslash escaping (-E is the more portable spelling of the same option)
^[^a-zA-Z]*: skip any non-alpha characters at the beginning, ...
([a-zA-Z]+[^&;]+) ... then capture one or more alpha characters followed by at least one more character that is neither an ampersand & nor a semicolon ; ...
.*$ ... and skip any remaining characters up to the end of the line (if there are any, they must begin with an ampersand or a semicolon, because sed pattern matching is greedy, i.e. it tries to match the longest sequence) ...
\1 ... and replace the whole matched text (the entire line, since the regex covers it) with the captured sequence.

Working example:

$ sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' << 'EOF'
> ABD_1&SC;1233;5665;123445
> 120585_AOP9_AIDMS3&SC;0947;64820;0173
> EOF
ABD_1
AOP9_AIDMS3
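
A couple of edge cases are worth checking before running this over millions of lines. Because the capture group is [a-zA-Z]+[^&;]+, it needs at least two characters, so a one-letter name directly followed by & does not match and the line is left as it was. The sketch below is a small variant, not the command from the answer: it uses [a-zA-Z][^&;]* instead, and wraps the call in a hypothetical helper named extract_name. The sample lines are made up for illustration.

```shell
# extract_name: hypothetical helper, reads lines on stdin.
# Variant of the one-column command: [a-zA-Z][^&;]* also accepts one-letter names.
extract_name() {
  sed -E 's/^[^a-zA-Z]*([a-zA-Z][^&;]*).*$/\1/'
}

printf '%s\n' 'ABD_1&SC;1233' '42_X&SC;1' '12345;678' | extract_name
# ABD_1
# X
# 12345;678   (no letter on the line, so sed leaves it unchanged)
```

Lines on which the pattern does not match are printed unmodified, which is sed's normal behavior for the s command.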

Multiple columns

It looks like you want to process a specific column. To process the n-th column, you can use this command, which builds on the previous one (here with n = 3):

sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/'

^(([^,]+,){<n-1>}) captures the first n-1 columns; replace <n-1> with the actual number (0 for the first column works too), and then ...
[^a-zA-Z]*([a-zA-Z]+[^&;,]+) ... skips any leading non-alpha characters and captures one or more alpha characters followed by a sequence of characters other than ampersand &, semicolon ; or comma, then ...
[^,]* ... skips any remaining characters that are not a comma ...
(.*)$ ... and captures the remaining columns, i.e. the rest of the line; since every non-comma character was already skipped, this sequence, if it exists, must begin with a comma; finally ...
\1\3\4/ ... replaces the whole matched text (the entire line, since the regex covers it) with the following captured sequences:
- \1 : the first n-1 columns (\2 is nested inside it)
- \3 : the text we want to keep from the n-th column
- \4 : the remaining columns, if any
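
The column number can be passed in as a parameter instead of being edited into the expression by hand. The following is a minimal sketch under a few assumptions: strip_col is a hypothetical helper name, the input has no quoted fields containing commas, and the pattern is tweaked slightly compared with the one above. The tweaks are [^a-zA-Z,]* for the leading skip (the plain [^a-zA-Z]* can also match commas, so a target column without any letter would let the match spill into the next column), [^,]* so that empty leading columns are accepted, and [a-zA-Z][^&;,]* so that one-letter names match. The sample line is made up.

```shell
# strip_col N: hypothetical helper; rewrites column N of comma-separated stdin.
strip_col() {
  # Group 1 holds the first N-1 columns, group 3 the text to keep,
  # group 4 the remaining columns (starting with a comma, if present).
  sed -E "s/^(([^,]*,){$(( $1 - 1 ))})[^a-zA-Z,]*([a-zA-Z][^&;,]*)[^,]*(.*)\$/\1\3\4/"
}

printf '%s\n' 'id1,99_NAME&x;1,tail' | strip_col 2
# id1,NAME,tail
```

The expression is in double quotes so that the shell can substitute the repetition count; \$ keeps the end-of-line anchor literal.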

Working example (it processes the third column):

$ sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/' << 'EOF'
> plaf,plafy,ABD_1&SC;1233;5665;123445,plet
> trouf,troufi,120585_AOP9_AIDMS3&SC;0947;64820;0173,plot
> EOF
plaf,plafy,ABD_1,plet
trouf,troufi,AOP9_AIDMS3,plot
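
Since the question mentions awk, here is a sketch of the same idea using awk instead of sed. It is an alternative, not the method from the answer: awk splits the line on commas, match() finds the first letter and everything after it up to the next & or ;, and only that field is rewritten. The helper name strip_col_awk is hypothetical, and the sketch assumes plain comma-separated data without quoted fields that contain commas.

```shell
# strip_col_awk N: hypothetical helper; rewrites column N of comma-separated stdin.
strip_col_awk() {
  awk -F, -v OFS=, -v col="$1" '
    # match() sets RSTART/RLENGTH to the leftmost, longest match in the field.
    match($col, /[a-zA-Z][^&;]*/) { $col = substr($col, RSTART, RLENGTH) }
    # Print every line; lines whose field has no letter pass through unchanged.
    { print }
  '
}

printf '%s\n' 'plaf,plafy,ABD_1&SC;1233;5665;123445,plet' \
              'trouf,troufi,120585_AOP9_AIDMS3&SC;0947;64820;0173,plot' | strip_col_awk 3
# plaf,plafy,ABD_1,plet
# trouf,troufi,AOP9_AIDMS3,plot
```

Because the field is addressed by number, there is no repetition count to maintain in the pattern, which may be easier to adapt when the column position changes.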
