
I have a CSV and in one of the columns I have fields like

1. ABD_1&SC;1233;5665;123445
2. 120585_AOP9_AIDMS3&SC;0947;64820;0173

I need to replace this column with

1. ABD_1
2. AOP9_AIDMS3

Essentially, I need everything from the first alphabetical character (the substring will never start with a numeric value) up to the &. I thought I could use the regex

[a-zA-Z].+?(?=\&)

with awk to extract the column and replace it, but this is proving beyond my beginner skill set. Iterating over each string in a bash loop to parse it out is impractical, as the file has 20 million+ entries.

Can anyone help?

  • If your real data has multiple columns then show multiple columns in your sample input/output. It matters when considering and testing potential solutions. Commented Nov 16, 2019 at 18:04

1 Answer


As a first step, assume your CSV has only one column (this makes the complete solution below easier to follow):

One column:

You can use this regex:

sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' test.csv

Explanations:

  • -r: use extended regular expressions (avoids having to escape parentheses and the plus + symbol)
  • ^[^a-zA-Z]*: skip any non-alpha characters at the beginning, ...
  • ([a-zA-Z]+[^&;]+) ... then capture at least one alpha character followed by a sequence of any characters except ampersand & and semi-colon ; ...
  • .*$ ... and skip any remaining characters until the end of the line (if any remain, they must begin with either an ampersand or a semi-colon, since sed pattern matching is greedy, i.e. it tries to match the longest sequence) ...
  • \1 ... and replace the whole matched text (the entire line, since the regex covers it) with the captured sequence.

Working example:

$ sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' << 'EOF'
> ABD_1&SC;1233;5665;123445
> 120585_AOP9_AIDMS3&SC;0947;64820;0173
> EOF
ABD_1
AOP9_AIDMS3
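
For a file with millions of rows it can help to wrap the substitution in a small filter and stream the data through it. The sketch below is only an illustration: the function name extract_name and the output file name are made up here, and it uses -E instead of -r on the assumption that your sed accepts it (GNU and BSD sed both do).

```shell
# extract_name: hypothetical helper name, not part of the answer above.
# Reads lines on stdin and keeps only the text from the first letter up to
# (not including) the first '&' or ';'.
extract_name() {
  sed -E 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/'
}

# Usage (output file name is an assumption):
#   extract_name < test.csv > test.cleaned.csv
```

Because sed processes its input line by line, this does not need to hold the whole file in memory.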

Multiple columns:

It looks like you want to process a specific column. If you want to process the n-th column, you can use this regex, which is based on the previous:

sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/'
  • ^(([^,]+,){<n-1>}) captures the first n-1 columns; replace <n-1> with the actual number (0 works for the first column), and then ...
  • [^a-zA-Z]*([a-zA-Z]+[^&;,]+) skips any leading non-alpha characters, then captures at least one alpha character followed by a sequence of any characters except ampersand &, semi-colon ; or comma, then ...
  • [^,]* ... skips any remaining characters that are not a comma ...
  • (.*)$ ... and captures the remaining columns, i.e. the rest of the line; since every non-comma character was already skipped, this sequence, if it exists, must begin with a comma; finally ...
  • \1\3\4 ... replaces the whole matched text (the entire line, since the regex covers it) with the following captured sequences:
    • \1 : the first n-1 columns (\2 is nested inside it)
    • \3 : the text we want to keep from the n-th column
    • \4 : remaining columns if any
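
To avoid editing the repetition count by hand, the column number can be passed as a parameter. This is a minimal sketch, not a definitive implementation: clean_column is a hypothetical name, -E is used in place of -r, and it inherits the regex's assumptions (no quoted fields containing commas, and the columns before the target are non-empty because [^,]+ needs at least one character).

```shell
# clean_column: hypothetical helper; $1 is the 1-based column to rewrite.
clean_column() {
  n="$1"
  skip=$((n - 1))   # number of leading columns copied through unchanged
  # Double quotes let ${skip} expand; \$ and \\1 keep sed's own syntax intact.
  sed -E "s/^(([^,]+,){${skip}})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)\$/\\1\\3\\4/"
}

# Usage: clean_column 3 < input.csv > output.csv
```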

Working example (it processes the third column):

$ sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/' << 'EOF'
plaf,plafy,ABD_1&SC;1233;5665;123445,plet
trouf,troufi,120585_AOP9_AIDMS3&SC;0947;64820;0173,plot
EOF
plaf,plafy,ABD_1,plet
trouf,troufi,AOP9_AIDMS3,plot
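
Since the question mentions awk, here is a sketch of an awk alternative to the sed command; it is a different technique, not the one used above. It assumes a plain comma-separated file without quoted fields, and the function name clean_column_awk is made up for illustration. It relies on awk's match(), which sets RSTART and RLENGTH, instead of the lookahead from the question.

```shell
# clean_column_awk: hypothetical helper; $1 is the 1-based column to rewrite.
clean_column_awk() {
  awk -F, -v OFS=, -v col="$1" '
    {
      # match() finds the leftmost-longest run that starts with a letter
      # and stops before the first "&" or ";" in the chosen field.
      if (match($col, /[A-Za-z][^&;]*/))
        $col = substr($col, RSTART, RLENGTH)
      print
    }'
}

# Usage: clean_column_awk 3 < input.csv > output.csv
```

Fields with no letter at all are left unchanged, and the other columns are written back joined by commas.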