I have a dataset with a column that contains values like these:

0      TCGA-A2-A0T2
1      TCGA-A2-A0CM
2      TCGA-BH-A18V
3      TCGA-BH-A18Q
4      TCGA-BH-A0E0

However, I want to change it to:

A0T2
A0CM
A18V
A18Q
A0E0

I have tried code such as

df1['Complete TCGA ID'].str.extract('TCAG-(.*)-.*')

But it returns only NaN. I can't work out the right regular expression for this case. Can anyone please help? Thanks so much in advance!

2 Answers

2

You are looking for a pattern that captures everything after the last hyphen:

df1['new_column'] = df1['Complete TCGA ID'].str.extract(r'-([^-]+)$')

See a demo on regex101.com.
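To see the pattern in context, here is a minimal sketch that rebuilds the sample column from the question and applies it. The DataFrame construction is only for illustration, and `expand=False` is an addition not in the line above; it makes `str.extract` return a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample values copied from the question; the frame is reconstructed for illustration.
df1 = pd.DataFrame({
    "Complete TCGA ID": [
        "TCGA-A2-A0T2",
        "TCGA-A2-A0CM",
        "TCGA-BH-A18V",
        "TCGA-BH-A18Q",
        "TCGA-BH-A0E0",
    ]
})

# -([^-]+)$ : a hyphen, then one or more non-hyphen characters up to the end.
# Only the last segment can satisfy this, because [^-] cannot cross a hyphen.
df1["new_column"] = df1["Complete TCGA ID"].str.extract(r"-([^-]+)$", expand=False)

print(df1["new_column"].tolist())
# ['A0T2', 'A0CM', 'A18V', 'A18Q', 'A0E0']
```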

2

The prefix in your pattern should be TCGA, not TCAG; because of that typo nothing matches. Once that is fixed, you can match up to the last - and capture the rest in group 1.

TCGA.*-(.*)

Regex demo
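A small sketch with Python's `re` module may help show why the position of the capture group matters. It compares the pattern above with the question's pattern after fixing only the typo; the variable names are made up for this example.

```python
import re

ids = ["TCGA-A2-A0T2", "TCGA-BH-A18V"]

# The greedy .* first runs to the end of the string, then backtracks to the
# last hyphen, so group 1 holds only the final segment.
last_segment = [re.search(r"TCGA.*-(.*)", s).group(1) for s in ids]

# The question's pattern with only the typo fixed. Here the capture group
# sits before the second hyphen, so it ends up holding the middle segment.
middle_segment = [re.search(r"TCGA-(.*)-.*", s).group(1) for s in ids]

print(last_segment)    # ['A0T2', 'A18V']
print(middle_segment)  # ['A2', 'BH']
```

So fixing the typo alone would stop the NaN results but would still return the wrong part of the ID.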

Or a bit more precise match for the example data:

^TCGA-[A-Z0-9]+-([A-Z0-9]+)$

Regex demo
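As a rough illustration of what the stricter pattern buys you, the sketch below wraps it in a helper (the name `last_segment` and the malformed inputs are hypothetical, not from the question). Values that do not have exactly the `TCGA-XX-XXXX` shape are rejected instead of being partially matched.

```python
import re

PATTERN = re.compile(r"^TCGA-[A-Z0-9]+-([A-Z0-9]+)$")

def last_segment(tcga_id):
    """Return the final segment of a well-formed ID, or None if it does not match."""
    m = PATTERN.search(tcga_id)
    return m.group(1) if m else None

print(last_segment("TCGA-A2-A0T2"))  # A0T2
print(last_segment("TCGA-A2"))       # None (hypothetical malformed ID: one segment missing)
print(last_segment("tcga-a2-a0t2"))  # None (lowercase is outside [A-Z0-9])
```

The same pattern string can be passed to `str.extract`, where non-matching rows come back as NaN.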
