Hello, I have a dataframe such as:

COL1 
scaffold_6202_0_5660-8393_+__Apis_cerana
scaffold_27087_2-HSPs_+__Canis_lupus
LBMM01007576.1_2-HSPs_-__Lasius_niger
NW_019416736.1_1_2-HSPs_-__Cattus_felis
KQ415617.1_114142-115354_+__SPO_E
UXGB01011990.1_1481-2897_-__Apis_mellifera
CM010866.1_742312-745306_-__Cuniculus_griseus
scaffold_10628_4264-5914_-__Rattus_rattus 
IDBA_scaffold30_1_30-466_+__SP_A
IDBA_scaffold43_30-466_+__SP_B

and I would like to use a regex to extract only the leading part of each value, that is, the part marked below:

[part to extract]_Number-HSPs_* or, if there is no HSPs pattern, [part to extract]_Number-Number_*

and save it into COL2. Here I should get:

COL1                                          COL2
scaffold_6202_0_5660-8393_+__Apis_cerana      scaffold_6202_0
scaffold_27087_2-HSPs_+__Canis_lupus          scaffold_27087
LBMM01007576.1_2-HSPs_-__Lasius_niger         LBMM01007576.1
NW_019416736.1_1_2-HSPs_-__Cattus_felis       NW_019416736.1_1
KQ415617.1_114142-115354_+__SPO_E             KQ415617.1
UXGB01011990.1_1481-2897_-__Apis_mellifera    UXGB01011990.1
CM010866.1_742312-745306_-__Cuniculus_griseus CM010866.1
scaffold_10628_4264-5914_-__Rattus_rattus     scaffold_10628
IDBA_scaffold30_1_30-466_+__SP_A              IDBA_scaffold30_1
IDBA_scaffold43_30-466_+__SP_B                IDBA_scaffold43

So far I have tried:

import re

df['COL2'] = re.sub(r"_[^0-9]*-Number_", "", df['COL1'])

2 Answers


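For the example data, you can also match word characters and dots greedily, up to the last underscore that still allows a match. Because \w also matches an underscore, the character class runs until the first character that is neither a word character nor a dot (here the hyphen), and the engine then backtracks to the nearest preceding underscore.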

^([\w.]+)_

df['COL2'] = df["COL1"].str.extract(r'^([\w.]+)_')
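As a quick check outside pandas, the same pattern can be run against the sample values with the standard-library re module. This is only a minimal sketch: the names samples, PATTERN and prefix are illustrative and not part of the answer, and it assumes that str.extract applies the regex to each row in the same way.

```python
import re

# Sample values copied from COL1 in the question.
samples = [
    "scaffold_6202_0_5660-8393_+__Apis_cerana",
    "scaffold_27087_2-HSPs_+__Canis_lupus",
    "LBMM01007576.1_2-HSPs_-__Lasius_niger",
    "NW_019416736.1_1_2-HSPs_-__Cattus_felis",
    "KQ415617.1_114142-115354_+__SPO_E",
    "UXGB01011990.1_1481-2897_-__Apis_mellifera",
    "CM010866.1_742312-745306_-__Cuniculus_griseus",
    "scaffold_10628_4264-5914_-__Rattus_rattus",
    "IDBA_scaffold30_1_30-466_+__SP_A",
    "IDBA_scaffold43_30-466_+__SP_B",
]

# Greedy [\w.]+ stops at the first hyphen, then backtracks to the
# nearest preceding underscore so that the trailing "_" can match.
PATTERN = re.compile(r"^([\w.]+)_")


def prefix(value):
    """Return the captured prefix, or None if the pattern does not match."""
    match = PATTERN.match(value)
    return match.group(1) if match else None


results = [prefix(s) for s in samples]
for sample, result in zip(samples, results):
    print(f"{sample:<48}{result}")
```

In pandas, str.extract with a single capture group returns a one-column DataFrame by default; passing expand=False returns a Series instead, which may be more convenient when assigning to a single column.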

1 Comment

Very clever, ++

Using str.extract:

df["COL2"] = df["COL1"].str.extract(r'(^.*?(?=_[^_-]+-\w+))')
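To see how the lookahead behaves, the pattern can be tried with the standard-library re module on a few of the sample values. This is a hedged sketch: the names LOOKAHEAD, GREEDY and extract are illustrative, and the value "contig-7_10-20_+__SP_X" is a hypothetical input that does not appear in the question, included only to show where this pattern and the greedy ^([\w.]+)_ pattern from the other answer can differ.

```python
import re

# Lazy match from the start, stopping before the first "_" that is
# followed by a hyphen-free, underscore-free token and then "-".
LOOKAHEAD = re.compile(r"(^.*?(?=_[^_-]+-\w+))")

# Greedy pattern from the other answer, for comparison only.
GREEDY = re.compile(r"^([\w.]+)_")


def extract(pattern, value):
    """Return group 1 of the first match, or None if there is no match."""
    match = pattern.search(value)
    return match.group(1) if match else None


# Values taken from the question.
a = extract(LOOKAHEAD, "NW_019416736.1_1_2-HSPs_-__Cattus_felis")
b = extract(LOOKAHEAD, "scaffold_6202_0_5660-8393_+__Apis_cerana")
c = extract(LOOKAHEAD, "KQ415617.1_114142-115354_+__SPO_E")

# Hypothetical value (not in the question) with a hyphen inside the ID.
hypothetical = "contig-7_10-20_+__SP_X"
d = extract(LOOKAHEAD, hypothetical)  # lookahead pattern still finds a prefix
e = extract(GREEDY, hypothetical)     # greedy pattern finds no match here

print(a, b, c, d, e)
```

For the data shown in the question both patterns give the same result; they would only diverge on inputs like the hypothetical one, where the identifier itself contains a hyphen.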

4 Comments

I get the error message: ValueError: pattern contains no capture groups
The answer should work, check the demo.
@Grendel Sorry, I left out the mandatory capture group. Please try it again.
It works now, thank you for your time and help! :-)
