Hello, I have strings such as:

liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']

and I would like to split them at the underscore between two numbers (Number_Number). I tried:

for i in liste_to_split:
 i.split(r'(?<=[0-9])_')

and I got

['NW_011625257.1_0']
['scaffold1_3']
['scaffold3']

instead of

['NW_011625257.1'] ['0']
['scaffold1'] ['3']
['scaffold3']

Does someone know where the issue is?

2 Answers

You may use:

>>> import re
>>> liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']
>>> 
>>> for i in liste_to_split:
...     re.split(r'(?<=[0-9])_', i)
...
['NW_011625257.1', '0']
['scaffold1', '3']
['scaffold3']

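Note the use of re.split instead of str.split: str.split treats its argument as a literal separator rather than a regular expression, so the pattern in the question never matches. The _ is placed outside the lookbehind assertion so that the split consumes the underscore itself and we are not splitting on a zero-width match.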
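To make the difference concrete, here is a minimal sketch that runs both calls on the list from the question. The helper name split_after_digit and the precompiled DIGIT_UNDERSCORE pattern are hypothetical and only there for illustration.

```python
import re

# Hypothetical names, introduced only for this illustration.
DIGIT_UNDERSCORE = re.compile(r'(?<=[0-9])_')

def split_after_digit(text):
    """Split text at each underscore that directly follows a digit."""
    return DIGIT_UNDERSCORE.split(text)

liste_to_split = ['NW_011625257.1_0', 'scaffold1_3', 'scaffold3']

# str.split searches for the literal characters "(?<=[0-9])_", which never
# occur in the data, so every string comes back as a one-element list.
literal_results = [s.split(r'(?<=[0-9])_') for s in liste_to_split]

# re.split interprets the same text as a regular expression.
regex_results = [split_after_digit(s) for s in liste_to_split]

print(literal_results)
print(regex_results)
```

The underscore in 'NW_' is left alone because the character before it is a letter, not a digit.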


Based on the OP's comment below, it seems the OP wants to do this splitting on a dataframe column.

Assuming this is your dataframe:

>>> print (df)
             column
0  NW_011625257.1_0
1       scaffold1_3
2         scaffold3

Then you can use:

>>> print (df['column'].str.split(r'(?<=[0-9])_', expand=True))
                0     1
0  NW_011625257.1     0
1       scaffold1     3
2       scaffold3  None
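If pandas is not available, a similar table shape can be approximated with the standard library. This is only a sketch of the idea, not how pandas implements expand=True; the function name split_rows_padded is hypothetical.

```python
import re

def split_rows_padded(values, pattern=r'(?<=[0-9])_'):
    """Split each string with re.split and pad shorter rows with None."""
    rows = [re.split(pattern, v) for v in values]
    # Width of the widest row; 0 if the input is empty.
    width = max((len(r) for r in rows), default=0)
    return [r + [None] * (width - len(r)) for r in rows]

rows = split_rows_padded(['NW_011625257.1_0', 'scaffold1_3', 'scaffold3'])
for row in rows:
    print(row)
```

Each row ends up with the same number of entries, with None filling the missing pieces, much like the None in the last row of the output above.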

5 Comments

And what if it is on a dataframe? Can I still use re? For example: df['column'].re.split(i.split(r'(?<=[0-9])_')
Since you haven't shown a dataframe, the answer is based on the code provided in the question. But yes, a dataframe also has a split, in a different form. Please update your question so that I can help further.
You can use: df['column'].str.split(r'(?<=[0-9])_') or check the updated answer.
Please reclose the question, it is still a dupe of stackoverflow.com/questions/48919003/pandas-split-on-regex?rq=1 and as written now, still a dupe of stackoverflow.com/questions/13209288/…
I rarely reopen dupe questions, but this was one of them due to the mismatch between the title and the question body. I have already asked the OP to edit the question. Once the question is edited I will revisit to check the dupe against this link.
1
l=['NW_011625257.1_0','scaffold1_3','scaffold3']

for i in l:
  f = i.split('_')
  print(f) 

output

['NW', '011625257.1', '0']
['scaffold1', '3']
['scaffold3']
