regex in string pandas (split)

Question

Hello, I have strings such as:

liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']

and I would like to split them at the underscore in each Number_Number pattern. I tried:

for i in liste_to_split:
 i.split(r'(?<=[0-9])_')

and I got

['NW_011625257.1_0']
['scaffold1_3']
['scaffold3']

instead of

['NW_011625257.1'] ['0']
['scaffold1'] ['3']
['scaffold3']

Does anyone know where the issue is?

This is a duplicate of stackoverflow.com/questions/48919003/pandas-split-on-regex?rq=1 and stackoverflow.com/questions/13209288/…
– Wiktor Stribiżew, Commented Oct 28, 2020 at 10:47

anubhava · Accepted Answer · 2020-10-28 10:44:02Z

2

You may use:

>>> import re
>>> liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']
>>> 
>>> for i in liste_to_split:
...     re.split(r'(?<=[0-9])_', i)
...
['NW_011625257.1', '0']
['scaffold1', '3']
['scaffold3']

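Note the use of re.split instead of str.split: str.split treats its argument as a literal separator, not a regular expression, so the pattern is never matched. The _ is placed outside the lookbehind assertion so that the split consumes the underscore itself rather than matching a zero-width position.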
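To make the difference concrete, here is a minimal sketch that runs both calls side by side on the list from the question. The names PATTERN, names, literal_results and regex_results are chosen for illustration only and do not appear in the original post.

```python
import re

# Underscore that is immediately preceded by a digit (pattern from the answer).
PATTERN = r'(?<=[0-9])_'
names = ['NW_011625257.1_0', 'scaffold1_3', 'scaffold3']

# str.split searches for the literal text "(?<=[0-9])_", which never occurs,
# so every string comes back unsplit.
literal_results = [name.split(PATTERN) for name in names]

# re.split interprets the same text as a regular expression.
regex_results = [re.split(PATTERN, name) for name in names]

print(literal_results)
print(regex_results)
```

The first underscore in 'NW_011625257.1_0' follows a letter, so the lookbehind rejects it and only the last underscore is used as a split point.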

Based on the OP's comment below, it seems the OP wants to do this splitting on a dataframe column.

Assuming this is your dataframe:

>>> print (df)
             column
0  NW_011625257.1_0
1       scaffold1_3
2         scaffold3

Then you can use:

>>> print (df['column'].str.split(r'(?<=[0-9])_', expand=True))
                0     1
0  NW_011625257.1     0
1       scaffold1     3
2       scaffold3  None
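For a self-contained version of the dataframe case, the sketch below builds the example frame and applies the same split. It assumes pandas is installed; the result column names 'name' and 'suffix' are hypothetical labels added for readability and are not part of the original answer.

```python
import pandas as pd

# Rebuild the example dataframe shown above.
df = pd.DataFrame({'column': ['NW_011625257.1_0', 'scaffold1_3', 'scaffold3']})

# Split on an underscore preceded by a digit; expand=True returns a DataFrame
# with one column per piece instead of a Series of lists.
parts = df['column'].str.split(r'(?<=[0-9])_', expand=True)

# Hypothetical column labels, for illustration only.
parts.columns = ['name', 'suffix']

print(parts)
```

Rows that contain no matching underscore, such as scaffold3, get a missing value in the second column.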

edited Oct 28, 2020 at 10:44

answered Oct 28, 2020 at 10:27

anubhava


5 Comments

chippycentra Over a year ago

And what if it is in a dataframe? Can I still use re? For example: df['column'].re.split(i.split(r'(?<=[0-9])_')

anubhava Over a year ago

Since you haven't shown a dataframe, the answer is based on the code provided in the question. But yes, a dataframe also has a split method, in a different form. Please update your question so that I can help further.

anubhava Over a year ago

You can use: df['column'].str.split(r'(?<=[0-9])_') or check updated answer.

Wiktor Stribiżew Over a year ago

Please reclose the question, it is still a dupe of stackoverflow.com/questions/48919003/pandas-split-on-regex?rq=1 and as written now, still a dupe of stackoverflow.com/questions/13209288/…

anubhava Over a year ago

I rarely reopen dupe questions, but this was one of them because of the mismatch between the title and the question body. I have already asked the OP to edit the question. Once the question is edited, I will revisit it to check whether it is a dupe of this link.

user13824946 · Accepted Answer · 2020-10-28 10:39:14Z

1

l=['NW_011625257.1_0','scaffold1_3','scaffold3']

for i in l:
  f = i.split('_')
  print(f)

output

['NW', '011625257.1', '0']
['scaffold1', '3']
['scaffold3']

answered Oct 28, 2020 at 10:39

user13824946
