
I have the following dataframe where I want to split the Col2 into multiple columns:

Input DataFrame:

>>> mydf = pd.DataFrame({'Col1': ['AA','AB','AAC'], 'Col2': ['AN||Ind(0.9)','LN||RED(8.9)','RN||RED(9.8)'], 'Col3': ['log1','log2','log3']})

>>> mydf
  Col1          Col2  Col3
0   AA  AN||Ind(0.9)  log1
1   AB  LN||RED(8.9)  log2
2  AAC  RN||RED(9.8)  log3

Desired DataFrame:

  Col1  Col2 Col3  Col4  Col5
0   AA   AN  log1  Ind   0.9
1   AB   LN  log2  RED   8.9
2  AAC   RN  log3  RED   9.8

I started with apply, but the following approach will take a good few steps. Is there a shortcut?

mydf['Col4'] = mydf['Col2'].apply(lambda x: str(x).split('||')[0])
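For comparison, carrying the apply approach through for every column really does take one step per column (a sketch, with the split logic inferred from the desired output):

```python
import pandas as pd

mydf = pd.DataFrame({'Col1': ['AA', 'AB', 'AAC'],
                     'Col2': ['AN||Ind(0.9)', 'LN||RED(8.9)', 'RN||RED(9.8)'],
                     'Col3': ['log1', 'log2', 'log3']})

# one apply per new column; Col2 must be overwritten last
mydf['Col4'] = mydf['Col2'].apply(lambda x: str(x).split('||')[1].split('(')[0])
mydf['Col5'] = mydf['Col2'].apply(lambda x: str(x).split('(')[1].rstrip(')'))
mydf['Col2'] = mydf['Col2'].apply(lambda x: str(x).split('||')[0])
```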

I'm also a little confused about why the following throws a ValueError:

mydf['Col2'].str.split('||', expand=True)

ValueError: split() requires a non-empty pattern match.
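(A likely cause, as an aside: when pat is more than one character, str.split treats it as a regular expression, and as a regex '||' only matches the empty string, which older Python/pandas versions reject. Escaping the pipes splits on the literal two-character separator:)

```python
import pandas as pd

s = pd.Series(['AN||Ind(0.9)', 'LN||RED(8.9)'])

# escape the pipes so the regex matches the literal '||' separator
out = s.str.split(r'\|\|', expand=True)
```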

3 Answers


You can split out the columns with str.extract and assign:

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
mydf.assign(**mydf.Col2.str.extract(regex, expand=True).to_dict('list'))

  Col1 Col2  Col3 Col4 Col5
0   AA   AN  log1  Ind  0.9
1   AB   LN  log2  RED  8.9
2  AAC   RN  log3  RED  9.8

Or, equivalently, with combine_first:

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
mydf.Col2.str.extract(regex, expand=True).combine_first(mydf)

  Col1 Col2  Col3 Col4 Col5
0   AA   AN  log1  Ind  0.9
1   AB   LN  log2  RED  8.9
2  AAC   RN  log3  RED  9.8

Explanation

This uses a regular expression to parse the Col2 values and assign column names at the same time:

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
  • '(?P<Col2>.*)\|{2,}' will grab everything up to the first double | and call it Col2
  • '(?P<Col4>.*)' grabs everything up to the parentheses and calls it Col4
  • '\((?P<Col5>.*)\)' grabs everything inside the parentheses and calls it Col5
  • finally, we either reassign Col2, overwriting the existing column, or we use combine_first, which defaults to the newly extracted Col2 values.
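The grouping can be checked on a single value with the standard re module (named groups are what str.extract turns into column names):

```python
import re

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
m = re.match(regex, 'AN||Ind(0.9)')
# named groups become the new column names in str.extract
print(m.groupdict())  # {'Col2': 'AN', 'Col4': 'Ind', 'Col5': '0.9'}
```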



@piRSquared's answer is amazing as usual and upvoted; I am just posting my approach, which I kept very simple:

mydf[['Col2', 'Col4', 'Col5']] = mydf.Col2.str.extract(r'(.*?)\|\|(.*?)\((.*?)\)', expand=True)

Col2 is automatically reassigned so no need to drop a column later.

    Col1    Col2    Col3    Col4    Col5
0   AA      AN      log1    Ind     0.9
1   AB      LN      log2    RED     8.9
2   AAC     RN      log3    RED     9.8
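One caveat worth noting (not mentioned in the original answer): str.extract returns string (object) columns, so Col5 will not be numeric. A sketch of converting it afterwards:

```python
import pandas as pd

mydf = pd.DataFrame({'Col1': ['AA', 'AB', 'AAC'],
                     'Col2': ['AN||Ind(0.9)', 'LN||RED(8.9)', 'RN||RED(9.8)'],
                     'Col3': ['log1', 'log2', 'log3']})
mydf[['Col2', 'Col4', 'Col5']] = mydf.Col2.str.extract(r'(.*?)\|\|(.*?)\((.*?)\)', expand=True)

# extracted values are strings; convert Col5 to a numeric column
mydf['Col5'] = pd.to_numeric(mydf['Col5'])
```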



Using the RegEx from the great @piRSquared's solution:

In [59]: regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'

In [60]: mydf = mydf.join(mydf.pop('Col2').str.extract(regex, expand=True)) \
                    .sort_index(axis=1)

In [61]: mydf
Out[61]:
  Col1 Col2  Col3 Col4 Col5
0   AA   AN  log1  Ind  0.9
1   AB   LN  log2  RED  8.9
2  AAC   RN  log3  RED  9.8
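As an aside on the pop idiom (a minimal sketch, not part of the answer): pop removes the column from the frame and returns it as a Series, so the subsequent join adds only the freshly extracted columns and cannot collide with an existing Col2:

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2]})
b = df.pop('B')      # 'B' is removed from df and returned as a Series
joined = df.join(b)  # joining on the index puts it back
```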

3 Comments

  • There's that pop again :-)
  • @piRSquared, sorry about that! 8-D
  • There are a few very natural methods that escape my memory when I could use them. I'd like to cement pop in my mind so that it becomes part of the solution space as my brain spins around looking for an answer.
