
I have the following dataframe where I want to split the Col2 into multiple columns:

Input DataFrame:

>>> mydf = pd.DataFrame({'Col1': ['AA','AB','AAC'], 'Col2': ['AN||Ind(0.9)','LN||RED(8.9)','RN||RED(9.8)'], 'Col3': ['log1','log2','log3']})

>>> mydf
  Col1          Col2  Col3
0   AA  AN||Ind(0.9)  log1
1   AB  LN||RED(8.9)  log2
2  AAC  RN||RED(9.8)  log3

Desired DataFrame:

  Col1  Col2 Col3  Col4  Col5
0   AA   AN  log1  Ind   0.9
1   AB   LN  log2  RED   8.9
2  AAC   RN  log3  RED   9.8

I started with apply, but the following approach will take a good few steps. Is there a shortcut?

mydf['Col4'] = mydf['Col2'].apply(lambda x: str(x).split('||')[0])
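For comparison, carrying the apply approach through for every column really does take one step per column (a sketch, with the split logic inferred from the desired output):

```python
import pandas as pd

mydf = pd.DataFrame({'Col1': ['AA', 'AB', 'AAC'],
                     'Col2': ['AN||Ind(0.9)', 'LN||RED(8.9)', 'RN||RED(9.8)'],
                     'Col3': ['log1', 'log2', 'log3']})

# one apply per new column; Col2 must be overwritten last
mydf['Col4'] = mydf['Col2'].apply(lambda x: str(x).split('||')[1].split('(')[0])
mydf['Col5'] = mydf['Col2'].apply(lambda x: str(x).split('(')[1].rstrip(')'))
mydf['Col2'] = mydf['Col2'].apply(lambda x: str(x).split('||')[0])
```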

I'm also a little confused about why the following throws a ValueError:

mydf['Col2'].str.split('||', expand=True)

ValueError: split() requires a non-empty pattern match.
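(A likely cause, as an aside: when pat is more than one character, str.split treats it as a regular expression, and as a regex '||' only matches the empty string, which older Python/pandas versions reject. Escaping the pipes splits on the literal two-character separator:)

```python
import pandas as pd

s = pd.Series(['AN||Ind(0.9)', 'LN||RED(8.9)'])

# escape the pipes so the regex matches the literal '||' separator
out = s.str.split(r'\|\|', expand=True)
```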

3 Answers


You can split out the columns with str.extract and assign:

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
mydf.assign(**mydf.Col2.str.extract(regex, expand=True).to_dict('list'))

  Col1 Col2  Col3 Col4 Col5
0   AA   AN  log1  Ind  0.9
1   AB   LN  log2  RED  8.9
2  AAC   RN  log3  RED  9.8

Or, equivalently, with combine_first:

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
mydf.Col2.str.extract(regex, expand=True).combine_first(mydf)

  Col1 Col2  Col3 Col4 Col5
0   AA   AN  log1  Ind  0.9
1   AB   LN  log2  RED  8.9
2  AAC   RN  log3  RED  9.8

Explanation

This uses a regular expression to parse the Col2 values and assign column names at the same time:

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
  • '(?P<Col2>.*)\|{2,}' will grab everything up to the first double | and call it Col2
  • '(?P<Col4>.*)' grabs everything up to the parentheses and calls it Col4
  • '\((?P<Col5>.*)\)' grabs everything inside the parentheses and calls it Col5
  • finally, we either reassign Col2, overwriting the existing column, or we use combine_first, which defaults to the newly extracted Col2 values.
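The grouping can be checked on a single value with the standard re module (named groups are what str.extract turns into column names):

```python
import re

regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
m = re.match(regex, 'AN||Ind(0.9)')
# named groups become the new column names in str.extract
print(m.groupdict())  # {'Col2': 'AN', 'Col4': 'Ind', 'Col5': '0.9'}
```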



@piRSquared's answer is amazing as usual and upvoted; I am just posting my approach, which I kept very simple:

mydf[['Col2', 'Col4', 'Col5']] = mydf.Col2.str.extract(r'(.*?)\|\|(.*?)\((.*?)\)', expand=True)

Col2 is automatically reassigned so no need to drop a column later.

    Col1    Col2    Col3    Col4    Col5
0   AA      AN      log1    Ind     0.9
1   AB      LN      log2    RED     8.9
2   AAC     RN      log3    RED     9.8
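One caveat worth noting (not mentioned in the original answer): str.extract returns string (object) columns, so Col5 will not be numeric. A sketch of converting it afterwards:

```python
import pandas as pd

mydf = pd.DataFrame({'Col1': ['AA', 'AB', 'AAC'],
                     'Col2': ['AN||Ind(0.9)', 'LN||RED(8.9)', 'RN||RED(9.8)'],
                     'Col3': ['log1', 'log2', 'log3']})
mydf[['Col2', 'Col4', 'Col5']] = mydf.Col2.str.extract(r'(.*?)\|\|(.*?)\((.*?)\)', expand=True)

# extracted values are strings; convert Col5 to a numeric column
mydf['Col5'] = pd.to_numeric(mydf['Col5'])
```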



Using the RegEx from the great @piRSquared's solution:

In [59]: regex = r'(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'

In [60]: mydf = mydf.join(mydf.pop('Col2').str.extract(regex, expand=True)) \
                    .sort_index(axis=1)

In [61]: mydf
Out[61]:
  Col1 Col2  Col3 Col4 Col5
0   AA   AN  log1  Ind  0.9
1   AB   LN  log2  RED  8.9
2  AAC   RN  log3  RED  9.8
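As an aside on the pop idiom (a minimal sketch, not part of the answer): pop removes the column from the frame and returns it as a Series, so the subsequent join adds only the freshly extracted columns and cannot collide with an existing Col2:

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2]})
b = df.pop('B')      # 'B' is removed from df and returned as a Series
joined = df.join(b)  # joining on the index puts it back
```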

3 Comments

  • There's that pop again :-)
  • @piRSquared, sorry about that! 8-D
  • There are a few very natural methods that escape my memory when I could use them. I'd like to cement pop in my mind so that it becomes part of the solution space as my brain spins around looking for an answer.
