return pandas dataframe column with substrings of another column

Question

I have a pandas dataframe that looks like df and I want to add a column so it looks like df2.

import pandas as pd
df =pd.DataFrame({'Alternative' : ['a_x_17MAR2016_Collectedran30dom', 'b_17MAR2016_CollectedStuff', 'c_z_k_17MAR2016_Collectedan3dom'], 'Values': [34, 65, 7]})

df2 = pd.DataFrame({'Alternative' : ['a_x_17MAR2016_Collectedran30dom', 'b_17MAR2016_CollectedStuff', 'c_z_k_17MAR2016_Collectedan3dom'], 'Values': [34, 65, 7], 'Alts': ['a x 17MAR2016', 'b 17MAR2016', 'c z k 17MAR2016']})

df
Out[4]: 
                       Alternative  Values
0  a_x_17MAR2016_Collectedran30dom      34
1       b_17MAR2016_CollectedStuff      65
2  c_z_k_17MAR2016_Collectedan3dom       7

df2
Out[5]: 
                       Alternative             Alts  Values
0  a_x_17MAR2016_Collectedran30dom    a x 17MAR2016      34
1       b_17MAR2016_CollectedStuff      b 17MAR2016      65
2  c_z_k_17MAR2016_Collectedan3dom  c z k 17MAR2016       7

In other words, I have a string of varying length that I can split on an underscore delimiter. I want to split it, then rejoin the pieces with spaces, dropping the piece that contains the substring 'Collected' and everything after it.

I can locate the index of the string containing the substring 'Collected' in an individual list as I found here and then combine the other strings, but I cannot seem to do it in a very 'pythonic' way across all of the dataframe.

Thanks in advance

Ted Petrou · Accepted Answer · 2016-12-07 19:47:24Z

2

I believe this technically answers the question but does not match the desired output, as the date does not contain the word 'Collected'.

df.Alternative.str.replace('_[^_]*Collected.*', '', regex=True).str.replace('_', ' ')

Output

0      a x 17MAR2016
1        b 17MAR2016
2    c z k 17MAR2016
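
Note that Series.str.replace only interprets the pattern as a regular expression when regex=True is passed (the default changed to False in pandas 2.0), so the flag matters for this approach. A minimal self-contained sketch, assuming the question's sample data and taking the column name Alts from the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    'Alternative': ['a_x_17MAR2016_Collectedran30dom',
                    'b_17MAR2016_CollectedStuff',
                    'c_z_k_17MAR2016_Collectedan3dom'],
    'Values': [34, 65, 7],
})

# '_[^_]*Collected.*' matches the underscore that starts the token
# containing 'Collected', the rest of that token, and everything after it.
df['Alts'] = (
    df['Alternative']
    .str.replace(r'_[^_]*Collected.*', '', regex=True)  # regex pattern
    .str.replace('_', ' ', regex=False)                 # literal underscore
)
```

Passing regex explicitly on both calls keeps the behaviour the same across pandas versions.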

edited Dec 7, 2016 at 19:47

answered Dec 7, 2016 at 19:36

1 Comment

Jeff Tilton Over a year ago

Sorry for the confusion, but I feel this answers the question and uses pandas without the need to import re. I appreciate the help

piRSquared · Accepted Answer · 2016-12-08 05:23:54Z

2

Use str.split:

alts = df.Alternative.str.split('_').str[:-1].str.join(' ')
df.insert(1, 'Alts', alts)
df
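
The slice str[:-1] drops the last underscore-separated token whatever it contains, which fits the sample data because the 'Collected' token is always last. If that is not guaranteed, a sketch closer to the question's wording (stop at the first token containing 'Collected') might look like the following. The name tokens_before is a hypothetical helper, not part of pandas:

```python
def tokens_before(text, marker='Collected', sep='_'):
    """Join the sep-separated tokens that precede the first token containing marker."""
    kept = []
    for token in text.split(sep):
        if marker in token:
            break  # drop this token and everything after it
        kept.append(token)
    return ' '.join(kept)


# Sample strings taken from the question.
samples = [
    'a_x_17MAR2016_Collectedran30dom',
    'b_17MAR2016_CollectedStuff',
    'c_z_k_17MAR2016_Collectedan3dom',
]
result = [tokens_before(s) for s in samples]
```

With the question's frame this could be applied as df.Alternative.map(tokens_before). A string with no 'Collected' token is returned whole, with underscores turned into spaces.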

answered Dec 8, 2016 at 5:23

joel.wilson · Accepted Answer · 2016-12-07 19:45:47Z

0

import re
x = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x))
# x
#0      a_x_17MAR2016
#1        b_17MAR2016
#2    c_z_k_17MAR2016

y = x.str.split("_")
#0       [a, x, 17MAR2016]
#1          [b, 17MAR2016]
#2    [c, z, k, 17MAR2016] 

df['newcol'] = y.apply(lambda z: ' '.join(z))
#                       Alternative  Values           newcol
#0  a_x_17MAR2016_Collectedran30dom      34    a x 17MAR2016
#1       b_17MAR2016_CollectedStuff      65      b 17MAR2016
#2  c_z_k_17MAR2016_Collectedan3dom       7  c z k 17MAR2016

All in one line:

import re
df['newcol'] = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x)).str.split("_").apply(lambda z: ' '.join(z))

#                       Alternative  Values           newcol
#0  a_x_17MAR2016_Collectedran30dom      34    a x 17MAR2016
#1       b_17MAR2016_CollectedStuff      65      b 17MAR2016
#2  c_z_k_17MAR2016_Collectedan3dom       7  c z k 17MAR2016

edited Dec 7, 2016 at 19:45

answered Dec 7, 2016 at 17:54

4 Comments

Ted Petrou Over a year ago

This doesn't specifically search for the string 'Collected'

joel.wilson Over a year ago

But to get the expected output you shared, you might not need to search for 'Collected', right?

Ted Petrou Over a year ago

Yes, your output matches the output desired from Jeff Tilton but it doesn't answer his question. My answer answers his question but doesn't match his output. He needs to clarify things a bit.

Ted Petrou Over a year ago

This isn't my question, but your solution still has problems when the word 'Collected' does not immediately follow an underscore.
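
To illustrate the difference raised in this comment, here is a small sketch comparing the two patterns from the answers on a made-up string in which 'Collected' sits in the middle of a token. The input is hypothetical and does not come from the question's data:

```python
import re

# Hypothetical input: 'Collected' appears mid-token, not right after '_'.
s = 'a_17MAR2016_xCollectedStuff'

# Pattern from the re.sub answer: requires '_' immediately before 'Collected'.
anchored = re.sub(r'_Collected.*', '', s)

# Pattern from the str.replace answer: allows other characters of the
# same token between '_' and 'Collected'.
token_wide = re.sub(r'_[^_]*Collected.*', '', s)

print(anchored)    # a_17MAR2016_xCollectedStuff (no match, unchanged)
print(token_wide)  # a_17MAR2016
```

On the question's own sample rows both patterns give the same result, since 'Collected' always starts its token there.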
