1

I have a pandas dataframe that looks like df and I want to add a column so it looks like df2.

import pandas as pd

df = pd.DataFrame({
    'Alternative': ['a_x_17MAR2016_Collectedran30dom', 'b_17MAR2016_CollectedStuff', 'c_z_k_17MAR2016_Collectedan3dom'],
    'Values': [34, 65, 7],
})

df2 = pd.DataFrame({
    'Alternative': ['a_x_17MAR2016_Collectedran30dom', 'b_17MAR2016_CollectedStuff', 'c_z_k_17MAR2016_Collectedan3dom'],
    'Values': [34, 65, 7],
    'Alts': ['a x 17MAR2016', 'b 17MAR2016', 'c z k 17MAR2016'],
})

df
Out[4]: 
                       Alternative  Values
0  a_x_17MAR2016_Collectedran30dom      34
1       b_17MAR2016_CollectedStuff      65
2  c_z_k_17MAR2016_Collectedan3dom       7

df2
Out[5]: 
                       Alternative             Alts  Values
0  a_x_17MAR2016_Collectedran30dom    a x 17MAR2016      34
1       b_17MAR2016_CollectedStuff      b 17MAR2016      65
2  c_z_k_17MAR2016_Collectedan3dom  c z k 17MAR2016       7

In other words, each value is a string that can be split on an underscore delimiter into a varying number of parts. I want to split it, then rejoin the parts with spaces, dropping the part that contains the substring 'Collected' and every part after it.

I can locate the index of the string containing the substring 'Collected' in an individual list, as I found here, and then combine the other strings, but I cannot seem to do it in a 'pythonic' way across the whole dataframe.
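
For reference, the per-string logic described above can be sketched as a plain function applied to the column with Series.apply. This is only a minimal sketch of the approach the question describes: the helper name strip_from_collected and its marker parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def strip_from_collected(value, marker='Collected'):
    """Join the underscore-separated parts that come before the first
    part containing `marker` (hypothetical helper, for illustration)."""
    kept = []
    for part in value.split('_'):
        if marker in part:
            break  # drop this part and everything after it
        kept.append(part)
    return ' '.join(kept)


df = pd.DataFrame({
    'Alternative': ['a_x_17MAR2016_Collectedran30dom',
                    'b_17MAR2016_CollectedStuff',
                    'c_z_k_17MAR2016_Collectedan3dom'],
    'Values': [34, 65, 7],
})

# Apply the helper to every row of the column.
df['Alts'] = df['Alternative'].apply(strip_from_collected)
print(df['Alts'].tolist())
```

This works row by row in Python, so it is the baseline that a vectorized string method would replace.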

Thanks in advance

3 Answers

2

I believe this technically answers the question, but it may not match the desired output, as the date does not contain the word 'Collected'.

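# regex=True is required on pandas >= 2.0, where str.replace treats the pattern as a literal string by default
df.Alternative.str.replace('_[^_]*Collected.*', '', regex=True).str.replace('_', ' ')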

Output

0      a x 17MAR2016
1        b 17MAR2016
2    c z k 17MAR2016

1 Comment

Sorry for the confusion, but I feel this answers the question and uses pandas without the need to import re. I appreciate the help
2

Use str.split:

alts = df.Alternative.str.split('_').str[:-1].str.join(' ')
df.insert(1, 'Alts', alts)
df

0
import re
x = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x))
# x
#0      a_x_17MAR2016
#1        b_17MAR2016
#2    c_z_k_17MAR2016

y = x.str.split("_")
#0       [a, x, 17MAR2016]
#1          [b, 17MAR2016]
#2    [c, z, k, 17MAR2016] 

df['newcol'] = y.apply(lambda z: ' '.join(z))
#                       Alternative  Values           newcol
#0  a_x_17MAR2016_Collectedran30dom      34    a x 17MAR2016
#1       b_17MAR2016_CollectedStuff      65      b 17MAR2016
#2  c_z_k_17MAR2016_Collectedan3dom       7  c z k 17MAR2016

All in one line:

import re
df['newcol'] = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x)).str.split("_").apply(lambda z: ' '.join(z))

#                       Alternative  Values           newcol
#0  a_x_17MAR2016_Collectedran30dom      34    a x 17MAR2016
#1       b_17MAR2016_CollectedStuff      65      b 17MAR2016
#2  c_z_k_17MAR2016_Collectedan3dom       7  c z k 17MAR2016

4 Comments

This doesn't specifically search for the string 'Collected'
But to get the expected output you shared, you might not need to search for 'Collected', right?
Yes, your output matches the output desired from Jeff Tilton but it doesn't answer his question. My answer answers his question but doesn't match his output. He needs to clarify things a bit.
This isn't my question, but your solution still has problems when the word 'Collected' does not immediately follow an underscore.
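
To illustrate the point in the last comment, here is a small sketch comparing the two patterns used in this thread. The second input value, where 'Collected' sits in the middle of a token, is invented for the comparison and does not come from the question's data.

```python
import pandas as pd

s = pd.Series([
    'b_17MAR2016_CollectedStuff',    # 'Collected' directly after an underscore
    'a_17MAR2016_xCollected_more',   # made-up case: 'Collected' inside a token
])

# Pattern that requires 'Collected' to immediately follow an underscore.
anchored = s.str.replace('_Collected.*', '', regex=True).str.replace('_', ' ')

# Pattern that drops the whole underscore-delimited token containing 'Collected'.
whole_token = s.str.replace('_[^_]*Collected.*', '', regex=True).str.replace('_', ' ')

print(anchored.tolist())     # ['b 17MAR2016', 'a 17MAR2016 xCollected more']
print(whole_token.tolist())  # ['b 17MAR2016', 'a 17MAR2016']
```

Both patterns agree on the question's sample data; they differ only when the token containing 'Collected' has extra characters before that word.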
