How to extract sections of a string in a pandas dataframe column

Question

I have a dataframe, df, and I would like to extract two pieces from each value in the type column: the first 'word' and the number together with its 'T' suffix (the #T value). The first word is the text before the first '-', except in the 'Azure' case, where the first word is followed by '_'.

It is tricky because some of the #T values are preceded by '-' (for example -12T), while others are preceded by '_' (for example _14T). I would also like to keep the original values of the type column.

Sample Data

import pandas as pd

data = {'type': ['Azure_Standard_E64is_v4_SPECIAL_DB-A.0', 'Azure_Standard_E64is_v4_SPECIAL_DB-A.0', 'Hello-HEL-HE-A6123-123A-12T_TYPE-v.A', 'Hello-HEL-HE-A6123-123A-12T_TYPE-v.E', 'Hello-HEL-HE-A6123-123A-50T_TYPE-v.C', 'Hello-HEL-HE-A6123-123A-50T_TYPE-v.A', 'Happy-HAP-HA-R650-570A-90T_version-v.A', 'Kind-KIN-KI-T490-NET_14T-A.0', 'Kind-KIN-KI-T490-NET_14T-A.0', 'AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A', 'AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A'], 'free': [6, 5, 10, 5, 1, 2, 10, 7, 6, 3, 0], 'use': [1, 1, 10, 1, 4, 1, 0, 4, 3, 0, 20], 'total': [7, 6, 20, 6, 5, 1, 10, 3, 2, 3, 20]}
df = pd.DataFrame(data)


                                      type  free  use  total
0   Azure_Standard_E64is_v4_SPECIAL_DB-A.0     6    1      7
1   Azure_Standard_E64is_v4_SPECIAL_DB-A.0     5    1      6
2     Hello-HEL-HE-A6123-123A-12T_TYPE-v.A    10   10     20
3     Hello-HEL-HE-A6123-123A-12T_TYPE-v.E     5    1      6
4     Hello-HEL-HE-A6123-123A-50T_TYPE-v.C     1    4      5
5     Hello-HEL-HE-A6123-123A-50T_TYPE-v.A     2    1      1
6   Happy-HAP-HA-R650-570A-90T_version-v.A    10    0     10
7             Kind-KIN-KI-T490-NET_14T-A.0     7    4      3
8             Kind-KIN-KI-T490-NET_14T-A.0     6    3      2
9      AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A     3    0      3
10     AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A     0   20     20

Desired:

   Name                                          type                free   use  total
  
   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure               6       1    7       
   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure               5       1    6                                       
   Hello-HEL-HE-A6123-123A-12T_TYPE-v.A          Hello   12T         10      10  20
   Hello-HEL-HE-A6123-123A-12T_TYPE-v.E          Hello   12T         5       1    6
   Hello-HEL-HE-A6123-123A-50T_TYPE-v.C          Hello   50T         1       4    5
   Hello-HEL-HE-A6123-123A-50T_TYPE-v.A          Hello   50T         2       1    1
   Happy-HAP-HA-R650-570A-90T_version-v.A        Happy   90T         10      0   10
   Kind-KIN-KI-T490-NET_14T-A.0                  Kind    14T         7      4    3
   Kind-KIN-KI-T490-NET_14T-A.0                  Kind    14T         6      3    2
   AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A           AY14.5  6.4T        3      0    3
   AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A           AY14.5  6.4T        0      20   20

Doing:

df['type']= df['type'].str.extract(r'(^\w+.\d|^\w+)')+' '+df['type'].str.extract(r'(\d.\d+T|\d+T)')

This works, except that the 'Azure' value disappears and the original value is not kept. I am still researching this; any assistance is appreciated.

Use df['type'].str.extract(r'(\d.\d+T|\d+T)').fillna('') instead of df['type'].str.extract(r'(\d.\d+T|\d+T)'); then the 'Azure' value will not disappear.
– Ferris, Commented Jan 12, 2021 at 6:20
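The reason the 'Azure' rows vanish is that a string concatenated with NaN is NaN. A minimal sketch of the fillna('') idea from the comment above follows. It is an illustration, not the asker's exact code: the two patterns are stricter variants (the dot is escaped and the first word is taken as everything before the first '_' or '-'), and the column name short is hypothetical, chosen so that the original type column stays untouched.

```python
import pandas as pd

df = pd.DataFrame({'type': ['Azure_Standard_E64is_v4_SPECIAL_DB-A.0',
                            'Hello-HEL-HE-A6123-123A-12T_TYPE-v.A',
                            'Kind-KIN-KI-T490-NET_14T-A.0',
                            'AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A']})

# First word: everything before the first '_' or '-'.
first = df['type'].str.extract(r'^([^_-]+)', expand=False)

# Optional "<number>T" token; NaN where there is none (the Azure row).
size = df['type'].str.extract(r'(\d+(?:\.\d+)?T)', expand=False)

# Without fillna, the missing size turns the whole result into NaN.
naive = first + ' ' + size

# With fillna(''), the first word survives; strip removes the trailing space.
# 'short' is a hypothetical column name; 'type' keeps its original values.
df['short'] = (first + ' ' + size.fillna('')).str.strip()

print(df)
```

Writing to a separate column is one way to keep the original strings available next to the shortened label.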

jezrael · Accepted Answer · 2021-01-12 06:12:28Z


You can combine Series.str.replace with Series.str.cat and finish with Series.str.strip. Passing expand=False to Series.str.extract makes it return a Series instead of a DataFrame.

The new column is placed in the second position with DataFrame.insert.

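s = (df['type'].str.replace('_','-')
               # first word, optionally with a decimal part such as AY14.5
               .str.extract(r'(^\w+.\d|^\w+)', expand=False)
               # append the #T value; na_rep='' keeps rows that have none
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False),
                        sep=' ',
                        na_rep='')
               # drop the trailing space left when there is no #T value
               .str.strip())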

Thanks to @Trenton McKinney for another solution: split the values and take the first element of each list:

s = (df['type'].str.split('_|-')
               .str[0]
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False), 
                        sep=' ', 
                        na_rep='')
               .str.strip())

df = df.rename(columns={'type': 'Name'})
df.insert(1, 'type', s)
print(df)
                                      Name         type  free  use  total
0   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure     6    1      7
1   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure     5    1      6
2     Hello-HEL-HE-A6123-123A-12T_TYPE-v.A    Hello 12T    10   10     20
3     Hello-HEL-HE-A6123-123A-12T_TYPE-v.E    Hello 12T     5    1      6
4     Hello-HEL-HE-A6123-123A-50T_TYPE-v.C    Hello 50T     1    4      5
5     Hello-HEL-HE-A6123-123A-50T_TYPE-v.A    Hello 50T     2    1      1
6   Happy-HAP-HA-R650-570A-90T_version-v.A    Happy 90T    10    0     10
7             Kind-KIN-KI-T490-NET_14T-A.0     Kind 14T     7    4      3
8             Kind-KIN-KI-T490-NET_14T-A.0     Kind 14T     6    3      2
9      AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A  AY14.5 6.4T     3    0      3
10     AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A  AY14.5 6.4T     0   20     20
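For reuse, the steps above can be wrapped in a small function. This is a sketch under stated assumptions, not part of the original answer: the function name add_short_type is hypothetical, the sample frame is a shortened version of the question's data, and the #T pattern is a slightly stricter variant with the dot escaped.

```python
import pandas as pd

def add_short_type(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the original strings in 'Name' and a short label in 'type'.

    Hypothetical helper: the name and signature are illustrative only.
    """
    full = df['type']
    # First word: split on '_' or '-' and keep the first piece.
    first = full.str.split('_|-').str[0]
    # Optional "<number>T" token, e.g. 12T or 6.4T; NaN where absent.
    size = full.str.extract(r'(\d+(?:\.\d+)?T)', expand=False)
    # na_rep='' keeps rows without a #T value; strip removes the trailing space.
    label = first.str.cat(size, sep=' ', na_rep='').str.strip()
    # rename returns a new frame, so the caller's frame is not modified.
    out = df.rename(columns={'type': 'Name'})
    out.insert(1, 'type', label)
    return out

sample = pd.DataFrame({
    'type': ['Azure_Standard_E64is_v4_SPECIAL_DB-A.0',
             'Happy-HAP-HA-R650-570A-90T_version-v.A',
             'AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A'],
    'free': [6, 10, 3],
})
result = add_short_type(sample)
print(result)
```

Because rename returns a new frame, sample still has its single type column after the call.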

edited Jan 12, 2021 at 6:12

answered Jan 12, 2021 at 5:53

jezrael


5 Comments

Lynn Over a year ago

OK, thank you. Is there a way to maintain the original values in that type column? I will try this.

Trenton McKinney Over a year ago

df['type'].str.replace('_','-').str.split('-', expand=True)[0] also works for the first part

jezrael Over a year ago

@TrentonMcKinney - thank you, I changed it a bit, but your idea is used.

Trenton McKinney Over a year ago

@Lynn It's too bad you don't need the 'DA'. I noticed that the group of words you want is always at index 5 if you split the string. So the entire thing could be something like df['type'].str.split('_|-', expand=True).iloc[:, [0, 5]]. However, the excellent answer from jezrael gives you exactly what you want.

Lynn Over a year ago

Thank you for the assistance with this. I am trying this now.
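The positional idea from Trenton McKinney's comment can be sketched as follows. It assumes, as the comment observes for this sample, that the wanted token always lands at index 5 after splitting on '_' or '-'. The column names first and size and the final filtering step are additions for illustration, not part of the comment.

```python
import pandas as pd

types = pd.Series(['Azure_Standard_E64is_v4_SPECIAL_DB-A.0',
                   'Hello-HEL-HE-A6123-123A-12T_TYPE-v.A',
                   'Kind-KIN-KI-T490-NET_14T-A.0',
                   'AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A'])

# Split on '_' or '-' into one column per piece.
parts = types.str.split('_|-', expand=True)

# Keep positions 0 and 5; the column names are hypothetical labels.
picked = parts.iloc[:, [0, 5]].set_axis(['first', 'size'], axis=1)

# Position 5 of the Azure string is not a #T value, so blank out
# anything that does not look like "<number>T".
is_size = picked['size'].str.fullmatch(r'\d+(?:\.\d+)?T', na=False)
picked = picked.assign(size=picked['size'].where(is_size, ''))

print(picked)
```

This relies on every string having the same layout, so the regex-based extraction in the answer is more robust when the number of pieces varies.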
