How to create separate substring column from dataframe column value

Question

I have a dataframe like this:

Instru,Name
16834306,INFOSYS18SEP640.50PE
16834306,INFOSYS18SEP640.50PE
16834306,BHEL18SEP52.80CE
16834306,BHEL18SEP52.80CE
16834306,IOCL18SEP640PE
16834306,IOCL18SEP640PE

I want to create separate columns by extracting substrings from the Name column, as below:

Instru,Name,Symbol,Month,SP,Type
16834306,INFOSYS18SEP640.50PE,INFOSYS,18SEP,640.50,PE
16834306,INFOSYS18SEP640.50PE,INFOSYS,18SEP,640.50,PE
16834306,BHEL18SEP52.80CE,BHEL,18SEP,52.80,CE
16834306,BHEL18SEP52.80CE,BHEL,18SEP,52.80,CE
16834306,IOCL18SEP640PE,IOCL,18SEP,640,PE
16834306,IOCL18SEP640PE,IOCL,18SEP,640,PE

Note: In the SP column, decimals should appear as decimals and integers as integers.

Is the structure always the same? 4 characters, 5 characters, then a float, and then the last two characters?
– Yuca, Commented Sep 7, 2018 at 13:53
No sir, it may vary with the company symbol. The strings before and after '18SEP' may be considered.
– Pravat, Commented Sep 7, 2018 at 13:54
If you have a regex for it, see the answer here - stackoverflow.com/questions/46928636/…
– Tom Ron, Commented Sep 7, 2018 at 14:03
Possible duplicate of Pandas extract numbers from column into new columns
– Lev Zakharov, Commented Sep 7, 2018 at 14:04

piRSquared · Accepted Answer · 2018-09-07 14:37:13Z

5

Using pandas.Series.str.extract with named groups in regex pattern

pat = r'(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)'
df.join(df.Name.str.extract(pat))

     Instru                  Name   Symbol  Month      SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE
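
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's sample data first. The DataFrame construction and the variable name result are additions for illustration, not part of the original answer; the pattern is the one shown above, written as a raw string.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Instru": [16834306] * 6,
    "Name": [
        "INFOSYS18SEP640.50PE", "INFOSYS18SEP640.50PE",
        "BHEL18SEP52.80CE", "BHEL18SEP52.80CE",
        "IOCL18SEP640PE", "IOCL18SEP640PE",
    ],
})

# Named groups become the column names of the extracted frame.
pat = r"(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)"

# 'result' is a hypothetical name; join aligns on the shared index.
result = df.join(df.Name.str.extract(pat))
print(result)
```

The SP column comes back as strings here, so '640.50' and '640' keep exactly the form they had in Name.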

Explanation of the regex pattern

Regex is a funny, fuzzy business and something of an art form. I'll explain what I did and why. You can compare my pattern with @jonclements's and see that we both attacked the problem with the same approach but made subtly different assumptions.

'(?P<group_name>pattern)' is a way to create a capture group and name it 'group_name'.
'(?P<Symbol>.*?)' grabs all characters up to the next capture group; the '?' says don't be greedy about it.
'(?P<Month>\d{1,2}\w{3})' grabs 1 or 2 digits, then 3 word characters (letters in this data, although \w also matches digits and underscores). The vagueness of 1 or 2 digits is why I made the prior group non-greedy.
'(?P<SP>[\d\.]+)' grabs one or more digits or periods. Admittedly, this isn't terribly graceful, as it could grab '4.2.4.5', but it should get the job done.
'(?P<Type>.*)' plays clean-up and grabs the rest.
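
To see the groups in action outside pandas, here is a small sketch using the standard re module. The second pattern, named greedy, is a hypothetical variant made up purely to show why the Symbol group is non-greedy; it does not come from either answer.

```python
import re

# Pattern from the explanation above, written as a raw string.
pat = re.compile(
    r"(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)"
)

parts = pat.search("INFOSYS18SEP640.50PE").groupdict()
print(parts)
# {'Symbol': 'INFOSYS', 'Month': '18SEP', 'SP': '640.50', 'Type': 'PE'}

# Hypothetical variant: identical except that Symbol is greedy.
greedy = re.compile(
    r"(?P<Symbol>.*)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)"
)

greedy_parts = greedy.search("INFOSYS18SEP640.50PE").groupdict()
print(greedy_parts["Symbol"], greedy_parts["Month"])
# INFOSYS1 8SEP

# The SP group really does accept something like '4.2.4.5'.
odd_sp = pat.search("ABC18SEP4.2.4.5PE").group("SP")
print(odd_sp)
# 4.2.4.5
```

With the greedy Symbol group, the first digit of the day gets swallowed into the symbol, because a single digit is still enough to satisfy \d{1,2}.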

edited Sep 7, 2018 at 14:37

answered Sep 7, 2018 at 14:07

piRSquared


2 Comments

Alexander Over a year ago

Great answer. It could be useful if you explained the regex patterns.

piRSquared Over a year ago

@Alexander in process

Jon Clements · Accepted Answer · 2018-09-07 14:16:23Z

4

You can use str.extract and apply .astype to the result to get your desired columns and your specific numeric column as a float:

separated = df.Name.str.extract(r"""(?ix)
    (?P<Symbol>[a-z]+)     # all letters up to a date that matches
    (?P<Month>\d{2}\w{3})  # the date (2 numbers then 3 letters)
    (?P<SP>.*?)            # everything until the "type"
    (?P<Type>\w{2}$)       # Last two characters of string is the type
""").astype({'SP': 'float'})

Which'll give you:

    Symbol  Month     SP Type
0  INFOSYS  18SEP  640.5   PE
1  INFOSYS  18SEP  640.5   PE
2     BHEL  18SEP   52.8   CE
3     BHEL  18SEP   52.8   CE
4     IOCL  18SEP  640.0   PE
5     IOCL  18SEP  640.0   PE

Then apply df.join(separated) to get your final DF of:

     Instru                  Name   Symbol  Month     SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE
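
Note that the float cast shows 640 as 640.0 and 640.50 as 640.5, which is not quite what the question's note asks for. One possible workaround, sketched here, is to convert each extracted string individually. parse_sp is a hypothetical helper (it is not part of this answer or of pandas), and using decimal.Decimal to preserve the trailing zero is an assumption about what "decimal as decimal" should mean.

```python
from decimal import Decimal


def parse_sp(text):
    """Hypothetical helper: int for whole strikes, Decimal otherwise."""
    # Decimal keeps the digits as written, so '640.50' stays 640.50.
    return Decimal(text) if "." in text else int(text)


values = [parse_sp(s) for s in ["640.50", "52.80", "640"]]
print(values)
# [Decimal('640.50'), Decimal('52.80'), 640]
```

It could be applied to the extracted SP strings before any float cast, for example with Series.map, at the cost of ending up with an object-dtype column.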

edited Sep 7, 2018 at 14:16

answered Sep 7, 2018 at 14:08

Jon Clements


1 Comment

piRSquared Over a year ago

I like your explanation of the regex pattern. I'll make the same effort and contrast with yours.

Yuca · Accepted Answer · 2018-09-07 14:07:35Z

2

You can define your own splitting function and create the desired output:

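def f(x):
    # Find the index of the first digit, where the month code starts.
    for i, c in enumerate(x):
        if c.isdigit():
            break
    # Slice relative to i so symbols of any length work; the month code is
    # assumed to be 5 characters (e.g. '18SEP') and the type the last 2.
    return [x[0:i], x[i:i + 5], x[i + 5:-2], x[-2:]]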

df[['Symbol','Month','SP','Type']] = pd.DataFrame(df.Name.apply(f).tolist())

     Instru               Name Symbol  Month      SP Type
0  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
1  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
2  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
3  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
4  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE
5  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE

answered Sep 7, 2018 at 14:07

Yuca

