I have a DataFrame like:

Instru,Name
16834306,INFOSYS18SEP640.50PE
16834306,INFOSYS18SEP640.50PE
16834306,BHEL18SEP52.80CE
16834306,BHEL18SEP52.80CE
16834306,IOCL18SEP640PE
16834306,IOCL18SEP640PE

I want to create separate columns by splitting the string in the Name column, as below:

Instru,Name,Symbol,Month,SP,Type
16834306,INFOSYS18SEP640.50PE,INFOSYS,18SEP,640.50,PE
16834306,INFOSYS18SEP640.50PE,INFOSYS,18SEP,640.50,PE
16834306,BHEL18SEP52.80CE,BHEL,18SEP,52.80,CE
16834306,BHEL18SEP52.80CE,BHEL,18SEP,52.80,CE
16834306,IOCL18SEP640PE,IOCL,18SEP,640,PE
16834306,IOCL18SEP640PE,IOCL,18SEP,640,PE

Note: in the SP column, decimal values should stay decimals and whole numbers should stay integers.

  • Is the structure always the same? 4 characters, 5 characters, then a float, and then the last two characters? Commented Sep 7, 2018 at 13:53
  • No, sir, it varies with the company symbol. The strings before and after '18SEP' can be used. Commented Sep 7, 2018 at 13:54
  • If you have a regex for it see answer here - stackoverflow.com/questions/46928636/… Commented Sep 7, 2018 at 14:03
  • Possible duplicate of Pandas extract numbers from column into new columns Commented Sep 7, 2018 at 14:04

3 Answers

5

Use pandas.Series.str.extract with named groups in the regex pattern:

pat = r'(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)'
df.join(df.Name.str.extract(pat))

     Instru                  Name   Symbol  Month      SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE

Explanation of the regex pattern

Writing a regex is a fuzzy business and something of an art form, so I'll explain what I did and why. Compare it with @jonclements's answer: we both attacked the problem with the same approach but made subtly different assumptions.

  • '(?P<group_name>pattern)' is a way to create a capture group and name it 'group_name'.
  • '(?P<Symbol>.*?)' grabs all characters up to the next capture group; the '?' says don't be greedy about it.
  • '(?P<Month>\d{1,2}\w{3})' grabs 1 or 2 digits, then 3 word characters (letters here, though \w also matches digits and underscores). The vagueness of 1 or 2 digits is why I made the prior group non-greedy.
  • '(?P<SP>[\d\.]+)' grabs one or more digits or periods. Admittedly, this isn't terribly graceful, as it could grab '4.2.4.5', but it should get the job done.
  • '(?P<Type>.*)' plays clean-up and grabs the rest.
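
To see how the groups split a single string, here is a minimal sketch that uses only Python's standard re module on the sample names from the question. The stricter SP sub-pattern in strict_pat is my own assumption, not part of the pattern above; it is one possible way to avoid matching something like '4.2.4.5'.

```python
import re

# The pattern from the answer, written as a raw string.
pat = re.compile(
    r'(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)'
)

# Assumed stricter variant: SP is digits with at most one decimal part.
strict_pat = re.compile(
    r'(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>\d+(?:\.\d+)?)(?P<Type>.*)'
)

names = ['INFOSYS18SEP640.50PE', 'BHEL18SEP52.80CE', 'IOCL18SEP640PE']

# groupdict() returns the named groups of each match as a dict.
parsed = [pat.match(name).groupdict() for name in names]
strict_parsed = [strict_pat.match(name).groupdict() for name in names]

for row in parsed:
    print(row)
```

On these three names both patterns produce the same groups; they would only differ on a name whose numeric part contains more than one period.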

2 Comments

Great answer. It could be useful if you explained the regex pattern.
@Alexander in process
4

You can use str.extract and apply .astype to the result to get your desired columns and your specific numeric column as a float:

separated = df.Name.str.extract(r"""(?ix)
    (?P<Symbol>[a-z]+)     # all letters up to a date that matches
    (?P<Month>\d{2}\w{3})  # the date (2 numbers then 3 letters)
    (?P<SP>.*?)            # everything until the "type"
    (?P<Type>\w{2}$)       # Last two characters of string is the type
""").astype({'SP': 'float'})

Which gives you:

    Symbol  Month     SP Type
0  INFOSYS  18SEP  640.5   PE
1  INFOSYS  18SEP  640.5   PE
2     BHEL  18SEP   52.8   CE
3     BHEL  18SEP   52.8   CE
4     IOCL  18SEP  640.0   PE
5     IOCL  18SEP  640.0   PE

Then use df.join(separated) to get your final DataFrame:

     Instru                  Name   Symbol  Month     SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE
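
The question asks for whole-number strikes to stay integers, which a single float column cannot do. One possible workaround, offered as a sketch and not part of the answer above, is to convert each extracted string individually. parse_strike is a hypothetical helper name, and the approach assumes every SP string is a valid number.

```python
def parse_strike(text):
    """Return an int for whole-number strings, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        # int() rejects strings with a decimal point, so fall back to float.
        return float(text)

samples = ['640.50', '52.80', '640']
converted = [parse_strike(s) for s in samples]
print(converted)  # [640.5, 52.8, 640]
```

With pandas, this could be applied by leaving out the .astype call and using something like separated['SP'].map(parse_strike), which would give an object-dtype column holding a mix of ints and floats. Note that trailing zeros are still lost ('640.50' becomes 640.5); keeping the column as strings is the only way to preserve them exactly.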

1 Comment

I like your explanation of the regex pattern. I'll make the same effort and contrast with yours.
2

You can define your own splitting function and create the desired output:

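import pandas as pd

def f(x):
    # Find the index of the first digit, which is where the symbol ends.
    for i, c in enumerate(x):
        if c.isdigit():
            break
    # Symbol, 5-character month (e.g. 18SEP), strike price, 2-character type.
    # Offsets are relative to i, so symbols of any length are handled.
    return [x[:i], x[i:i + 5], x[i + 5:-2], x[-2:]]

df[['Symbol', 'Month', 'SP', 'Type']] = pd.DataFrame(df.Name.apply(f).tolist(), index=df.index)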

     Instru               Name Symbol  Month      SP Type
0  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
1  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
2  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
3  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
4  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE
5  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE
