Pandas DataFrame get substrings from column

Question

I have a column named "KL" with values such as:

sem_0405M4209F2057_1.000
sem_A_0103M5836F4798_1.000

Now I want to extract the four digits after "M" and the four digits after "F". But with df["KL"].str.extract I can't get it to work.

Locations of M and F vary, thus just using the slice [9:13] won't work for the complete column.

Alex Riley · Accepted Answer · 2015-07-31 09:35:56Z

1

If you want to use str.extract, here's how:

>>> df['KL'].str.extract(r'M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})')
      M     F
0  4209  2057
1  5836  4798

Here, M(?P<M>[0-9]{4}) matches the character 'M' and then captures 4 digits following it (the [0-9]{4} part). This is put in the column M (specified with ?P<M> inside the capturing group). The same thing is done for F.
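
For completeness, here is a minimal self-contained sketch of the same call. The third row and the names `pattern`, `codes` and `result` are made up for illustration; they are not from the question. It shows that rows where the pattern does not match come back as NaN, and that the extracted values are strings rather than integers.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row without M/F codes
df = pd.DataFrame({"KL": ["sem_0405M4209F2057_1.000",
                          "sem_A_0103M5836F4798_1.000",
                          "sem_without_codes"]})

# Named groups become the column names of the returned DataFrame
pattern = r"M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})"
codes = df["KL"].str.extract(pattern)

# Attach the extracted columns to the original frame (indexes line up)
result = df.join(codes)
print(result)
```

If the codes are needed as numbers, they would have to be converted afterwards, keeping in mind that unmatched rows hold NaN.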

edited Jul 31, 2015 at 9:35

answered Jul 31, 2015 at 9:32

Alex Riley

EdChum · Accepted Answer · 2015-07-31 09:23:16Z

0

You could use split to achieve this, though a better way probably exists:

In [147]:
s = pd.Series(['sem_0405M4209F2057_1.000','sem_A_0103M5836F4798_1.000'])
s

Out[147]:
0      sem_0405M4209F2057_1.000
1    sem_A_0103M5836F4798_1.000
dtype: object

In [153]:
m = s.str.split('M').str[1].str.split('F').str[0].str[:4]
f = s.str.split('M').str[1].str.split('F').str[1].str[:4]
print(m)
print(f)

0    4209
1    5836
dtype: object

0    2057
1    4798
dtype: object
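
One detail to watch when chaining like this: a plain `[:4]` on a Series slices rows, while `.str[:4]` slices the characters of each string. With only two rows the two can look identical, so here is a small sketch with five made-up strings (not from the question) to show the difference:

```python
import pandas as pd

# Hypothetical five-row Series, long enough to make the difference visible
s = pd.Series(["abcdefgh", "ijklmnop", "qrstuvwx", "yz012345", "6789abcd"])

rows = s[:4]        # first four ROWS, strings left untouched
chars = s.str[:4]   # first four CHARACTERS of every string

print(rows)
print(chars)
```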

answered Jul 31, 2015 at 9:23

EdChum

DeepSpace · Accepted Answer · 2015-07-31 09:37:34Z

0

You can also use regex:

import re

import pandas as pd

def get_data(x):
    # Capture the four digits after 'M' and the four digits after 'F'
    data = re.search(r'M(\d{4})F(\d{4})', x)
    if data:
        m = data.group(1)
        f = data.group(2)

        return m, f
    # No match: return None explicitly
    return None

df = pd.DataFrame(data={'a': ['sem_0405M4209F2057_1.000', 'sem_0405M4239F2027_1.000']})

df['data'] = df['a'].apply(get_data)

>>
                          a          data
0  sem_0405M4209F2057_1.000  (4209, 2057)
1  sem_0405M4239F2027_1.000  (4239, 2027)
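
If you would rather have two separate columns than a column of tuples, one possible variation is sketched below. The column names `m` and `f`, the precompiled `PATTERN`, and the third non-matching row are my own additions for illustration; the function returns a pair of `None` values when nothing matches so that every row has the same shape.

```python
import re

import pandas as pd

# Same regex as above, compiled once
PATTERN = re.compile(r"M(\d{4})F(\d{4})")

def get_data(x):
    match = PATTERN.search(x)
    if match is None:
        # Keep a two-element shape so the result expands cleanly
        return (None, None)
    return match.groups()

# Third row is hypothetical and contains no M/F codes
df = pd.DataFrame({"a": ["sem_0405M4209F2057_1.000",
                         "sem_0405M4239F2027_1.000",
                         "sem_without_codes"]})

pairs = df["a"].apply(get_data)

# Turn the Series of tuples into a two-column frame and attach it
parts = pd.DataFrame(pairs.tolist(), index=df.index, columns=["m", "f"])
out = df.join(parts)
print(out)
```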

edited Jul 31, 2015 at 9:37

answered Jul 31, 2015 at 9:31

DeepSpace
