
I am creating a column to add a tag to some strings and have working code here:

import pandas as pd
import numpy as np
import re

data = pd.DataFrame({'Lang': ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
data['Type'] = ""

pat = [r"^P\w", r"^S\w"]  # raw strings avoid invalid escape sequences

for i in range(len(data.Lang)):
    if re.search(pat[0], data.Lang.iloc[i]):
        data.loc[i, 'Type'] = "B"

    if re.search(pat[1], data.Lang.iloc[i]):
        data.loc[i, 'Type'] = "A"

print(data)

Is there a way to get rid of that for loop? If this were NumPy, a function like arange would do something similar to what I am trying to find.

2 Answers


This will be faster than the apply solution (and the looping solution).

FYI: this works as shown in 0.13; in 0.12 you would need to create the Type column first.

In [36]: data.loc[data.Lang.str.match(pat[0]),'Type'] = 'B'

In [37]: data.loc[data.Lang.str.match(pat[1]),'Type'] = 'A'

In [38]: data
Out[38]: 
     Lang Type
0  Python    B
1  Cython  NaN
2   Scipy    A
3   Numpy  NaN
4  Pandas    B

[5 rows x 2 columns]

In [39]: data.fillna('')
Out[39]: 
     Lang Type
0  Python    B
1  Cython     
2   Scipy    A
3   Numpy     
4  Pandas    B

[5 rows x 2 columns]
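On current pandas (where `.ix` has been removed), the whole approach above fits in a few lines; a self-contained sketch, with `fillna` assigned back since it returns a new frame:

```python
import pandas as pd

data = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})

# Vectorized regex matching builds boolean masks; no Python-level loop.
data.loc[data.Lang.str.match(r"^P\w"), "Type"] = "B"
data.loc[data.Lang.str.match(r"^S\w"), "Type"] = "A"

# Rows that matched neither pattern are NaN; replace them with empty strings.
data["Type"] = data["Type"].fillna("")
print(data)
```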

Here's some timings:

In [34]: bigdata = pd.concat([data]*2000,ignore_index=True)

In [35]: def f3(df):
    df = df.copy()
    df['Type'] = ''
    for i in range(len(df.Lang)):
        if re.search(pat[0], df.Lang.iloc[i]):
            df.loc[i, 'Type'] = 'B'
        if re.search(pat[1], df.Lang.iloc[i]):
            df.loc[i, 'Type'] = 'A'

In [36]: def f2(df):
    df = df.copy()
    df.loc[df.Lang.str.match(pat[0]),'Type'] = 'B'
    df.loc[df.Lang.str.match(pat[1]),'Type'] = 'A'
    df.fillna('')

In [37]: def f1(df):
    df = df.copy()
    f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''
    df['Type'] = df['Lang'].apply(f)

Your original solution:

In [41]: %timeit f3(bigdata)
1 loops, best of 3: 2.21 s per loop

Direct indexing

In [42]: %timeit f2(bigdata)
100 loops, best of 3: 17.3 ms per loop

Apply

In [43]: %timeit f1(bigdata)
10 loops, best of 3: 21.8 ms per loop

Here's another, more general method that is a bit faster, and probably more useful, since you can then combine the patterns, e.g. in a groupby, if you want.

In [107]: pats
Out[107]: {'A': '^P\\w', 'B': '^S\\w'}

In [108]: concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
Out[108]: 
      Lang    A    B
0   Python    A  NaN
1   Cython  NaN  NaN
2    Scipy  NaN    B
3    Numpy  NaN  NaN
4   Pandas    A  NaN
5   Python    A  NaN
6   Cython  NaN  NaN
       ...  ...  ...
45  Python    A  NaN
46  Cython  NaN  NaN
47   Scipy  NaN    B
48   Numpy  NaN  NaN
49  Pandas    A  NaN
50  Python    A  NaN
51  Cython  NaN  NaN
52   Scipy  NaN    B
53   Numpy  NaN  NaN
54  Pandas    A  NaN
55  Python    A  NaN
56  Cython  NaN  NaN
57   Scipy  NaN    B
58   Numpy  NaN  NaN
59  Pandas    A  NaN
       ...  ...  ...

[10000 rows x 3 columns]

In [106]: %timeit  concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
100 loops, best of 3: 15.5 ms per loop

This tacks on a Series for each of the patterns, placing the letter in the matching positions (and NaN otherwise).

Create a series of that letter

Series(c,index=df.index)

Select the matches out of it

Series(c,index=df.index)[df.Lang.str.match(p)]

Reindexing puts NaN where the value is not in the index

Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index))
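The three steps can be traced on a small frame (a sketch with the imports spelled out; the session above relied on bare `Series`/`DataFrame`/`concat` names):

```python
import pandas as pd

df = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy"]})

# 1. A constant Series holding the tag letter, on df's index.
tag = pd.Series("A", index=df.index)

# 2. Keep only the rows whose Lang matches the pattern.
matched = tag[df.Lang.str.match(r"^P\w")]

# 3. Reindex back to the full index; rows that did not match become NaN.
result = matched.reindex(df.index)
```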

2 Comments

Wow, thanks. One question: if I have lots of patterns, is there an efficient function to iterate through them?
I added another (probably better) solution.

You can do both classifications with one lambda:

f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''

then use apply to fill your "Type" column:

data.Type = data.Lang.apply(f)

output:

     Lang Type
0  Python    A
1  Cython
2   Scipy    B
3   Numpy
4  Pandas    A

Edit: this didn't compare well in the benchmarks above. If you want to speed things up, precompile the regexes so they are not recompiled on every call:

def f1(df):
    df = df.copy()
    f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''
    df['Type'] = df['Lang'].apply(f)
    return df

def f1_1(df):
    df = df.copy()
    re1, re2 = re.compile(pat[0]), re.compile(pat[1])
    f = lambda s: re1.match(s) and 'A' or re2.match(s) and 'B' or ''
    df.Type = df.Lang.apply(f)
    return df

bigdata = pd.concat([data]*2000,ignore_index=True)

original Apply:

In [18]:  %timeit f1(bigdata)
10 loops, best of 3: 23.1 ms per loop

revised Apply:

In [19]: %timeit f1_1(bigdata)
100 loops, best of 3: 6.65 ms per loop
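For completeness, on current pandas/NumPy the same two-way classification can be written without `apply` at all, e.g. with `numpy.select` (a sketch, not part of the original answers; the letters follow this answer's A/B assignment):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})

# One boolean mask per pattern; the first condition that is True wins.
conditions = [
    data.Lang.str.match(r"^P\w"),
    data.Lang.str.match(r"^S\w"),
]
data["Type"] = np.select(conditions, ["A", "B"], default="")
print(data)
```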
