Benchmark and Speed up this Python code

Question

Does anyone know why test1b() is so much faster than test1a()? How do you identify which line is the bottleneck and choose an alternative function to speed it up? Please share your experience.

import numpy as np
import pandas as pd
import time

def test1a():
    cols = 13
    rows = 10000000
    raw_data = np.random.randint(2, size=cols * rows).reshape(rows, cols)
    col_names = ['v01', 'v02', 'v03', 'v04', 'v05', 'v06', 'v07',
                 'v08', 'v09', 'v10', 'v11', 'v12', 'outcome']
    df = pd.DataFrame(raw_data, columns=col_names)
    df['v11'] = df['v03'].apply(lambda x: ['t1', 't2', 't3', 't4'][np.random.randint(4)])
    df['v12'] = df['v03'].apply(lambda x: ['p1', 'p2'][np.random.randint(2)])
    return df


def test1b():
    cols = 13
    rows = 10000000
    raw_data = np.random.randint(2, size=(rows,cols))
    col_names = ['v01', 'v02', 'v03', 'v04', 'v05', 'v06', 'v07',
                 'v08', 'v09', 'v10', 'v11', 'v12', 'outcome']
    df = pd.DataFrame(raw_data, columns=col_names)
    df['v11'] = np.take(
        np.array(['t1', 't2', 't3', 't4'], dtype=object),
        np.random.randint(4, size=rows))
    df['v12'] = np.take(
        np.array(['p1', 'p2'], dtype=object),
        np.random.randint(2, size=rows))
    return df


start_time = time.time()
test1a()
t1a = time.time() - start_time

start_time = time.time()
test1b()
t1b = time.time() - start_time

print("Test1a: {}sec, Test1b: {}sec".format(t1a, t1b))

pstjohn · Accepted Answer · 2017-10-24 15:30:59Z

The lines slowing you down are the two calls to the pandas apply function. You can confirm this with IPython's %timeit magic (with df and rows already defined) by comparing

%timeit df['v11'] = df['v03'].apply(lambda x: ['t1', 't2', 't3', 't4'][np.random.randint(4)])

to

%timeit df['v11'] = np.take(
    np.array(['t1', 't2', 't3', 't4'], dtype=object),
    np.random.randint(4, size=rows))
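Outside IPython, roughly the same comparison can be sketched with the standard-library timeit module. This is only an illustration: the helper names with_apply and with_take are made up for this example, and rows is reduced well below the question's 10,000,000 so the script finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a much smaller row count than the question's 10,000,000
rows = 100_000
df = pd.DataFrame({'v03': np.random.randint(2, size=rows)})
labels = ['t1', 't2', 't3', 't4']


def with_apply():
    # Hypothetical helper: one Python-level lambda call (and one
    # randint call) per row, as in test1a()
    return df['v03'].apply(lambda x: labels[np.random.randint(4)])


def with_take():
    # Hypothetical helper: one randint call for all rows, then a single
    # indexed lookup, as in test1b()
    return np.take(np.array(labels, dtype=object),
                   np.random.randint(4, size=rows))


# Run each variant a few times and keep the best time to reduce noise
t_apply = min(timeit.repeat(with_apply, number=1, repeat=3))
t_take = min(timeit.repeat(with_take, number=1, repeat=3))
print("apply: {:.4f}sec, take: {:.4f}sec".format(t_apply, t_take))
```

Taking the minimum of several repeats is a common way to filter out interference from other processes; the absolute numbers will vary by machine.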

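Ultimately, Series.apply cannot vectorize your code the way the NumPy version can. It calls the Python lambda once per row (10,000,000 times here), and each call builds a new list and makes its own np.random.randint call, on top of the overhead pandas spends inferring the result dtype. The np.take version draws all the random indices in a single call and performs the lookup in compiled code.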
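When it is not obvious in advance which line to time, one possible approach is to profile the whole function with the standard-library cProfile module and look at what dominates the cumulative time. The sketch below is an assumption-laden illustration, not part of the original code: build_with_apply is a hypothetical cut-down version of test1a() with a small row count.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def build_with_apply(rows=50_000):
    # Hypothetical, reduced version of test1a(): one column plus one apply
    df = pd.DataFrame({'v03': np.random.randint(2, size=rows)})
    df['v11'] = df['v03'].apply(
        lambda x: ['t1', 't2', 't3', 't4'][np.random.randint(4)])
    return df


# Profile a single call and collect per-function statistics
profiler = cProfile.Profile()
profiler.runcall(build_with_apply)

# Print the ten entries with the largest cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats('cumulative').print_stats(10)
report = buffer.getvalue()
print(report)
```

In the report, the ncalls column shows how many times each function ran. A function called once per row is usually the first candidate to replace with a single vectorized call.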
