I have a dataframe, test, shown below:

 Student_Id  Math  Physical  Arts Class Sub_Class
0        id_1     6         7     9     A         x
1        id_2     9         7     1     A         y
2        id_3     3         5     5     C         x
3        id_4     6         8     9     A         x
4        id_5     6         7    10     B         z
5        id_6     9         5    10     B         z
6        id_7     3         5     6     C         x
7        id_8     3         4     6     C         x
8        id_9     6         8     9     A         x
9       id_10     6         7    10     B         z
10      id_11     9         5    10     B         z
11      id_12     3         5     6     C         x

There are two arrays, arr_list and array_top, defined in the My Code section below.

I want to create a new column, Highest_Score, by looping through each row of the dataframe and picking the value from the arrays as below:

for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]

This loop takes too much time on a larger dataset. Is there a faster way to do this?

My Code

import pandas as pd
import numpy as np

# Create dataframe
data = [
    ["id_1",6,7,9, "A", "x"],
    ["id_2",9,7,1, "A","y" ],
    ["id_3",3,5,5, "C", "x"],
    ["id_4",6,8,9, "A","x" ],
    ["id_5",6,7,10, "B", "z"],
    ["id_6",9,5,10,"B", "z"],
    ["id_7",3,5,6, "C", "x"],
    ["id_8",3,4,6, "C", "x"],
    ["id_9",6,8,9, "A","x" ],
    ["id_10",6,7,10, "B", "z"],
    ["id_11",9,5,10,"B", "z"],
    ["id_12",3,5,6, "C", "x"]
    
]

test = pd.DataFrame(data, columns = ['Student_Id', 'Math', 'Physical','Arts', 'Class', 'Sub_Class'])


# Create two arrays with the same length as the test data
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6]])

array_top = np.array([[0],[1],[1],[2],[1], [0], [0],[1],[1],[2],[1], [0]])

# Create the column Highest_Score
for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]
  • What is the source of the data for arr_list and array_top? Commented Aug 11, 2021 at 19:18
  • I think you should be able to do this by converting arr_list and array_top to dataframes, then join them with test. Commented Aug 11, 2021 at 19:20

1 Answer

Building the new column by looping over the arrays first, then assigning it to the dataframe in one step, is much faster than looping through each row of the dataframe.

In my timing, that is 71.7 µs vs 2.77 ms (about 39 times faster):

In [95]: %%timeit
    ...: new_test['Highest_Score'] = [arr_list[r][c][0] for r,c in enumerate(array_top)]
    ...:
    ...:
71.7 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [96]: %%timeit
    ...: for index, row in test.iterrows():
    ...:       test.loc[index,'Highest_Score'] = arr_list [index][array_top [index]]
    ...:
2.77 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As a general rule when adding new data to a pandas DataFrame, do all of the looping and compiling outside of pandas, then assign the data all at once.
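
If you want to go one step further, the Python-level loop can probably be dropped altogether with NumPy integer indexing. This is only a sketch on a shortened copy of the question's arrays (four rows instead of twelve, and a one-column stand-in for test); I have not timed it against the list comprehension above.

```python
import numpy as np
import pandas as pd

# Shortened stand-ins for the question's data (assumption: 4 rows, not 12)
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0], [1], [1], [2]])
test = pd.DataFrame({"Student_Id": ["id_1", "id_2", "id_3", "id_4"]})

# One row index per row of arr_list, paired with the column index from array_top
rows = np.arange(len(arr_list))
cols = array_top.ravel()          # shape (4, 1) -> shape (4,)
highest = arr_list[rows, cols]    # picks arr_list[i, cols[i]] for every i

# Assign the whole column in one step
test["Highest_Score"] = highest

# np.take_along_axis expresses the same selection and accepts the (4, 1) shape directly
same = np.take_along_axis(arr_list, array_top, axis=1).ravel()
```

Both forms give plain scalars per row rather than one-element arrays, so the [0] used in the list comprehension is not needed here.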

2 Comments

Thanks @Jim! This is pretty fast. Could you explain how [r][c] gets updated by enumerate?
Yeah, enumerate() iterates over whatever is passed to it and yields a tuple of the index and the value for each item. I chose r and c as variable names to represent the row and column selected from arr_list. Looking at your original loop, the Highest_Score you want takes the rows in order going down, and the column is determined by the value in array_top.
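
To make that concrete, here is a small sketch on a three-row cut of the arrays (the shortened data and the names pairs and picked are just for illustration). It shows what enumerate hands back at each step and why the trailing [0] is there.

```python
import numpy as np

# Three-row cut of the question's arrays (assumption: shortened for illustration)
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0], [1], [2]])

# enumerate yields (position, item): r counts 0, 1, 2 and
# c is the matching row of array_top, a one-element array such as array([1])
pairs = [(r, c) for r, c in enumerate(array_top)]

# arr_list[r] is row r; indexing it with the one-element array c returns a
# one-element array, so [0] unwraps it to a single value
picked = [int(arr_list[r][c][0]) for r, c in enumerate(array_top)]
```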
