I am new to Python and machine learning, and I couldn't find a good way to do this online. I have a big 2D array (distance_matrix.shape = (47, 1328624)). I wrote the code below, but it takes too long to run; the nested for loops are very slow.

import pandas as pd

distance_matrix = [[0.21218192, 0.12845819, 0.54545613, 0.92464129, 0.12051526, 0.0870853 ], [0.2168166 , 0.11174682, 0.58193855, 0.93949729, 0.08060061, 0.11963891], [0.23996999, 0.17554854, 0.60833433, 0.93914766, 0.11631545, 0.2036373]]

iskeleler = pd.DataFrame({
    'lat':[40.992752,41.083202,41.173462],
    'lon':[29.023165,29.066652,29.088163],
    'name':['Kadıköy','AnadoluHisarı','AnadoluKavağı']
}, dtype=str)

for i in range(len(distance_matrix)):
    for j in range(len(distance_matrix[0])):
        if distance_matrix[i][j] < 1:
            iskeleler.loc[i,'Address'] = distance_matrix[i][j]
        
print(iskeleler)

To explain, I am sharing the first 5 rows of my array and showing my dataframe. [Images: İskeleler dataframe, distance_matrix]

The "İskeleler" dataframe has 47 rows. For each row i, I want to look at all the values in row i of distance_matrix, add up the ones that are less than 1, and store that sum in the 'Address' column of row i of "İskeleler". For example, for the first row of distance_matrix in the picture, I want to compute 0.21218192 + 0.12845819 + 0.54545613 + ... and put the result in the 'Address' column of the first row of the İskeleler dataframe.

My intent is to loop through distance_matrix and find the values that are smaller than 1. The code takes too long. How can I do this faster?

  • please check this: stackoverflow.com/questions/9786102/… Commented Apr 22, 2021 at 9:42
  • Use numpy? You already import it. You also want to give us some code that actually runs. IMHO the use of uninitialized distance_matrix in line 2, 3, and 4 and iskeleler in line 5 and 7 gives an error Commented Apr 22, 2021 at 9:43
  • @ThomasWeller Actually I shared my code so you can understand it. Because I pulled both arrays from the internet. It would be a very long post if I shared the part I initialized with you. The question I'm asking is actually a theoretical question. It takes a lot of time to calculate by putting two for loops inside each other. I can't even see it working because an array of mine is too big (that's why I shared its shape). How can I do without two loops, actually that's my question. Commented Apr 22, 2021 at 10:00
  • You want to set iskeleler.loc equal to the last element less than 1 on each line of distance_matrix? Commented Apr 22, 2021 at 10:20
  • @MarkSetchell I edited my post to answer your question. Commented Apr 22, 2021 at 10:34

1 Answer

I think you mean this:

import numpy as np

# Set up some dummy data in range 0..100
distance = np.random.rand(47,1328624) * 100.0

# Boolean mask of all values < 1
mLessThan1 = distance<1

# Sum elements <1 across rows 
result = np.sum(distance*mLessThan1, axis=1)

That takes 168ms on my Mac.

In [47]: %timeit res = np.sum(distance*mLessThan1, axis=1)
168 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
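
To connect this back to the dataframe in the question, here is a minimal sketch of writing the per-row sums into the 'Address' column in one step. The small `distance` array and its values are made up for illustration (3 rows instead of 47), and `np.where` is used here as an alternative to multiplying by the mask; both should give the same sums for finite data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 3 rows instead of 47, values chosen for illustration
distance = np.array([
    [0.25, 1.5, 0.5],
    [2.0, 0.5, 0.25],
    [3.0, 4.0, 5.0],
])
iskeleler = pd.DataFrame({'name': ['Kadıköy', 'AnadoluHisarı', 'AnadoluKavağı']})

# Keep values < 1, replace everything else with 0, then sum along each row
row_sums = np.where(distance < 1, distance, 0.0).sum(axis=1)

# Assign the whole column at once instead of calling .loc inside a loop
iskeleler['Address'] = row_sums
print(iskeleler)
```

This assumes the dataframe has exactly one row per row of the array, in the same order, as in the question.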

2 Comments

Thanks a lot. It works fine. Sorry if it took me too long to tell you. I am not a native speaker of English and I'm just getting used to python.
No problems - good luck with your project! Avoid for loops with large Numpy arrays. Come back and ask another question if you get stuck - questions (and answers) are free 😀
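
As a rough illustration of the advice about for loops, the sketch below checks that an accumulating nested loop and the vectorized expression from the answer agree on a small random array. The function names `loop_sum` and `vectorized_sum` and the array size are assumptions made for this example only.

```python
import numpy as np

# Small random array in range 0..100 (much smaller than the real 47 x 1328624 data)
rng = np.random.default_rng(0)
distance = rng.random((5, 1000)) * 100.0

def loop_sum(matrix):
    """Sum the values < 1 in each row using explicit Python loops."""
    totals = []
    for row in matrix:
        total = 0.0
        for value in row:
            if value < 1:
                total += value  # accumulate, rather than overwrite
        totals.append(total)
    return np.array(totals)

def vectorized_sum(matrix):
    """Same result using a boolean mask, as in the answer above."""
    return (matrix * (matrix < 1)).sum(axis=1)

# Both approaches should produce the same per-row sums
assert np.allclose(loop_sum(distance), vectorized_sum(distance))
```

On an array this small the loop is still usable; the difference matters at the size described in the question.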