0

I have an array:

test_arr = np.array([ [1.2, 2.1, 2.3, 4.5],
                      [2.6, 6.4, 5.2, 6.2],
                      [7.2, 6.2, 2.5, 1.7],
                      [8.2, 7.6, 4.2, 7.3] ]

Is it possible to obtain a pandas dataframe of the form:

row_id  | row1  | row2          | row3          | row4
row1      0.0     d(row1,row2)    d(row1,row3)    d(row1,row4)
row2      ...     0.0             ...             ...
row3      ...        ...          0.0             ...
row4      ...        ...          0.0             ...

where d(row1, row2) is the Euclidean distance between row1 and row2.

What I am trying now is first generating a list of all pairs of rows, then computing the distance and assigning each element to the dataframe. Is there a better/faster way of doing this?

1

3 Answers 3

2
from scipy import spatial
import numpy as np

test_arr = np.array([ [1.2, 2.1, 2.3, 4.5],
                      [2.6, 6.4, 5.2, 6.2],
                      [7.2, 6.2, 2.5, 1.7],
                      [8.2, 7.6, 4.2, 7.3] ])

dist = spatial.distance.pdist(test_arr)
spatial.distance.squareform(dist)

Result:

array([[0.        , 5.63471383, 7.79037868, 9.52365476],
       [5.63471383, 0.        , 6.98140387, 5.91692488],
       [7.79037868, 6.98140387, 0.        , 6.1       ],
       [9.52365476, 5.91692488, 6.1       , 0.        ]])
Sign up to request clarification or add additional context in comments.

Comments

2
from sklearn.metrics.pairwise import euclidean_distances
pd.DataFrame(euclidean_distances(test_arr, test_arr))

          0         1         2         3
0  0.000000  5.634714  7.790379  9.523655
1  5.634714  0.000000  6.981404  5.916925
2  7.790379  6.981404  0.000000  6.100000
3  9.523655  5.916925  6.100000  0.000000

Comments

1

Using cdist to compute pairwise distances

Place 2D resulting array into Pandas DataFrame

import numpy as np
from scipy.spatial.distance import cdist
import pandas as pd

test_arr = np.array([ [1.2, 2.1, 2.3, 4.5],
                      [2.6, 6.4, 5.2, 6.2],
                      [7.2, 6.2, 2.5, 1.7],
                      [8.2, 7.6, 4.2, 7.3] ])

    # Use cdist to compute pairwise distances
    dist = cdist(test_arr, test_arr)

    # Place into Pandas DataFrame
    # index and names of columns
    names = ['row' + str(i) for i in range(1, dist.shape[0]+1)]
    df = pd.DataFrame(dist, columns = names, index = names)

    print(df)

Output

Pandas DataFrame

        row1      row2      row3      row4
row1  0.000000  5.634714  7.790379  9.523655
row2  5.634714  0.000000  6.981404  5.916925
row3  7.790379  6.981404  0.000000  6.100000
row4  9.523655  5.916925  6.100000  0.000000

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.