Join dataframe with matrix output using pandas

Question

I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using the data from a cell-based intermediate dataframe (matrix_df), as shown below.

There are several files, one per cell number, holding the distance values shown in matrix_df.

The program iterates by cell and fetches data from the appropriate file, so on each iteration matrix_df holds the data for all rows of inp_df that belong to the current cell#.

inp_df
A       B           cell
100     200         1
115     270         1
145     255         2
115     266         1

matrix_df (cell_1.csv)
B           100     115     199     avg_distance
200         7.5     80.7    67.8        52
270         6.8     53      92          50
266         58      84      31          57

matrix_df (cell_2.csv)
B            145    121     166     avg_distance
255          74.9   77.53   8       53.47



out_df dataframe
A       B           cell    distance    avg_distance
100     200         1       7.5         52
115     270         1       53          50
145     255         2       74.9        53.47
115     266         1       84          57

My current thought process for each cell#'s data is:

use an apply function to go row by row
then join inp_df with matrix_df on column B, where matrix_df is somehow translated into tuples of (column name, distance, average distance).

But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside one iteration that fetches the matches, since the number of columns in matrix_df varies from cell to cell.

If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.

NB: In inp_df the value of column B is unique and values of column A may or may not be unique

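Also, the first column header of each matrix_df was empty, since the file is a matrix output without a header for that column. I renamed it with the following code to make it easier to understand.

import pandas as pd

# index_col=False keeps the first column as data instead of using it as the index
dist_df = pd.read_csv(mypath, index_col=False)
dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)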

cs95 · Accepted Answer · 2017-08-02 22:24:55Z

Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge

In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
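To make Step 1 reproducible, here is a minimal sketch that builds the sample frames from the question and runs the same concat and merge. The names matrix_df1 and matrix_df2 follow the answer; the assumption that the distance column labels are strings (as they would be after pd.read_csv) is mine.

```python
import pandas as pd

# Sample data copied from the question.
inp_df = pd.DataFrame({
    "A": [100, 115, 145, 115],
    "B": [200, 270, 255, 266],
    "cell": [1, 1, 2, 1],
})

# Assumption: column labels are strings, as they would be after pd.read_csv.
matrix_df1 = pd.DataFrame({
    "B": [200, 270, 266],
    "100": [7.5, 6.8, 58.0],
    "115": [80.7, 53.0, 84.0],
    "199": [67.8, 92.0, 31.0],
    "avg_distance": [52.0, 50.0, 57.0],
})
matrix_df2 = pd.DataFrame({
    "B": [255],
    "145": [74.9],
    "121": [77.53],
    "166": [8.0],
    "avg_distance": [53.47],
})

# concat aligns on column labels; columns missing from one file become NaN.
stacked = pd.concat([matrix_df1, matrix_df2])

# With no `on` argument, merge joins on the shared column(s), here only B.
merged = stacked.merge(inp_df)
print(merged)
```

Note that the merged frame is wide: it carries every distance column from every cell file, with NaN wherever a column does not belong to that row's cell.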

Step 2: Create the distance column with df.apply by using A's values to index into the correct column

In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
                                          [['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]: 
     A    B  cell  distance  avg_distance
0  100  200     1       7.5         52.00
1  115  270     1      53.0         50.00
2  115  266     1      84.0         57.00
3  145  255     2      74.9         53.47
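The df.apply in Step 2 calls a Python function once per row, which may become the bottleneck with millions of rows. One possible alternative, which is not part of the original answer, is to reshape the matrix to long format with melt and then merge on both A and B. This is a sketch under the same assumptions as the sample data (string column labels, B unique); lookup_distances is a hypothetical helper name.

```python
import pandas as pd

inp_df = pd.DataFrame({
    "A": [100, 115, 145, 115],
    "B": [200, 270, 255, 266],
    "cell": [1, 1, 2, 1],
})
matrix_df1 = pd.DataFrame({
    "B": [200, 270, 266],
    "100": [7.5, 6.8, 58.0],
    "115": [80.7, 53.0, 84.0],
    "199": [67.8, 92.0, 31.0],
    "avg_distance": [52.0, 50.0, 57.0],
})
matrix_df2 = pd.DataFrame({
    "B": [255], "145": [74.9], "121": [77.53], "166": [8.0],
    "avg_distance": [53.47],
})


def lookup_distances(inp, matrix):
    """Hypothetical helper: attach distance and avg_distance via a long-format join."""
    # One row per (B, A) pair; the old column labels land in column "A".
    long_df = matrix.melt(
        id_vars=["B", "avg_distance"], var_name="A", value_name="distance"
    )
    # Drop the NaN cells created by concatenating files with different columns.
    long_df = long_df.dropna(subset=["distance"])
    # Labels are strings; cast them so they compare equal to inp["A"].
    long_df["A"] = long_df["A"].astype(inp["A"].dtype)
    # A left merge keeps the row order of inp.
    merged = inp.merge(long_df, on=["A", "B"], how="left")
    return merged[["A", "B", "cell", "distance", "avg_distance"]]


out_df = lookup_distances(inp_df, pd.concat([matrix_df1, matrix_df2]))
print(out_df)
```

The trade-off is memory: the long frame has one row per matrix cell, so for large matrices it may be worth melting one cell file at a time rather than the full concatenation.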

3 Comments

stormfield Over a year ago

I would need to process the data cell by cell, since I thought I could multiprocess the read part to speed things up. So I came up with a variation of your approach.

stormfield Over a year ago

pastebin link. FYI, I know that is not how multiprocessing works, but I wanted to illustrate a similar approach. @cᴏʟᴅsᴘᴇᴇᴅ: Thanks for your answer. Please let me know if this approach is also fine.

cs95 Over a year ago

@stormfield Breaking up the merge operation will probably be slower, because pandas has a lot of internal speed optimizations that you may forgo by splitting the work. I'm not sure, but it might be slower. Feel free to mark this as accepted if it helped.
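If per-cell processing is still needed, a rough sketch of the core logic inside one iteration could use NumPy positional indexing instead of a row-wise apply. This is not the code from the pastebin link; distances_for_cell is a hypothetical helper, and the matrices_by_cell dict is a stand-in for reading each cell's CSV file. Whether this beats a single merge would have to be measured.

```python
import pandas as pd

inp_df = pd.DataFrame({
    "A": [100, 115, 145, 115],
    "B": [200, 270, 255, 266],
    "cell": [1, 1, 2, 1],
})
# Stand-in for reading one CSV per cell; labels are strings as after read_csv.
matrices_by_cell = {
    1: pd.DataFrame({
        "B": [200, 270, 266],
        "100": [7.5, 6.8, 58.0],
        "115": [80.7, 53.0, 84.0],
        "199": [67.8, 92.0, 31.0],
        "avg_distance": [52.0, 50.0, 57.0],
    }),
    2: pd.DataFrame({
        "B": [255], "145": [74.9], "121": [77.53], "166": [8.0],
        "avg_distance": [53.47],
    }),
}


def distances_for_cell(group, matrix):
    """Hypothetical helper: look up (B, A) pairs by position in one cell's matrix."""
    dist = matrix.set_index("B")
    # Positions of each input row's B in the index and A in the column labels.
    rows = dist.index.get_indexer(group["B"])
    cols = dist.columns.get_indexer(group["A"].astype(str))
    if (rows < 0).any() or (cols < 0).any():
        raise KeyError("some (A, B) pairs are missing from this cell's matrix")
    values = dist.to_numpy()
    return group.assign(
        distance=values[rows, cols],
        avg_distance=dist["avg_distance"].to_numpy()[rows],
    )


pieces = [
    distances_for_cell(group, matrices_by_cell[cell])
    for cell, group in inp_df.groupby("cell")
]
# Each group keeps its original index, so sort_index restores the input order.
out_df = pd.concat(pieces).sort_index()
print(out_df)
```

Because each call only touches one cell's matrix, the per-cell step could in principle be handed to separate worker processes, at the cost of the overhead the comment above warns about.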
