
I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df), as shown below.

There are several files, one per cell number, each holding distance values in the layout shown for matrix_df.

The program iterates by cell and fetches data from the appropriate file, so on each iteration matrix_df holds the data for all rows of inp_df that belong to the current cell.

inp_df
A       B           cell
100     200         1
115     270         1
145     255         2
115     266         1

matrix_df (cell_1.csv)
B           100     115     199     avg_distance
200         7.5     80.7    67.8        52
270         6.8     53      92          50
266         58      84      31          57

matrix_df (cell_2.csv)
B            145    121     166     avg_distance
255          74.9   77.53   8       53.47



out_df dataframe
A       B           cell    distance    avg_distance
100     200         1       7.5         52
115     270         1       53          50
145     255         2       74.9        53.47
115     266         1       84          57

My current thought process for each cell's data is:

  1. use an apply function to go row by row
  2. then join inp_df with matrix_df on column B, where matrix_df is somehow translated into tuples of (column name, distance, average distance).

But I am looking for a pandas-idiomatic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration that fetches the matches, since the number of columns in matrix_df varies from cell to cell.

If it is any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.

NB: In inp_df the values of column B are unique; the values of column A may or may not be unique.

Also, the first column of each matrix_df had no header because the matrix output file was written without one, so I renamed it with the following code to make it easier to understand.

dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)

1 Answer


Step 1: Concatenate your matrix dataframes with pd.concat and merge the result with inp_df using df.merge (with no `on` argument, the merge joins on the only shared column, B)

In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)

Step 2: Create the distance column with df.apply, using each row's A value (converted to the matching string column label) to index into the correct column

In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
                                          [['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]: 
     A    B  cell  distance  avg_distance
0  100  200     1       7.5         52.00
1  115  270     1      53.0         50.00
2  115  266     1      84.0         57.00
3  145  255     2      74.9         53.47
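If the row-wise apply in Step 2 becomes a bottleneck on millions of rows, one possible alternative is to reshape the matrix to long form and let a single merge do the lookup. The following is only a sketch under assumptions: the sample frames are rebuilt by hand from the question, the matrix column labels are assumed to be strings (as they would be after pd.read_csv), and lookup_distances is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (assumption: the distance
# columns carry string labels, as they would after pd.read_csv).
inp_df = pd.DataFrame({"A": [100, 115, 145, 115],
                       "B": [200, 270, 255, 266],
                       "cell": [1, 1, 2, 1]})
matrix_df1 = pd.DataFrame({"B": [200, 270, 266],
                           "100": [7.5, 6.8, 58.0],
                           "115": [80.7, 53.0, 84.0],
                           "199": [67.8, 92.0, 31.0],
                           "avg_distance": [52.0, 50.0, 57.0]})
matrix_df2 = pd.DataFrame({"B": [255],
                           "145": [74.9],
                           "121": [77.53],
                           "166": [8.0],
                           "avg_distance": [53.47]})


def lookup_distances(inp, matrix):
    """Hypothetical helper: vectorized (A, B) -> distance lookup."""
    # Wide to long: one row per (B, A) pair with its distance.
    long_df = matrix.melt(id_vars=["B", "avg_distance"],
                          var_name="A", value_name="distance")
    # Drop the empty pairs that pd.concat introduced for other cells' columns.
    long_df = long_df.dropna(subset=["distance"])
    # Column labels are strings; cast them to match inp["A"] before merging.
    long_df["A"] = long_df["A"].astype(inp["A"].dtype)
    merged = inp.merge(long_df, on=["A", "B"], how="left")
    return merged[["A", "B", "cell", "distance", "avg_distance"]]


out_df = lookup_distances(inp_df, pd.concat([matrix_df1, matrix_df2]))
print(out_df)
```

Because the merge is a left join on inp_df, the rows come back in the order of inp_df, and any (A, B) pair without a match would get NaN in distance and avg_distance rather than raising.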

3 Comments

I would need to process cell by cell, since I thought I could multiprocess the read part to speed things up. So I came up with a variation of your approach:
pastebin link. FYI, I know that is not how multiprocessing works, but I wanted to illustrate a similar approach. @cᴏʟᴅsᴘᴇᴇᴅ: Thanks for your answer; please let me know if this approach is also fine.
@stormfield Breaking up the merge operation will probably be slower, because pandas has a lot of internal speed-ups that you may forgo. I'm not sure, but it might be slower. Feel free to mark this as accepted if it helped.
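For reference, the cell-by-cell variant discussed in these comments could look roughly like the sketch below. It is not the code from the pastebin link; process_cell and the matrices dict are hypothetical stand-ins (the dict replaces reading each cell's CSV file), the sample values are copied from the question, and the matrix column labels are assumed to be strings. Whether this is faster than one large merge would need to be measured.

```python
import pandas as pd

inp_df = pd.DataFrame({"A": [100, 115, 145, 115],
                       "B": [200, 270, 255, 266],
                       "cell": [1, 1, 2, 1]})

# Hypothetical stand-in for reading one matrix file per cell.
matrices = {
    1: pd.DataFrame({"B": [200, 270, 266],
                     "100": [7.5, 6.8, 58.0],
                     "115": [80.7, 53.0, 84.0],
                     "199": [67.8, 92.0, 31.0],
                     "avg_distance": [52.0, 50.0, 57.0]}),
    2: pd.DataFrame({"B": [255],
                     "145": [74.9],
                     "121": [77.53],
                     "166": [8.0],
                     "avg_distance": [53.47]}),
}


def process_cell(cell_inp, matrix):
    """Hypothetical helper: positional lookup for the rows of one cell."""
    m = matrix.set_index("B")
    # Positions of each input row's B value and A label in the matrix.
    rows = m.index.get_indexer(cell_inp["B"])
    cols = m.columns.get_indexer(cell_inp["A"].astype(str))
    # get_indexer returns -1 for missing labels; fail loudly instead of
    # silently reading the last row or column.
    if (rows < 0).any() or (cols < 0).any():
        raise KeyError("some (A, B) pairs are missing from the matrix")
    values = m.to_numpy()
    return cell_inp.assign(distance=values[rows, cols],
                           avg_distance=m["avg_distance"].to_numpy()[rows])


# One lookup per cell, then stitch the pieces back together.
parts = [process_cell(group, matrices[cell])
         for cell, group in inp_df.groupby("cell")]
out_df = pd.concat(parts, ignore_index=True)
print(out_df)
```

Each call to process_cell depends only on its own cell's rows and matrix, so the calls could in principle be handed to separate worker processes; the rows come back grouped by cell rather than in the original inp_df order.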
