Compare and match values from two df and multiple columns

Question

I have two dataframes with data about popular stores and the districts where they are located. Each store is a chain and may have more than one district id (for example, "Store1" has several stores in different places).

The first df has the top-5 most popular stores, with district ids separated by semicolons, for example:

store_name | district_id
Store1     | 1;2;3;4;5
Store2     | 1;2
Store3     | 3
Store4     | 4;7;10;15
Store5     | 12;15;

The second df has only two columns and covers ALL districts in the city; each row is a unique district id and its name.


district_id | district_name
1           | District1
2           | District2
3           | District3
4           | District4
5           | District5
6           | District6
7           | District7
8           | District8
9           | District9
10          | District10
etc.

The goal is to create columns in df1 for every store in the top-5 and match every district id to its district name.

So first I split df1 into a form like this:

store_name district_id 0   1   2   3   4   5 
Store1    |    1     | 2 | 3 | 4 | 5
Store2    |    1     | 2 |   |   |  
Store3    |    3     |   |   |   |
Store4    |    4     | 7 | 10| 15| 
Store5    |    12    | 15|

But now I'm stuck and don't know how to match each value from df1 to df2 and get the district name for each id. Empty cells are None, because the number of columns was set by the store with the most district ids.

I would like to get df like this:

store_name district_name district_name2 district_name3 district_name4 district_name5 
Store1     | District1   | District2   | District3   | District4     | District5
Store2     | District1   | District2   |             |               |   
Store3     | District3   |             |             |               |
Store4     | District4   | District7   | District10  | District15    | 
Store5     | District12  | District15  |             |               |

Thanks in advance!

Just for clarification: in other words, I'm looking for a way to replace each district id in df1 with the district name corresponding to that id in df2.
– elmd, Commented Aug 12, 2021 at 18:03

ThePyGuy · Accepted Answer · 2021-08-12 18:26:00Z

1

You can stack the first dataframe, convert it to float type, map it through the district_name column of the second dataframe, then unstack and finally add_prefix:

df1.stack().astype(float).map(df2['district_name']).unstack().add_prefix('district_name')

OUTPUT:

           district_name0 district_name1  ... district_name3 district_name4
store_name                                ...                              
Store1          District1      District2  ...      District4      District5
Store2          District1      District2  ...            NaN            NaN
Store3          District3            NaN  ...            NaN            NaN
Store4          District4      District7  ...            NaN            NaN
Store5                NaN            NaN  ...            NaN            NaN

The dataframes used for above code:

>>> df1
             0    1    2    3    4
store_name                        
Store1       1    2    3    4    5
Store2       1    2  NaN  NaN  NaN
Store3       3  NaN  NaN  NaN  NaN
Store4       4    7   10   15  NaN
Store5      12   15  NaN  NaN  NaN

>>> df2
            district_name
district_id              
1               District1
2               District2
3               District3
4               District4
5               District5
6               District6
7               District7
8               District8
9               District9
10             District10
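
For reference, here is a self-contained sketch of the same stack / map / unstack chain. The way df1 and df2 are constructed below is an assumption made to reproduce the frames printed above (store_name and district_id as indexes); only the final chain comes from the answer.

```python
import numpy as np
import pandas as pd

# Assumed construction of the wide id frame shown above (store_name is the index).
df1 = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [1, 2, np.nan, np.nan, np.nan],
     [3, np.nan, np.nan, np.nan, np.nan],
     [4, 7, 10, 15, np.nan],
     [12, 15, np.nan, np.nan, np.nan]],
    index=pd.Index(["Store1", "Store2", "Store3", "Store4", "Store5"], name="store_name"),
)

# Assumed construction of the lookup frame (district_id is the index).
df2 = pd.DataFrame(
    {"district_name": [f"District{i}" for i in range(1, 11)]},
    index=pd.Index(range(1, 11), name="district_id"),
)

result = (
    df1.stack()                     # long Series indexed by (store_name, column)
       .astype(float)               # make sure the ids are numeric
       .map(df2["district_name"])   # look each id up in df2's index
       .unstack()                   # back to one row per store
       .add_prefix("district_name")
)
print(result)
```

Ids that do not appear in df2 (12 and 15 here) come back as NaN, which is why the Store5 row is empty in the output above.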

answered Aug 12, 2021 at 18:26

ThePyGuy


5 Comments

elmd Over a year ago

Thanks, it sounds great, but for some reason I get an error "Index contains duplicate entries, cannot reshape" while trying to unstack. Store_name in df1 and district_id in df2 are indexes like you wrote, but still an error((

ThePyGuy Over a year ago

Can you try df1.stack().astype(float).map(df2['district_name']).unstack(0).T once

elmd Over a year ago

I've tried a few more times and now it works both ways (with your first code and with the transpose too), but now there is another problem: it shows all NaN. Here is a screenshot.

ThePyGuy Over a year ago

Well, that might be because district_id is string type. map won't work if the types don't match. Try converting it to float as well: df1.stack().astype(float).map(df2.rename(index=float)['district_name']).unstack()

elmd Over a year ago

You're right, it was string type from the start because the ids were merged like this 1;2;3, and after splitting they remained strings. After converting it works fine, thank you very much!
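
The all-NaN result discussed in these comments can be reproduced in isolation. The sketch below uses made-up two-row data (the names `names`, `ids_as_text` and `ids_as_numbers` are illustrative, not from the thread) to show that map finds no matches when the keys are strings and the lookup index is numeric, and that converting the keys first fixes it.

```python
import pandas as pd

# Lookup Series with an integer index, like df2['district_name'] in the answer.
names = pd.Series(
    ["District1", "District2"],
    index=pd.Index([1, 2], name="district_id"),
)

# After splitting "1;2" on ";", the pieces are still strings.
ids_as_text = pd.Series(["1", "2"])
all_missing = ids_as_text.map(names).isna().all()
print(all_missing)  # the string "1" does not match the integer label 1

# Converting the keys to numbers makes the lookup succeed.
ids_as_numbers = pd.to_numeric(ids_as_text)
mapped = ids_as_numbers.map(names)
print(mapped.tolist())
```

Checking `df1.dtypes` and `df2.index.dtype` before mapping is a quick way to spot this kind of mismatch.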

db702 · Accepted Answer · 2021-08-12 18:09:02Z

0

There are many possible ways to do this; this is just one. Assume you have your two dataframes stored as df1 and df2.

First, normalize the district_id column in df1 so that every string splits into the same number of parts:

# make all strings the same size when split
def return_full_string(text):
    l = len(text.split(';'))
    for _ in range(5 - l):
        text = f"{text};"
    return text

df1['district_id'] = df1.district_id.apply(return_full_string)

Then split the text column into separate columns and delete the original:

# split district id's into different columns
district_columns = [f"district_name{n+1}" for n in range(5)]
df1[district_columns] = list(df1.district_id.str.split(';'))
# drop() removes rows by default, so name the column explicitly
df1.drop(columns='district_id', inplace=True)

Then acquire a map of the ids in df2 to their respective names, and use that to replace the values in your new columns:

id_to_name = {str(ii): nn for ii, nn in zip(df2['district_id'], df2['district_name'])}
for col in district_columns:
    df1[col] = df1[col].apply(id_to_name.get)

Like I said, I'm sure there are other ways to do this, but this should work.
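
As one such alternative, here is a shorter sketch of the same idea. It swaps the manual padding for `str.split(..., expand=True)`, which pads shorter rows with missing values on its own. The sample frames are rebuilt from the question, and the variable names `parts`, `names` and `result` are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame({
    "store_name": ["Store1", "Store2", "Store3", "Store4", "Store5"],
    "district_id": ["1;2;3;4;5", "1;2", "3", "4;7;10;15", "12;15;"],
})
df2 = pd.DataFrame({
    "district_id": range(1, 11),
    "district_name": [f"District{i}" for i in range(1, 11)],
})

# Keys are strings because the split pieces are strings.
id_to_name = dict(zip(df2["district_id"].astype(str), df2["district_name"]))

# One column per position; rows with fewer ids are padded with missing values.
parts = df1["district_id"].str.split(";", expand=True)

# Replace every id with its name; unknown ids and empty pieces become NaN.
names = parts.apply(lambda col: col.map(id_to_name))
names.columns = [f"district_name{n + 1}" for n in range(names.shape[1])]

result = pd.concat([df1[["store_name"]], names], axis=1)
print(result)
```

Unlike the version above, this leaves df1 unchanged and does not hard-code the number of columns.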

answered Aug 12, 2021 at 18:09

db702


1 Comment

elmd Over a year ago

Thanks for your time and answer! I need more time to try your code, it looks a bit complicated to me, but I'll try.

Tanzin Farhat · Accepted Answer · 2021-08-12 18:54:36Z

0

import pandas as pd

df1 = pd.DataFrame(data={'store_name': ['store1', 'store2', 'store3', 'store4', 'store5'],
                         'district_id': [[1, 2, 3, 4, 5], [1, 2], 3, [4, 7, 10], [8, 10]]})
df2 = pd.DataFrame(data={'district_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'district_name': ['District1', 'District2', 'District3', 'District4', 'District5',
                                           'District6', 'District7', 'District8', 'District9', 'District10']})

Step 1: use explode() to split the values into rows

df3 = df1.explode('district_id').reset_index(drop=True)

Step 2: use merge() with on='district_id'

df4 = pd.merge(df3, df2, on='district_id')

Step 3: use groupby() and agg() to get columns of lists, one row per store

# group by store_name (not district_name) so each store's districts are collected together
df5 = df4.groupby('store_name').agg(list).reset_index()
    store_name  district_id                       district_name
0   store1  [1, 2, 3, 4, 5]   [District1,District2,District3,District4,District5]
1   store2  [1, 2]            [District1,District2]
2   store3  [3]               [District3]
3   store4  [4, 7, 10]        [District4,District7,District10]
4   store5  [10, 8]           [District10,District8]

Then it can be split however required.
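
One possible way to do that split, sketched under the assumption that df5 has one list of names per store as in the output above. The small df5 below is made-up stand-in data, and `wide` is an illustrative name.

```python
import pandas as pd

# Stand-in for df5: one list of district names per store (assumed shape).
df5 = pd.DataFrame({
    "store_name": ["store1", "store2", "store3"],
    "district_name": [["District1", "District2"], ["District1"], ["District3"]],
})

# Building a DataFrame from the lists pads the shorter ones with missing values.
wide = pd.DataFrame(df5["district_name"].tolist(), index=df5["store_name"])
wide.columns = [f"district_name{n + 1}" for n in range(wide.shape[1])]
print(wide)
```

The number of columns follows the longest list, so nothing needs to be hard-coded.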

answered Aug 12, 2021 at 18:54

Tanzin Farhat


2 Comments

elmd Over a year ago

Awesome, thanks! Up to step 3 it works great. Step 3 looks a bit like a throwback to the beginning, but I understand your point: from this step the data is ready for further work in different ways.

Tanzin Farhat Over a year ago

Exactly, I just tried to get the column into a shape from which it can be split, assuming you already have a way to split columns using functions like tolist() and add_prefix().

MDR · Accepted Answer · 2021-08-12 19:05:57Z

I'd suggest something like the below, and then pivoting etc. as required, since having a column with strings like 1;2;3;4;5 in it is going to be awkward (I feel).

import pandas as pd

df1 = pd.DataFrame({'store_name': {0: 'Store1',
  1: 'Store2',
  2: 'Store3',
  3: 'Store4',
  4: 'Store5'},
 'district_id': {0: '1;2;3;4;5',
  1: '1;2',
  2: '3',
  3: '4;7;10;15',
  4: '12;15;'}})

df3 = pd.DataFrame({'district_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
 'district_name': {0: 'District1',
  1: ' District2',
  2: ' District3',
  3: ' District4',
  4: ' District5',
  5: ' District6',
  6: ' District7',
  7: ' District8',
  8: ' District9',
  9: ' District10'}})

# 'explode' the 'district_id' column with strings like '1;2;3;4;5' in df1
df2 = pd.DataFrame(df1.district_id.str.split(';').tolist(), index=df1.store_name).stack()
df2 = df2.reset_index()[[0, 'store_name']]
df2.columns = ['district_id', 'store_name']
df2 = df2[~df2['district_id'].eq('')] 
df2['district_id'] = df2['district_id'].astype(int)

'''df2 Shows:

    district_id     store_name
0   1               Store1
1   2               Store1
2   3               Store1
3   4               Store1
4   5               Store1
etc.
'''

df4 = pd.merge(df2, df3, on='district_id', how='left')

print(df4)

    district_id store_name district_name
0             1     Store1     District1
1             2     Store1     District2
2             3     Store1     District3
3             4     Store1     District4
4             5     Store1     District5
5             1     Store2     District1
6             2     Store2     District2
7             3     Store3     District3
8             4     Store4     District4
9             7     Store4     District7
10           10     Store4    District10
11           15     Store4           NaN
12           12     Store5           NaN
13           15     Store5           NaN

# From here you can pivot df4 etc. and carry on as required.
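
A possible pivot step, sketched on a small made-up stand-in for df4 rather than the full frame above. The helper column `slot` and the name `wide` are illustrative and not part of the answer.

```python
import pandas as pd

# Stand-in for df4: long format, one row per store/district pair.
df4 = pd.DataFrame({
    "district_id": [1, 2, 1, 3],
    "store_name": ["Store1", "Store1", "Store2", "Store3"],
    "district_name": ["District1", "District2", "District1", "District3"],
})

# Number each store's districts 1, 2, 3, ... to serve as column positions.
df4["slot"] = df4.groupby("store_name").cumcount() + 1

# One row per store, one column per position; missing positions become NaN.
wide = df4.pivot(index="store_name", columns="slot", values="district_name")
wide = wide.add_prefix("district_name")
print(wide)
```

The counter is needed because pivot requires each (store, column) pair to be unique, and district_id on its own would give one column per district rather than per position.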

That's a great option, thanks! It works and provides even more options to work with than I need)
