Create new dataframe using "VLOOKUP" between two dataframes

Question

Somewhat similar to Excel's VLOOKUP function, I want to use a value in one dataframe (portfolios below) to find an associated value in a second dataframe (returns below) and populate a third dataframe (let's call this dataframe3 for now) with the returned values. I have found several posts based on left merges and map, but my two original dataframes have different structures, so these methods don't seem to fit (to me, at least).

I haven't made much progress, but here is the code I have:

Code

import pandas as pd

portfolios = pd.read_csv('portstst5_1.csv')
returns = pd.read_csv('Example_Returns.csv')

total_cols = len(portfolios.columns)
headers = list(portfolios)

concat = returns['PERMNO'].map(str) + returns['FROMDATE'].map(str)
idx = 2
returns.insert(loc=idx, column="concat", value=concat)

for i in range(total_cols):
    col_len = portfolios.iloc[:,i].count()
    for j in range(col_len):
        print(portfolios.iat[j,i].astype('int').astype('str') + headers[i])

Data

This code will make a little more sense if I first describe my data: portfolios is a dataframe with 13 columns of varying lengths. The column headers are dates in YYYYMMDD format. Below each date header are identifiers which are five digit numeric codes. A snippet of portfolios looks like this (some elements in some columns contain NaN):

    20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0    93044.0   93044.0   13264.0   13264.0   89169.0   82486.0   91274.0   
1    79702.0   91515.0   90710.0   81148.0   47387.0   88359.0   93353.0   
2    85751.0   85724.0   88810.0   11513.0   85576.0   47387.0   85576.0

The returns dataframe originally consists of three columns and 799 rows and looks like this (all elements are populated with values):

     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766

Desired Output

I would like to make a third dataframe that is structured identically to portfolios. That is, it will have the same column header dates and the same number of rows in each column as does portfolios, but instead of identifiers, it will contain the MORET for the appropriate identifier/date combination. This is the reason for the concatenations in my code above - I am trying (perhaps unnecessarily) to create unique lookup values so I can communicate between portfolios and returns. For example, to populate dataframe3[0,0], I would look for the concatenated values from portfolios[0,0] and headers[0] (i.e. 9304420131231) in returns['concat'] and return the associated value in returns['MORET'] (i.e. -0.022304). I am stuck here on how to use the concatenated values to return my desired data.

Any thoughts are greatly appreciated.

DJK · Accepted Answer · 2017-12-30 02:39:43Z

1

IIUC:

Use melt so that we can merge values from returns on the desired columns, then use pivot to reshape the data back, as seen below.

portfolios.columns = portfolios.columns.astype(int)
newdf = (portfolios.reset_index()
         .melt(id_vars='index', var_name='FROMDATE', value_name='PERMNO')
         .merge(returns, on=['FROMDATE', 'PERMNO'], how='left')
         .pivot(index='index', columns='FROMDATE', values='MORET'))

Which returns the DataFrame below

FROMDATE  20130630  20130731  20130831  20130930  20131031  20131130  20131231
index
0              NaN       NaN       NaN       NaN       NaN       NaN -0.022304
1              NaN       NaN       NaN       NaN       NaN       NaN  0.012283
2              NaN       NaN       NaN       NaN       NaN       NaN -0.016453

Sort columns

newdf.loc[:,newdf.columns.sort_values(ascending=False)]
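
If the goal is to match the column order of portfolios exactly, rather than just sorting in descending order, reindexing on portfolios.columns is one option. A minimal sketch, assuming the column labels have already been converted to integers as above; the toy frames reuse a few values from the question and are not the asker's full data:

```python
import pandas as pd

# Toy stand-ins for the real data (values taken from the question's snippets).
portfolios = pd.DataFrame({20131231: [93044.0, 79702.0, 85751.0],
                           20131130: [93044.0, 91515.0, 85724.0]})
returns = pd.DataFrame({"PERMNO": [93044, 79702, 85751, 85576],
                        "FROMDATE": [20131231] * 4,
                        "MORET": [-0.022304, 0.012283, -0.016453, 0.038766]})

newdf = (portfolios.reset_index()
         .melt(id_vars="index", var_name="FROMDATE", value_name="PERMNO")
         .merge(returns, on=["FROMDATE", "PERMNO"], how="left")
         .pivot(index="index", columns="FROMDATE", values="MORET"))

# pivot returns the columns in ascending order; reindex restores portfolios' order.
newdf = newdf.reindex(columns=portfolios.columns)
print(newdf)
```

This also works when the original headers are not strictly sorted, since the order is copied rather than recomputed.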

edited Dec 30, 2017 at 2:39

answered Dec 29, 2017 at 1:54

DJK


7 Comments

CodingNewb Over a year ago

When trying this, I am receiving a ValueError: Cannot convert non-finite values (NA or inf) to integer

DJK Over a year ago

Do you have NaN values in your original dataframe? You could try replacing .astype(int) with .astype(float). @CodingNewb

DJK Over a year ago

@CodingNewb, I was using astype(int) because the column names were stored as strings, I edited the post to change the column names before the merge, and removed .astype(int) from the main line, let me know if this works

CodingNewb Over a year ago

This is fantastic, and I've learned a lot breaking this into parts. One final question: Is it possible to sort the output with the columns in reverse (descending) order? My portfolios file that I read in is sorted in descending order, but the output of the pivot table is in reverse. Thank you.

DJK Over a year ago

@CodingNewb, use .loc instead of iloc, sorry about that


FatihAkici · Accepted Answer · 2017-12-30 02:35:37Z

1

What you are trying to do is much simpler than how you tried doing it. You can first melt portfolios to flip it and collect all the date columns as rows in a single column, then join it with returns, and finally pivot to get the desired result. This is basically what @djk47463 did in one compound line, and my edited answer serves as a step-by-step breakdown of his.

Let's create your DataFrames to make the answer reproducible.

import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# Create df
rawText = StringIO("""
     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766
4     93044  20131010 -0.02
5     79702  20131010  0.01
6     85751  20131010 -0.01
7     85576  20131010  0.03
""")
returns = pd.read_csv(rawText, sep=r"\s+")
portfolios = pd.DataFrame({'20131010':[93044, 85751],
                       '20131231':[85576, 79702]})

Notice, the FROMDATE column of returns consists of numbers, but in portfolios the date columns are strings. We must make them consistent:

returns['FROMDATE'] = returns['FROMDATE'].astype(str)
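
A quick way to confirm the mismatch before merging is to compare the key types directly. A small sketch on cut-down versions of the frames above (the keys_match name is only for illustration):

```python
import pandas as pd

returns = pd.DataFrame({"PERMNO": [93044, 79702],
                        "FROMDATE": [20131231, 20131010],
                        "MORET": [-0.022304, 0.01]})
portfolios = pd.DataFrame({"20131010": [93044, 85751],
                           "20131231": [85576, 79702]})

# FROMDATE holds integers, while the portfolios headers are strings.
print(returns["FROMDATE"].dtype)
print(portfolios.columns.dtype)

# Cast the lookup key to str so it can match the melted column headers.
returns["FROMDATE"] = returns["FROMDATE"].astype(str)

# Every FROMDATE should now appear among the portfolios headers.
keys_match = set(returns["FROMDATE"]) <= set(portfolios.columns)
print(keys_match)
```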

Let's start the solution by melting (i.e. unpivoting) portfolios into a new frame, pm:

pm = portfolios.melt(var_name='FROMDATE', value_name='PERMNO')
# pm:
   FROMDATE  PERMNO
0  20131010   93044
1  20131010   85751
2  20131231   85576
3  20131231   79702

Now you want to hold this pm constant, and merge returns onto its rows wherever their PERMNOs and FROMDATEs match:

merged = pm.merge(returns, how='left', on=['PERMNO', 'FROMDATE'])
# merged: 
   FROMDATE  PERMNO     MORET
0  20131010   93044 -0.020000
1  20131010   85751 -0.010000
2  20131231   85576  0.038766
3  20131231   79702  0.012283

Remember we had melted (unpivoted) the portfolios at the beginning? We should pivot this result to give it the shape of portfolios:

final = merged.pivot(index='PERMNO', columns='FROMDATE', values='MORET').reset_index()
# final: 
FROMDATE  PERMNO  20131010  20131231
0          79702       NaN  0.012283
1          85576       NaN  0.038766
2          85751     -0.01       NaN
3          93044     -0.02       NaN
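
Note that pivoting on PERMNO indexes the result by identifier rather than by row position. If the output needs to line up cell for cell with portfolios, one option is to carry the original row number through the melt, as DJK's answer does with reset_index. A sketch under that assumption, rebuilding the sample frames so it runs on its own (same_shape is an illustrative name):

```python
import pandas as pd

returns = pd.DataFrame({
    "PERMNO":   [93044, 79702, 85751, 85576, 93044, 79702, 85751, 85576],
    "FROMDATE": ["20131231"] * 4 + ["20131010"] * 4,
    "MORET":    [-0.022304, 0.012283, -0.016453, 0.038766,
                 -0.02, 0.01, -0.01, 0.03],
})
portfolios = pd.DataFrame({"20131010": [93044, 85751],
                           "20131231": [85576, 79702]})

# Keep each identifier's original row number in a column named "index".
pm = portfolios.reset_index().melt(id_vars="index", var_name="FROMDATE",
                                   value_name="PERMNO")
merged = pm.merge(returns, how="left", on=["PERMNO", "FROMDATE"])

# Pivot on the row number instead of PERMNO to recover the layout of portfolios.
same_shape = merged.pivot(index="index", columns="FROMDATE", values="MORET")
print(same_shape)
```

Each cell of same_shape then holds the MORET for the identifier at the same row and date column in portfolios.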

edited Dec 30, 2017 at 2:35

answered Dec 28, 2017 at 22:50

FatihAkici


5 Comments

CodingNewb Over a year ago

Thank you for this; the pivot table is very helpful. However, when I do the intersection, new_df is an Empty DataFrame. One thought I had on why it is empty: The column headers in portfolios are in descending order (20131231, 20131130,etc), but the column headers in new_df (prior to the intersection) are in ascending order. However, I can't figure out how to reverse the column order in the pivot table. Could this be the reason for the Empty DataFrame?

FatihAkici Over a year ago

Not that, but it could be that the column names of one are strings and those of the other are numbers. It should be as easy as fixing how common is built. Could you give me the result of portfolio.columns and new_df.columns? If they are large, please give me 5 of each.

FatihAkici Over a year ago

Pretty sure that is the reason. Can you please let me know if this solves it: common = [e for e in new_df.columns if str(e) in portfolio.columns] and then you'll again do: new_df = new_df[common].

CodingNewb Over a year ago

Yes - that was the reason. portfolios.columns contains strings: Index(['20131231', '20131130', '20131031', '20130930', '20130831', and new_df.columns contains ints: Int64Index([20120131, 20120229, 20120331, 20120430, 20120531,... The code now returns new_df with the correct number of columns, but the columns contain MORET for every PERMNO with a MORET whether or not the PERMNO is present in the associated month from portfolios. That is, while every PERMNO in returns has a MORET in each FROMDATE, not every PERMNO exists in every column of portfolios.

FatihAkici Over a year ago

@CodingNewb I see your point. I overlooked this final step, it will be better if you don't immediately pivot the returns, but first melt portfolios, then join with returns, then pivot; which is basically what djk47463 did. I'll edit my answer to take care of the full process in small pieces, which will serve as a breakdown and step-by-step explanation, but as the owner of the first complete answer, djk47463 deserves the checkmark. Thank you for the good question!

Jacob H · Accepted Answer · 2017-12-28 21:35:26Z

0

The typical way to do a vlookup in python is to create a series with what would be your left column in the index, and then slice that series by the lookup value. The NaNs complicate it a little. We'll make a series from returns by using the set_index method to set PERMNO as the index for the dataframe, and then slicing by the column name to isolate the MORET column as a series.

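import numpy as np

# Series of MORET indexed by PERMNO (keyed on PERMNO only, not on date)
lookupseries = returns.set_index('PERMNO')['MORET']

def lookup(x):
    try:
        return lookupseries[x]
    except (KeyError, TypeError):
        # NaN or unknown identifier
        return np.nan

newdf = portfolios.copy()
for c in newdf.columns:
    newdf[c] = newdf[c].apply(lookup)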
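
Since the lookup above is keyed on PERMNO alone, it fits data with a single return per identifier. When returns holds several dates per PERMNO, the key needs to include the date as well. A rough sketch using a plain dict keyed on (PERMNO, FROMDATE) pairs; the 20131130 returns are invented for illustration, and integer column labels are assumed:

```python
import numpy as np
import pandas as pd

returns = pd.DataFrame({
    "PERMNO":   [93044, 79702, 93044, 79702],
    "FROMDATE": [20131231, 20131231, 20131130, 20131130],
    "MORET":    [-0.022304, 0.012283, 0.05, -0.01],  # 20131130 rows are made up
})
portfolios = pd.DataFrame({20131231: [93044.0, 79702.0],
                           20131130: [79702.0, np.nan]})

# Map each (PERMNO, FROMDATE) pair to its MORET, like a two-key VLOOKUP table.
lookup = dict(zip(zip(returns["PERMNO"].tolist(), returns["FROMDATE"].tolist()),
                  returns["MORET"].tolist()))

newdf = portfolios.copy()
for date in portfolios.columns:
    # Pair each identifier with its column's date; NaN or unknown pairs give NaN.
    newdf[date] = [lookup.get((permno, date), np.nan)
                   for permno in portfolios[date].tolist()]
print(newdf)
```

The float identifiers in portfolios still find the integer keys because equal numbers hash alike in Python, so 93044.0 matches 93044.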

answered Dec 28, 2017 at 21:35

Jacob H
