Somewhat like Excel's VLOOKUP function, I want to use a value in one dataframe (portfolios below) to find an associated value in a second dataframe (returns below) and populate a third dataframe (call it dataframe3 for now) with the returned values. I have found several posts based on left merges and map, but my two original dataframes have different structures, so those methods don't seem to fit (to me, at least).

I haven't made much progress, but here is the code I have:

Code

import pandas as pd

portfolios = pd.read_csv('portstst5_1.csv')
returns = pd.read_csv('Example_Returns.csv')

total_cols = len(portfolios.columns)
headers = list(portfolios)

concat = returns['PERMNO'].map(str) + returns['FROMDATE'].map(str)
idx = 2
returns.insert(loc=idx, column="concat", value=concat)

for i in range(total_cols):
    col_len = portfolios.iloc[:,i].count()
    for j in range(col_len):
        print(portfolios.iat[j,i].astype('int').astype('str') + headers[i])

Data

This code will make a little more sense if I first describe my data: portfolios is a dataframe with 13 columns of varying lengths. The column headers are dates in YYYYMMDD format. Below each date header are identifiers which are five digit numeric codes. A snippet of portfolios looks like this (some elements in some columns contain NaN):

    20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0    93044.0   93044.0   13264.0   13264.0   89169.0   82486.0   91274.0   
1    79702.0   91515.0   90710.0   81148.0   47387.0   88359.0   93353.0   
2    85751.0   85724.0   88810.0   11513.0   85576.0   47387.0   85576.0

The returns dataframe originally consists of three columns and 799 rows and looks like this (all elements are populated with values):

     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766

Desired Output

I would like to make a third dataframe that is structured identically to portfolios. That is, it will have the same column header dates and the same number of rows in each column as does portfolios, but instead of identifiers, it will contain the MORET for the appropriate identifier/date combination. This is the reason for the concatenations in my code above - I am trying (perhaps unnecessarily) to create unique lookup values so I can communicate between portfolios and returns. For example, to populate dataframe3[0,0], I would look for the concatenated values from portfolios[0,0] and headers[0] (i.e. 9304420131231) in returns['concat'] and return the associated value in returns['MORET'] (i.e. -0.022304). I am stuck here on how to use the concatenated values to return my desired data.
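
For reference, the concatenated-key lookup described above could be sketched roughly as follows. This is only a minimal sketch on small made-up frames that mimic the snippets (the 20131130 column and its NaN are illustrative assumptions, as is using a plain dict for the lookup); dataframe3 is the placeholder name from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two CSV files (illustrative values only).
portfolios = pd.DataFrame({
    "20131231": [93044.0, 79702.0, 85751.0],
    "20131130": [93044.0, 91515.0, np.nan],
})
returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 85751, 85576],
    "FROMDATE": [20131231, 20131231, 20131231, 20131231],
    "MORET": [-0.022304, 0.012283, -0.016453, 0.038766],
})

# Concatenated PERMNO+FROMDATE string -> MORET
keys = returns["PERMNO"].astype(str) + returns["FROMDATE"].astype(str)
lookup = dict(zip(keys, returns["MORET"]))

# For each date column, build the same concatenated key and map it.
# NaN identifiers are dropped first and come back as NaN on alignment.
dataframe3 = pd.DataFrame(
    {
        col: (portfolios[col].dropna().astype(int).astype(str) + col).map(lookup)
        for col in portfolios.columns
    },
    index=portfolios.index,
)
print(dataframe3)
```

Keys that have no match in returns simply come back as NaN, so the result keeps the shape of portfolios.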

Any thoughts are greatly appreciated.

3 Answers

IIUC:

Use melt to reshape portfolios into long form so that we can merge in the values from returns on the desired columns, then use pivot to reshape the data back, as seen below.

portfolios.columns = portfolios.columns.astype(int)
newdf = (
    portfolios.reset_index()
    .melt(id_vars='index', var_name='FROMDATE', value_name='PERMNO')
    .merge(returns, on=['FROMDATE', 'PERMNO'], how='left')
    .pivot(index='index', columns='FROMDATE', values='MORET')
)

Which returns the DataFrame below:

FROMDATE  20130630  20130731  20130831  20130930  20131031  20131130  20131231
index
0              NaN       NaN       NaN       NaN       NaN       NaN -0.022304
1              NaN       NaN       NaN       NaN       NaN       NaN  0.012283
2              NaN       NaN       NaN       NaN       NaN       NaN -0.016453

Sort columns

newdf.loc[:,newdf.columns.sort_values(ascending=False)]
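
If the goal is to match the column order of portfolios exactly (rather than sort), reindexing against the original columns may be an alternative. The following is a minimal sketch on made-up data (the 20131130 return of 0.005 is invented for illustration); dropping the NaN identifiers and casting to integers before the merge is my assumption about how to keep the key dtypes aligned, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy data; column headers are already integers here.
portfolios = pd.DataFrame({
    20131231: [93044.0, 79702.0, 85751.0],
    20131130: [93044.0, 91515.0, np.nan],
})
returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 85751, 93044],
    "FROMDATE": [20131231, 20131231, 20131231, 20131130],
    "MORET": [-0.022304, 0.012283, -0.016453, 0.005],
})

long_form = (
    portfolios.reset_index()
    .melt(id_vars="index", var_name="FROMDATE", value_name="PERMNO")
    .dropna(subset=["PERMNO"])  # empty slots have nothing to look up
    .astype({"FROMDATE": "int64", "PERMNO": "int64"})
    .merge(returns, on=["FROMDATE", "PERMNO"], how="left")
)

newdf = (
    long_form.pivot(index="index", columns="FROMDATE", values="MORET")
    # Restore the original row and column layout of portfolios.
    .reindex(index=portfolios.index, columns=portfolios.columns)
)
print(newdf)
```

Unlike sorting, reindex also works when the source columns are not in strictly descending order.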

7 Comments

When trying this, I am receiving a ValueError: Cannot convert non-finite values (NA or inf) to integer
Do you have NaN values in your original dataframe? You could try replacing .astype(int) with .astype(float) @CodingNewb
@CodingNewb, I was using astype(int) because the column names were stored as strings, I edited the post to change the column names before the merge, and removed .astype(int) from the main line, let me know if this works
This is fantastic, and I've learned a lot breaking this into parts. One final question: Is it possible to sort the output with the columns in reverse (descending) order? My portfolios file that I read in is sorted in descending order, but the output of the pivot table is in reverse. Thank you.
@CodingNewb, use .loc instead of iloc, sorry about that

What you are trying to do is much simpler than how you tried doing it. You can first melt portfolios to flip it and collect all the date columns as rows in a single column, then join it with returns, and finally pivot to get the desired result. This is basically what @djk47463 did in one compound line, and my edited answer serves as a step-by-step breakdown of his.

Let's create your DataFrames to make the answer reproducible.

import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# Create df
rawText = StringIO("""
     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766
4     93044  20131010 -0.02
5     79702  20131010  0.01
6     85751  20131010 -0.01
7     85576  20131010  0.03
""")
returns = pd.read_csv(rawText, sep=r"\s+")
portfolios = pd.DataFrame({'20131010':[93044, 85751],
                       '20131231':[85576, 79702]})

Notice that the FROMDATE column of returns consists of numbers, but in portfolios the date column headers are strings. We must make them consistent:

returns['FROMDATE'] = returns['FROMDATE'].astype(str)

Let's start the solution by melting (i.e. unpivoting) portfolios into a new frame, pm:

pm = portfolios.melt(var_name='FROMDATE', value_name='PERMNO')
# pm:
   FROMDATE  PERMNO
0  20131010   93044
1  20131010   85751
2  20131231   85576
3  20131231   79702

Now hold pm constant and merge returns onto its rows wherever both PERMNO and FROMDATE match:

merged = pm.merge(returns, how='left', on=['PERMNO', 'FROMDATE'])
# merged: 
   FROMDATE  PERMNO     MORET
0  20131010   93044 -0.020000
1  20131010   85751 -0.010000
2  20131231   85576  0.038766
3  20131231   79702  0.012283
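
If some rows unexpectedly come back as NaN after the merge, one way to check whether the keys actually matched is the indicator option of merge. This is a small sketch on made-up rows; the identifier 99999 is hypothetical and deliberately absent from returns.

```python
import pandas as pd

pm = pd.DataFrame({
    "FROMDATE": ["20131010", "20131010", "20131231"],
    "PERMNO": [93044, 85751, 99999],  # 99999 is a made-up id with no return
})
returns = pd.DataFrame({
    "PERMNO": [93044, 85751],
    "FROMDATE": ["20131010", "20131010"],
    "MORET": [-0.02, -0.01],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left merge.
checked = pm.merge(returns, how="left", on=["PERMNO", "FROMDATE"], indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

Rows flagged left_only are the PERMNO/FROMDATE pairs that found no partner in returns.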

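Remember we melted (unpivoted) portfolios at the beginning? We now pivot the result back to wide form, with one column per date. Note that pivoting on PERMNO indexes the rows by identifier rather than by the original row position in portfolios: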

final = merged.pivot(index='PERMNO', columns='FROMDATE', values='MORET').reset_index()
# final: 
FROMDATE  PERMNO  20131010  20131231
0          79702       NaN  0.012283
1          85576       NaN  0.038766
2          85751     -0.01       NaN
3          93044     -0.02       NaN
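
To get rows in the same positions as in portfolios instead, one option is to carry the original row label through the melt and pivot on it. The sketch below assumes pandas 1.1 or later (for the ignore_index argument of melt); the column name "row" is just an illustrative choice.

```python
import pandas as pd

returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 85751, 85576] * 2,
    "FROMDATE": ["20131231"] * 4 + ["20131010"] * 4,
    "MORET": [-0.022304, 0.012283, -0.016453, 0.038766, -0.02, 0.01, -0.01, 0.03],
})
portfolios = pd.DataFrame({"20131010": [93044, 85751],
                           "20131231": [85576, 79702]})

# Keep the original row labels while melting, then expose them as a column.
pm = portfolios.melt(var_name="FROMDATE", value_name="PERMNO", ignore_index=False)
pm = pm.rename_axis("row").reset_index()

merged = pm.merge(returns, how="left", on=["PERMNO", "FROMDATE"])

# Pivot on the row label so the result lines up with portfolios.
final = merged.pivot(index="row", columns="FROMDATE", values="MORET")
print(final)
```

With this variant each cell of final holds the MORET for the identifier at the same row and date column of portfolios.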

5 Comments

Thank you for this; the pivot table is very helpful. However, when I do the intersection, new_df is an Empty DataFrame. One thought I had on why it is empty: The column headers in portfolios are in descending order (20131231, 20131130,etc), but the column headers in new_df (prior to the intersection) are in ascending order. However, I can't figure out how to reverse the column order in the pivot table. Could this be the reason for the Empty DataFrame?
Not that, but it can be that the column names of one are strings, and the other numbers. Will be as easy as fixing the common. Could you give me the result of portfolio.columns and new_df.columns? If they are large please give me 5 of each.
Pretty sure that is the reason. Can you please let me know if this solves it: common = [e for e in new_df.columns if str(e) in portfolio.columns] and then you'll again do: new_df = new_df[common].
Yes - that was the reason. portfolios.columns contains strings: Index(['20131231', '20131130', '20131031', '20130930', '20130831', and new_df.columns contains ints: Int64Index([20120131, 20120229, 20120331, 20120430, 20120531,... The code now returns new_df with the correct number of columns, but the columns contain MORET for every PERMNO with a MORET whether or not the PERMNO is present in the associated month from portfolios. That is, while every PERMNO in returns has a MORET in each FROMDATE, not every PERMNO exists in every column of portfolios.
@CodingNewb I see your point. I overlooked this final step, it will be better if you don't immediately pivot the returns, but first melt portfolios, then join with returns, then pivot; which is basically what djk47463 did. I'll edit my answer to take care of the full process in small pieces, which will serve as a breakdown and step-by-step explanation, but as the owner of the first complete answer, djk47463 deserves the checkmark. Thank you for the good question!

A typical way to do a VLOOKUP in Python is to create a Series with what would be your left column in the index, and then slice that Series by the lookup value. The NaNs complicate it a little. We'll make a Series from returns by using the set_index method to set PERMNO as the index of the dataframe, and then slicing by the column name to isolate the MORET column as a Series.

import numpy as np

lookupseries = returns.set_index('PERMNO')['MORET']
def lookup(x):
    # Return NaN for identifiers (including NaN itself) not present in the index
    try:
        return lookupseries[x]
    except KeyError:
        return np.nan
newdf = portfolios.copy()
for c in newdf.columns:
    newdf[c] = newdf[c].apply(lookup)
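
The lookup above keys on PERMNO alone. Since returns holds one row per PERMNO/FROMDATE pair, a date-aware variant might key on both values. The following is a sketch under that assumption, using a plain dict rather than a Series; the data is made up, and lookup_moret is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 93044],
    "FROMDATE": [20131231, 20131231, 20131130],
    "MORET": [-0.022304, 0.012283, 0.005],  # illustrative values
})
portfolios = pd.DataFrame({
    "20131231": [93044.0, 79702.0],
    "20131130": [93044.0, np.nan],
})

# (PERMNO, FROMDATE) -> MORET, with plain Python ints as keys
lookup = {
    (int(p), int(d)): m
    for p, d, m in zip(returns["PERMNO"], returns["FROMDATE"], returns["MORET"])
}

def lookup_moret(permno, date):
    """Return MORET for the pair, or NaN if the id is missing or unmatched."""
    if pd.isna(permno):
        return np.nan
    return lookup.get((int(permno), int(date)), np.nan)

newdf = portfolios.copy()
for c in newdf.columns:
    newdf[c] = [lookup_moret(p, c) for p in portfolios[c]]
print(newdf)
```

Here the column header supplies the date half of the key, so the same PERMNO can return different values in different months.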
