2

I have a dataframe called df1:

        Long_ID     IndexBegin         IndexEnd
0       10000001        0                  3
1       10000002        3                  6
2       10000003        6                 10            

I have a second dataframe called df2, which can be up to 1 million rows long:

      Short_ID
0        1
1        2
2        3
3        10
4        20
5        30
6        100
7        101
8        102
9        103

I want to link Long_ID to Short_ID in such a way that if (IndexBegin:IndexEnd) is (0:3), then Long_ID gets inserted into df2 at indexes 0 through 2 (IndexEnd - 1). The starting index and ending index are determined using the last two columns of df1.

So that ultimately, my final dataframe looks like this: df3:

      Short_ID       Long_ID
0        1          10000001
1        2          10000001
2        3          10000001
3        10         10000002
4        20         10000002
5        30         10000002
6        100        10000003
7        101        10000003
8        102        10000003
9        103        10000003

First, I tried storing the index of df2 as a key and Short_ID as a value in a dictionary, then iterating row by row, but that was too slow. This led me to learn about vectorization.

Then, I tried using where(), but I got "ValueError: Can only compare identically-labeled Series objects."

df2 = df2.reset_index()
df2['Long_ID'] = df1['Long_ID'] [ (df2['index'] < df1['IndexEnd']) & (df2['index'] >= df1['IndexBegin']) ]

I am relatively new to programming, and I appreciate if anyone can give a better approach to solving this problem. I have reproduced the code below:

df1_data = [(10000001, 0, 3), (10000002, 3, 6), (10000003, 6, 10)]
df1 = pd.DataFrame(df1_data, columns = ['Long_ID', 'IndexBegin', 'IndexEnd'])

df2_data = [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]
df2 = pd.DataFrame(df2_data, columns = ['Short_ID'])

5 Answers 5

5

df2 does not need "IndexEnd" as long as the ranges are contiguous. You may use pd.merge_asof:

(pd.merge_asof(df2.reset_index(), df1, left_on='index', right_on='IndexBegin')
   .reindex(['Short_ID', 'Long_ID'], axis=1))

   Short_ID   Long_ID
0         1  10000001
1         2  10000001
2         3  10000001
3        10  10000002
4        20  10000002
5        30  10000002
6       100  10000003
7       101  10000003
8       102  10000003
9       103  10000003
Sign up to request clarification or add additional context in comments.

Comments

4

Here is one way using IntervalIndex

df1.index=pd.IntervalIndex.from_arrays(left=df1.IndexBegin,right=df1.IndexEnd,closed='left')
df2['New']=df1.loc[df2.index,'Long_ID'].values

Comments

0

you may do :

df3 = df2.copy()
df3['long_ID'] =  df2.merge(df1, left_on =df2.index,right_on = "IndexBegin", how = 'left').Long_ID.ffill().astype(int)

Comments

0

I created a function to solve your question. Hope it helps.

df = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data_2.xlsx')
df

    Long_ID     IndexBegin  IndexEnd
0   10000001    0           3
1   10000002    3           6
2   10000003    6           10

df2 = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data.xlsx')
df2

    Short_ID
0   1
1   2
2   3
3   10
4   20
5   30
6   100
7   101
8   102
9   103

def convert_Short_ID(df1,df2):
    df2['Long_ID'] = None
    for i in range(len(df2)):
        for j in range(len(df)):
            if (df2.index[i] >= df.loc[j,'IndexBegin']) and (df2.index[i] < df.loc[j,'IndexEnd']):
                number = str(df.iloc[j, 0])
                df2.loc[i,'Long_ID'] = df.loc[j, 'Long_ID']
                break
            else:
                df2.loc[i, 'Long_ID'] = np.nan

    df2['Long_ID'] = df2['Long_ID'].astype(str)
    return df2

convert_Short_ID(df,df2)   
    Short_ID    Long_ID
0   1           10000001
1   2           10000001
2   3           10000001
3   10          10000002
4   20          10000002
5   30          10000002
6   100         10000003
7   101         10000003
8   102         10000003
9   103         10000003

Comments

0

Using Numpy to create the data before creating a Data Frame is a better approach since adding elements to a Data Frame is time-consuming. So:

import numpy as np
import pandas as pd

#Step 1: creating the first Data Frame
df1 = pd.DataFrame({'Long_ID':[10000001,10000002,10000003],
                 'IndexBegin':[0,3,6],
                   'IndexEnd':[3,6,10]})

#Step 2: creating the second chunk of data as a Numpy array
Short_ID = np.array([1,2,3,10,20,30,100,101,102,103])

#Step 3: creating a new column on df1 to count Long_ID ocurrences
df1['Qt']=df1['IndexEnd']-df1['IndexBegin']

#Step 4: using append to create a Numpy Array for the Long_ID item 
Long_ID = np.array([])
for i in range(len(df1)):
    Long_ID = np.append(Long_ID, [df1['Long_ID'][i]]*df1['Qt'][i])

#Finally, create the seconc Data Frame using both previous Numpy arrays
df2 = pd.DataFrame(np.vstack((Short_ID, Long_ID)).T, columns=['Short_ID','Long_ID'])
df2
    Short_ID    Long_ID
0        1.0 10000001.0
1        2.0 10000001.0
2        3.0 10000001.0
3       10.0 10000002.0
4       20.0 10000002.0
5       30.0 10000002.0
6      100.0 10000003.0
7      101.0 10000003.0
8      102.0 10000003.0
9      103.0 10000003.0

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.