Suppose I have three data structures:

  1. A data frame df1, with columns A, B, C of length 10000
  2. A data frame df2, with columns A, some extra misc. columns... of length 8000
  3. A Python list labels of length 8000, where the element at index i corresponds with row i in df2.

I'm trying to build a data frame that, for every element in df2.A, pairs the matching row from df1 with the corresponding entry in labels. It's possible that an entry in df2.A is NOT present in df1.A.

Currently, I do this with a for i in xrange(len(df2)) loop: I check whether df2.A.iloc[i] is present in df1.A and, if it is, store df1.A, df1.B, df1.C, labels[i] in a dictionary, with the first element as the key and the remaining elements as a list.

Is there a more efficient way to do this and store the outputs df1.A, df1.B, df1.C, labels[i] in a four-column data frame? The for loop is really slow.

Sample data:

df1
A       B       C
'uid1'  'Bob'   'Rock'
'uid2'  'Jack'  'Pop'
'uid5'  'Cat'   'Country'
...

df2
A
'uid10'
'uid3'
'uid1'
...

labels
[label10, label3, label1, ...]
  • Can you post data? There may be subtle problems with the various approaches. So essentially df2 is your master df, and you want to create a new df from the rows where df2.A is in df1.A, using the row values from df1 and the corresponding labels. Is this correct? Commented Oct 16, 2014 at 8:10
  • I posted some example data above. df2 is basically only being used so I know which relevant element of labels I want to join with a row of df1. But yeah, you understood what I'm trying to do correctly. Commented Oct 16, 2014 at 8:15

1 Answer

OK, from what I understand, the following should work:

# create a new column for your labels, this will align to your index
df2['labels'] = labels
# now merge the rows from df1 on column 'A'
df2 = df2.merge(df1, on='A', how='left')

Example:

import io
import pandas as pd

# set up my sample data
temp="""A       B       C
'uid1'  'Bob'   'Rock'
'uid2'  'Jack'  'Pop'
'uid5'  'Cat'   'Country'"""

temp1="""A
'uid10'
'uid3'
'uid1'"""
labels = ['label10', 'label3', 'label1']
df1 = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df2 = pd.read_csv(io.StringIO(temp1))

In [97]:
# do the work
df2['labels'] = labels
df2 = df2.merge(df1, on='A', how='left')
df2
Out[97]:
         A   labels      B       C
0  'uid10'  label10    NaN     NaN
1   'uid3'   label3    NaN     NaN
2   'uid1'   label1  'Bob'  'Rock'

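This will be considerably faster than looping, because the merge is a single vectorized operation. Note that how='left' keeps every row of df2, so entries with no match in df1 get NaN for B and C (rows 0 and 1 above).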
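
If only the rows whose A value appears in both frames are wanted, as the question's four-column output suggests, an inner merge may be a closer fit. The following is a minimal sketch rather than a definitive solution: the sample frames are rebuilt by hand from the question's data (without the literal quote characters), the names left and result are illustrative, and it assumes df1.A has no duplicate keys (duplicates would repeat the matching df2 rows).

```python
import pandas as pd

# Hand-built sample data mirroring the question (assumed, not the asker's real data)
df1 = pd.DataFrame({
    "A": ["uid1", "uid2", "uid5"],
    "B": ["Bob", "Jack", "Cat"],
    "C": ["Rock", "Pop", "Country"],
})
df2 = pd.DataFrame({"A": ["uid10", "uid3", "uid1"]})
labels = ["label10", "label3", "label1"]

# Attach the labels by position; assign() returns a copy, so df2 is not modified
left = df2[["A"]].assign(labels=labels)

# how='inner' keeps only the keys present in both frames,
# then the columns are reordered to A, B, C, labels
result = left.merge(df1, on="A", how="inner")[["A", "B", "C", "labels"]]
print(result)
```

With this sample data only uid1 survives the merge; uid10 and uid3 are dropped because they do not occur in df1.A.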
