Pandas: Joining information from multiple data frames, array

Question

Suppose I have three data structures:

A data frame df1, with columns A, B, C of length 10000
A data frame df2, with columns A, some extra misc. columns... of length 8000
A Python list labels of length 8000, where the element at index i corresponds with row i in df2.

I'm trying to create a data frame that, for every element in df2.A, pairs the matching row from df1 with the corresponding entry in labels. It's possible that an entry in df2.A is NOT present in df1.A.

Currently, I'm doing this with a for i in xrange(len(df2)) loop: I check whether df2.A.iloc[i] is present in df1.A, and if it is, I store df1.A, df1.B, df1.C, labels[i] in a dictionary, with the first element as the key and the remaining elements as a list.

Is there a more efficient way to do this and store the outputs df1.A, df1.B, df1.C, labels[i] in a four-column data frame? The for loop is really slow.

Sample data:

df1
A       B       C
'uid1'  'Bob'   'Rock'
'uid2'  'Jack'  'Pop'
'uid5'  'Cat'   'Country'
...

df2
A
'uid10'
'uid3'
'uid1'
...

labels
[label10, label3, label1, ...]

Can you post sample data? There may be subtle problems with the various approaches. So essentially df2 is your master df, and you want to create a new df for the rows where df2.A is in df1.A, using the row values from df1 and the corresponding labels. Is this correct?
– EdChum, Commented Oct 16, 2014 at 8:10
I posted some example data above. df2 is basically only being used so I know which element of labels to join with a row of df1. But yeah, you understood what I'm trying to do correctly.
– kk415kk, Commented Oct 16, 2014 at 8:15

EdChum · Accepted Answer · 2014-10-16 08:40:50Z

OK, from what I understand, the following should work:

# add the labels as a new column; a list is assigned by position,
# so it must have the same length as df2
df2['labels'] = labels
# now merge in the rows from df1 on column 'A'
df2 = df2.merge(df1, on='A', how='left')

Example:

import io
import pandas as pd

# set up my sample data
temp="""A       B       C
'uid1'  'Bob'   'Rock'
'uid2'  'Jack'  'Pop'
'uid5'  'Cat'   'Country'"""

temp1="""A
'uid10'
'uid3'
'uid1'"""
labels = ['label10', 'label3', 'label1']
# raw string so that \s is passed to the regex engine unchanged
df1 = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df2 = pd.read_csv(io.StringIO(temp1))

In [97]:
# do the work
df2['labels'] = labels
df2 = df2.merge(df1, on='A', how='left')
df2
Out[97]:
         A   labels      B       C
0  'uid10'  label10    NaN     NaN
1   'uid3'   label3    NaN     NaN
2   'uid1'   label1  'Bob'  'Rock'

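This will be considerably faster than looping, because merge performs the whole join in optimized library code instead of doing a Python-level lookup for each row.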
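With how='left', every row of df2 is kept, and rows with no match in df1.A get NaN in B and C. The loop described in the question skips those rows, so an inner merge may be closer to what was asked for. The following is only a sketch under that assumption: it rebuilds the question's sample data by hand, and the names labelled and result are made up for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt without the embedded quote characters.
df1 = pd.DataFrame({
    "A": ["uid1", "uid2", "uid5"],
    "B": ["Bob", "Jack", "Cat"],
    "C": ["Rock", "Pop", "Country"],
})
df2 = pd.DataFrame({"A": ["uid10", "uid3", "uid1"]})
labels = ["label10", "label3", "label1"]

# Attach the labels to a copy of df2's key column (positional assignment,
# so len(labels) must equal len(df2)); df2 itself is left unmodified.
labelled = df2[["A"]].assign(labels=labels)

# how="inner" keeps only the keys present in both frames, which mirrors
# the "if it is present in df1.A" check in the original loop.
result = labelled.merge(df1, on="A", how="inner")[["A", "B", "C", "labels"]]
print(result)
```

If df1.A can contain duplicate keys, each match produces its own output row; passing validate="many_to_one" to merge should raise an error in that case instead of silently multiplying rows.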

edited Oct 16, 2014 at 8:40

answered Oct 16, 2014 at 8:17

EdChum
