257

I would like to merge two DataFrames and keep the index from the first frame as the index of the merged dataset. However, when I do the merge, the resulting DataFrame has an integer index. How can I specify that I want to keep the index from the left DataFrame?

In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3}, 
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3}, 
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')

EDIT: Switched to example code that can be easily reproduced

2
  • 2
    If you merge on a specific column, it is not clear which index to use (in case the two differ). Commented Aug 23, 2018 at 20:56
  • 13
    It is pretty clear if you do a left or right merge for example. Commented Jul 16, 2021 at 11:26

10 Answers

301
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
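
A minimal sketch of that caveat, assuming a hypothetical right-hand frame `b_dup` (not part of the question's data) in which the key 3 appears twice. The `drop duplicates` step shown here is only one possible clean-up; which match to keep depends on your data.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
# Hypothetical right frame: key 3 occurs twice, so row 'b' matches two rows.
b_dup = pd.DataFrame({"col2": [1, 2, 3], "to_merge_on": [1, 3, 3]})

merged = a.reset_index().merge(b_dup, how="left").set_index("index")
# The label 'b' now appears twice in the index of the merged frame.
print(merged.index.tolist())

# One possible clean-up: keep only the first match per original label.
deduped = merged[~merged.index.duplicated(keep="first")]
print(deduped.index.tolist())
```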

14 Comments

Very clever. a.merge(b, how="left").set_index(a.index) also works, but it seems less robust (since the first part of it loses the index values of a before they are set again).
For this particular case, those are equivalent. But for many merge operations, the resulting frame does not have the same number of rows as the original a frame. reset_index moves the index to a regular column, and calling set_index on that column after the merge also handles the cases where rows of a are duplicated or removed by the merge operation.
@Wouter I'd love to know why a left merge will reindex by default. Where can I learn more?
Nice! To avoid explicitly specifying the index-name I use a.reset_index().merge(b, how="left").set_index(a.index.names).
Pandas badly thought API strikes again.
35

You can copy the index of the left DataFrame into a column and then do the merge.

a['copy_index'] = a.index
a.merge(b, how='left')

I found this simple method very useful while working with large DataFrames and using pd.merge_asof() (or dd.merge_asof()).

This approach is preferable when resetting the index is expensive (large DataFrame).
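
A minimal sketch of the idea using the question's frames. It uses `assign` instead of in-place assignment so that `a` is not modified, which is a choice made for this illustration rather than part of the answer; the names `merged` and `restored` are likewise only illustrative.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
b = pd.DataFrame({"col2": [1, 2, 3], "to_merge_on": [1, 3, 5]})

# assign() returns a new frame, so `a` itself is left unmodified.
merged = a.assign(copy_index=a.index).merge(b, how="left")

# The result still has a fresh integer index, but the original labels
# travel along in the 'copy_index' column.
print(merged["copy_index"].tolist())

# Optionally promote the column back to the index afterwards.
restored = merged.set_index("copy_index")
```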

5 Comments

This is the best answer. There are many reasons why you would want to preserve your old indexes during a merge (and the accepted answer doesn't preserve indexes, it just resets them). It helps when you're trying to merge more than 2 dataframes, and so on...
Upvoted, but be wary of one caveat: when using a multi-index, your indices will be stored as tuples in a single column called a['copy_index'].
What I am reading in the docs about merge_asof indicates it is not using the index to join, it is using the closest index to join. You also have to have your data sorted a certain way so the closest index joins properly.
This is just a less elegant version of the reset_index() solution. @MartienLubberink is incorrect, as reset_index() stores the index as a column by default.
@Migwell Incorrect. reset_index() is way sluggish when the data is large. As mentioned, the answer is superior when resetting index is expensive.
12

There is a non-pd.merge solution using Series.map and DataFrame.set_index.

a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])

   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn't introduce a dummy index name for the index.

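Note, however, that this lookup relies on Series.map, which works on one column at a time, so the approach does not extend directly to multiple columns. (DataFrame.map, added in pandas 2.1, applies a function elementwise and is not a lookup equivalent.)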
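
If several value columns are needed from the right frame, one possible workaround is to map once per column. A minimal sketch, assuming a hypothetical right frame `b2` with an extra column `col3` (neither is in the question) and assuming the key column has no duplicates in `b2`, since Series.map needs unique lookup keys:

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
# Hypothetical right frame with a second value column, 'col3'.
b2 = pd.DataFrame({"to_merge_on": [1, 3, 5],
                   "col2": [1, 2, 3],
                   "col3": ["x", "y", "z"]})

# Build the lookup table once; its index is the merge key.
lookup = b2.set_index("to_merge_on")

result = a.copy()
for col in ["col2", "col3"]:
    # Each map call looks up one value column by the key in 'to_merge_on'.
    result[col] = result["to_merge_on"].map(lookup[col])
```

The index of `result` is the index of `a` throughout, because no merge is performed.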

4 Comments

This seems superior to the accepted answer as it will probably work better with edge cases like multi indexes. Can anyone comment on this?
question, what if you need to assign multiple columns, would this approach work or is it limited to only 1 field?
@Yuca: This possibly won't work with multiple columns, since when you subset multiple columns you end up with a pd.DataFrame and not a pd.Series. The .map() method used here is the one defined on pd.Series. That is, a[['to_merge_on_1', 'to_merge_on_2']].map(...) won't work.
Brilliant. In my project we are using too many pandas tricks everywhere. This is very refreshing as it is straightforward and low level. Thank you!
10
df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This preserves the index of df1, because the merge is done on the indexes of both frames.
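
A minimal sketch of an index-on-index merge, assuming two hypothetical frames `df1` and `df2` whose indexes share labels (unlike the frames in the question, which have to be matched on a column):

```python
import pandas as pd

# Hypothetical frames that share index labels.
df1 = pd.DataFrame({"col1": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"col2": [10, 20]}, index=["a", "c"])

# Both sides are matched on their index, so the labels are carried over.
merged = df1.merge(df2, how="inner", left_index=True, right_index=True)
print(merged.index.tolist())
```

With how="inner", only the labels present in both frames survive, so 'b' is dropped here.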

2 Comments

It seems to work, but when I use it with on=list_of_cols, it contradicts the documentation: If joining columns on columns, the DataFrame indexes *will be ignored*. Does one of indices vs. columns take precedence?
@Supratik Majumdar doesn't your suggestion assume the indexes of the dataframes already match? The OP has non-matching indexes and is merging/joining on columns.
7

You can also use the DataFrame.join() method to achieve the same thing. The join method preserves the original index. The column to join on can be specified with the on parameter.

In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]: 
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

1 Comment

Seems faster than the merge-based solutions.
3

Assuming that the resulting df has the same number of rows and order as your first df, you can do this:

c = pd.merge(a, b, on='to_merge_on', how='left')
c.set_index(a.index, inplace=True)

2

Another simple option is to set the index back to what it was before:

a.merge(b, how="left").set_axis(a.index)

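A left merge preserves the row order of DataFrame 'a' and only resets the index, so set_axis is safe to use as long as the merge does not add rows (that is, no row of 'a' matches more than one row of 'b').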
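
A minimal sketch of this approach on the question's frames, assuming a pandas version in which set_axis returns a new object. The `validate="many_to_one"` argument is an addition for this illustration, not part of the answer: it makes merge raise a MergeError if a key repeats in the right frame, so the row count cannot silently grow before the labels are reattached.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
b = pd.DataFrame({"col2": [1, 2, 3], "to_merge_on": [1, 3, 5]})

# validate="many_to_one" checks that merge keys are unique in `b`,
# so the left merge returns exactly len(a) rows in the order of `a`.
merged = a.merge(b, how="left", validate="many_to_one").set_axis(a.index)
print(merged.index.tolist())
```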

3 Comments

This didn't work for me
Nope. Doesn't work..
this is the correct syntax: a.merge(b, how="left").set_index(a.index)
0

Think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then removed the index name of Line Number so that it remains blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.

0

For people who want to keep the left index as it was before the left join:

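import pandas


def left_join(
    a: pandas.DataFrame, b: pandas.DataFrame, on: list[str], b_columns: list[str] = None
) -> pandas.DataFrame:
    if b_columns:
        # Ordered, de-duplicated list: recent pandas rejects a set as an indexer.
        b_columns = list(dict.fromkeys(on + b_columns))
        b = b[b_columns]
    df = (
        a.reset_index()
        .merge(
            b,
            how="left",
            on=on,
        )
        # reset_index() names an unnamed single-level index "index".
        .set_index(keys=[x or "index" for x in a.index.names])
    )
    # Restore the original index names (None for an unnamed index).
    df.index.names = a.index.names
    return df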

0

I've always preferred to do it as pd.merge(df1, df2) and explicitly mention the columns I'm using for the merge. Here's what works for me (pandas 2.0.3):

a = pd.merge(a, b, left_on='to_merge_on', right_on='to_merge_on', how="left").set_axis(a.index)

@lisrael1 did mention set_axis as a solution but I'm not sure why it didn't work for some people. NB: this only works for how='left', which was OP's original question
