
I am trying to write Python Pandas code to merge the data in two DataFrames, with the new DataFrame's data replacing the old DataFrame's data if the index and columns are identical. There seems to be a bug in Pandas that sometimes causes the column names to be mixed up.

Here is an example. First, create the two DataFrames (DataFrame has already been imported with from pandas import DataFrame):

In [1]: df1 = DataFrame([[1, 2, 3, 4]]*3, columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])

In [2]: df2 = DataFrame([[30, 10, 40, 20]]*3, columns=["C3", "A1", "D4", "B2"], index=[1, 2, 3])

In [3]: df1
Out[3]:
   A1  B2  C3  D4
0   1   2   3   4
1   1   2   3   4
2   1   2   3   4

[3 rows x 4 columns]

In [4]: df2
Out[4]:
   C3  A1  D4  B2
1  30  10  40  20
2  30  10  40  20
3  30  10  40  20

[3 rows x 4 columns]

Observe that df2 has the same columns but in a different order. The data is the same as 10*df1.
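As a quick check of that claim, here is a minimal sketch that reorders df2's columns to match df1 and compares the raw values. The pd alias and the same_as_10x name are only for illustration and are not part of the session above.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4]] * 3, columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])
df2 = pd.DataFrame([[30, 10, 40, 20]] * 3, columns=["C3", "A1", "D4", "B2"], index=[1, 2, 3])

# Reorder df2's columns to match df1, then compare raw values so that the
# differing row labels (0..2 versus 1..3) do not take part in the comparison.
same_as_10x = bool((df2[df1.columns].values == 10 * df1.values).all())
print(same_as_10x)
```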

Now merge them:

In [5]: merge_df = DataFrame(index=df1.index.union(df2.index), columns=df1.columns.union(df2.columns))

In [6]: merge_df.loc[df1.index, df1.columns] = df1

In [7]: merge_df.loc[df2.index, df2.columns] = df2

In [8]: merge_df
Out[8]:
   A1  B2  C3  D4
0   1   2   3   4
1  10  20  30  40
2  10  20  30  40
3  10  20  30  40

[4 rows x 4 columns]

This works as expected.

Now redefine df2 so that it has the same index as df1.

In [9]: df2 = DataFrame([[30, 10, 40, 20]]*3, columns=["C3", "A1", "D4", "B2"], index=[0, 1, 2])

In [10]: df2
Out[10]:
   C3  A1  D4  B2
0  30  10  40  20
1  30  10  40  20
2  30  10  40  20

[3 rows x 4 columns]

Then merge using the same code as before:

In [11]: merge_df = DataFrame(index=df1.index.union(df2.index), columns=df1.columns.union(df2.columns))

In [12]: merge_df.loc[df1.index, df1.columns] = df1

In [13]: merge_df.loc[df2.index, df2.columns] = df2

In [14]: merge_df
Out[14]:
   A1  B2  C3  D4
0  30  10  40  20
1  30  10  40  20
2  30  10  40  20

[3 rows x 4 columns]

Why are the column names and data mixed up? Am I using .loc wrong? Changing that last line to .ix does not fix the problem. It only works if I do this:

In [15]: merge_df = DataFrame(index=df1.index.union(df2.index), columns=df1.columns.union(df2.columns))

In [16]: merge_df.loc[df1.index, df1.columns] = df1

In [17]: merge_df[df2.columns] = df2

In [18]: merge_df
Out[18]:
   A1  B2  C3  D4
0  10  20  30  40
1  10  20  30  40
2  10  20  30  40

[3 rows x 4 columns]

That is the desired result.
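One way to avoid depending on how .loc aligns a DataFrame on the right-hand side is to make the column order explicit and assign raw values. The sketch below does that with a hypothetical helper named overwrite (not a pandas function). It runs on a recent pandas; I have not confirmed that it sidesteps the problem on 0.13.1.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4]] * 3, columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])
df2 = pd.DataFrame([[30, 10, 40, 20]] * 3, columns=["C3", "A1", "D4", "B2"], index=[0, 1, 2])

merge_df = pd.DataFrame(index=df1.index.union(df2.index),
                        columns=df1.columns.union(df2.columns))


def overwrite(target, source):
    """Hypothetical helper: copy source into target by label, in place."""
    # Take the shared columns in the target's order, so labels and values
    # are paired explicitly instead of through automatic alignment.
    cols = [c for c in target.columns if c in source.columns]
    # .values is a plain NumPy array: rows follow source.index and columns
    # follow cols, which is exactly the order of the two indexers used here.
    target.loc[source.index, cols] = source[cols].values


overwrite(merge_df, df1)  # old data first
overwrite(merge_df, df2)  # new data replaces old where labels match
print(merge_df)
```

Because the right-hand side is an array rather than a DataFrame, the pairing of labels and values is fully determined by the two indexers on the left.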

I may be doing something wrong here, but if I am, there is something important I do not understand about DataFrames and I could be making similar mistakes elsewhere in my code. If that is the case, please explain.

I can't check the Pandas GitHub bug tracker because that website is blocked at work. Any help would be appreciated.

In [19]: pd.__version__
Out[19]: '0.13.1'
  • That was probably a bug, as this now works correctly with a more recent pandas (I tried it in 0.15.2). Are you able to update? Commented Dec 30, 2014 at 21:24
  • Not easily. I am in a team of 10-15 people and that would take some coordination. It can be done though. Is anyone familiar with this bug, and does anyone know the minimum version I must upgrade to? Commented Dec 30, 2014 at 22:56
  • I just tried this on my computer at home, which is also using 0.13.1. Frustratingly, I could not duplicate the problem. This happened on two different machines at work. Are there any Pandas dependencies I should be looking at? Not sure what to do next. Any ideas would be appreciated. Commented Dec 31, 2014 at 3:57
  • Just tested it again at work and I can still reproduce the bug. At work I am using Windows + Anaconda Python 2.7.6. At home I am using Linux (Ubuntu) and Python 2.7. Any ideas of what to do here? Commented Dec 31, 2014 at 14:15
  • I can reproduce this bug with Panels as well. Commented Jan 2, 2015 at 21:16

1 Answer

According to jreback on GitHub, I need to upgrade to at least Pandas 0.14.0:

https://github.com/pydata/pandas/issues/9200
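Independently of the upgrade, the merge described in the question (new data replacing old wherever labels overlap) can also be sketched with DataFrame.combine_first, which aligns both frames by label. This is an alternative, not what the linked issue recommends, and I have only run it on a recent pandas. One difference to keep in mind: a missing value in df2 is filled from df1 rather than overwriting it.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4]] * 3, columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])
df2 = pd.DataFrame([[30, 10, 40, 20]] * 3, columns=["C3", "A1", "D4", "B2"], index=[1, 2, 3])

# df2's values take priority; df1 only fills the cells that df2 does not have.
# The result covers the union of both indexes and both sets of columns.
merged = df2.combine_first(df1)
print(merged)
```

Row 0 exists only in df1, so it keeps df1's values; rows 1 to 3 come from df2.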
