I am trying to write Python Pandas code to merge the data in two DataFrames, with the new DataFrame's data replacing the old DataFrame's data if the index and columns are identical. There seems to be a bug in Pandas that sometimes causes the column names to be mixed up.
Here is an example (the session assumes `import pandas as pd` and `from pandas import DataFrame`). First, create the two DataFrames:
In [1]: df1 = DataFrame([[1, 2, 3, 4]]*3, columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])
In [2]: df2 = DataFrame([[30, 10, 40, 20]]*3, columns=["C3", "A1", "D4", "B2"], index=[1, 2, 3])
In [3]: df1
Out[3]:
A1 B2 C3 D4
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
[3 rows x 4 columns]
In [4]: df2
Out[4]:
C3 A1 D4 B2
1 30 10 40 20
2 30 10 40 20
3 30 10 40 20
[3 rows x 4 columns]
Observe that df2 has the same columns but in a different order. The data is the same as 10*df1.
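To make that claim concrete, here is a minimal sketch of how it could be checked. The names `same_labels` and `same_data` are just illustrative, and the comparison goes through `.values` because the two frames have different indexes:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4]] * 3,
                   columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])
df2 = pd.DataFrame([[30, 10, 40, 20]] * 3,
                   columns=["C3", "A1", "D4", "B2"], index=[1, 2, 3])

# Same set of column labels, regardless of their order.
same_labels = set(df1.columns) == set(df2.columns)

# Put df2's columns into df1's order, then compare the raw values
# positionally (the row labels differ, so label alignment is avoided here).
same_data = bool((df2[df1.columns].values == (10 * df1).values).all())
```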
Now merge them:
In [5]: merge_df = DataFrame(index=df1.index.union(df2.index), columns=df1.columns.union(df2.columns))
In [6]: merge_df.loc[df1.index, df1.columns] = df1
In [7]: merge_df.loc[df2.index, df2.columns] = df2
In [8]: merge_df
Out[8]:
A1 B2 C3 D4
0 1 2 3 4
1 10 20 30 40
2 10 20 30 40
3 10 20 30 40
[4 rows x 4 columns]
This works as expected.
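As a cross-check on what I mean by "expected", the same merge can probably be written with `combine_first`, which as far as I understand aligns on both index and column labels. This is only a sketch under that assumption, not something I am claiming behaves identically on every pandas version; `merged` is an illustrative name:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4]] * 3,
                   columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])
df2 = pd.DataFrame([[30, 10, 40, 20]] * 3,
                   columns=["C3", "A1", "D4", "B2"], index=[1, 2, 3])

# Values from df2 take precedence; df1 only fills in the labels
# that df2 does not cover (here, row 0).
merged = df2.combine_first(df1)

# Select by label so the check does not depend on the resulting column order.
print(merged.loc[[0, 1, 2, 3], ["A1", "B2", "C3", "D4"]])
```

Note that the resulting dtype may be float rather than int, depending on the pandas version.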
Now redefine df2 so that it has the same index as df1:
In [9]: df2 = DataFrame([[30, 10, 40, 20]]*3, columns=["C3", "A1", "D4", "B2"], index=[0, 1, 2])
In [10]: df2
Out[10]:
C3 A1 D4 B2
0 30 10 40 20
1 30 10 40 20
2 30 10 40 20
[3 rows x 4 columns]
Then merge using the same code as before:
In [11]: merge_df = DataFrame(index=df1.index.union(df2.index), columns=df1.columns.union(df2.columns))
In [12]: merge_df.loc[df1.index, df1.columns] = df1
In [13]: merge_df.loc[df2.index, df2.columns] = df2
In [14]: merge_df
Out[14]:
A1 B2 C3 D4
0 30 10 40 20
1 30 10 40 20
2 30 10 40 20
[3 rows x 4 columns]
Why are the column names and data mixed up? Am I using .loc wrong? Changing that last line to .ix does not fix the problem. It only works if I do this:
In [15]: merge_df = DataFrame(index=df1.index.union(df2.index), columns=df1.columns.union(df2.columns))
In [16]: merge_df.loc[df1.index, df1.columns] = df1
In [17]: merge_df[df2.columns] = df2
In [18]: merge_df
Out[18]:
A1 B2 C3 D4
0 10 20 30 40
1 10 20 30 40
2 10 20 30 40
[3 rows x 4 columns]
That is the desired result.
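Another variant that I would expect to sidestep the question of label alignment is to assign the raw values, so that the assignment is purely positional. This is a sketch that assumes the row and column order of the `.loc` indexer matches the order of the array on the right-hand side, which holds here because both come from the same frame:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4]] * 3,
                   columns=["A1", "B2", "C3", "D4"], index=[0, 1, 2])
df2 = pd.DataFrame([[30, 10, 40, 20]] * 3,
                   columns=["C3", "A1", "D4", "B2"], index=[0, 1, 2])

merge_df = pd.DataFrame(index=df1.index.union(df2.index),
                        columns=df1.columns.union(df2.columns))

# .values strips the labels, so each column of the array lands in the
# column named at the same position in the indexer (df2.columns).
merge_df.loc[df1.index, df1.columns] = df1.values
merge_df.loc[df2.index, df2.columns] = df2.values

print(merge_df)
```

The drawback is that nothing checks the labels any more, so this only makes sense when the indexer and the values are taken from the same DataFrame.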
I may be doing something wrong here, but if I am, there is something important I do not understand about DataFrames and I could be making similar mistakes elsewhere in my code. If that is the case, please explain.
I can't check the Pandas GitHub bug tracker as that website is blocked at work. Any help would be appreciated.
In [19]: pd.__version__
Out[19]: '0.13.1'