0

I have many dataframes. I need to compare the columns across all of them. Ideally I want to return the positions of those columns that do not match as well as the name of columns from df1 and df2 being compared.

Note: I only need to compare the columns, not the data.

Example:


df1 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
df2 = pd.DataFrame(columns=['Nombre', 'Age', 'Address', 'Telefono'])
df3 = pd.DataFrame(columns=['N.', 'A.', 'Address', 'Telephone', 'Email'])
df4 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])

Expected output:

DataFrame_A DataFrame_B Positions Col_df_A Col_df_B
df1 df2 0,3 ['Name', 'Telephone'] ['Nombre', 'Telefono']
df1 df3 0,1,4 ['Name', 'Age'] ['N.', 'A.', 'Email'

What is best way to do this?

from itertools import combinations

dfs_dict = {"df1": df1, "df2": df2, "df3": df3, "df4": df4}

dfs_names = dfs_dict.keys().tolist()

for df1, df2 in combinations(dfs_names, 2):
...

2 Answers 2

1

pandas.DataFrame instances have .columns attribute which you can use following way

import pandas as pd
df1 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
df2 = pd.DataFrame(columns=['Nombre', 'Age', 'Address', 'Telefono'])
df3 = pd.DataFrame(columns=['N.', 'A.', 'Address', 'Telephone', 'Email'])
df4 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
print(df1.columns != df2.columns)  # [ True False False  True]
print(df1.columns != df4.columns)  # [False False False False]

comparison gives True where names are different else False, result is numpy.array so you can easily find position of Trues as follows

import numpy as np
print(np.argwhere(df1.columns != df2.columns).ravel())  # [0 3]
print(np.argwhere(df1.columns != df4.columns).ravel())  # []

I use .ravel() to get flat structure, as we do not need to care about multiple dimensions in this case. Disclaimer: proposed solution is limited to pandas.DataFrames where number of columns is exactly equal.

Sign up to request clarification or add additional context in comments.

Comments

0
dfs_dict = {"df1": df1.columns, "df2": df2.columns, "df3": df3.columns, "df4": df4.columns}
rows = []

for df_name, columns_1 in dfs_dict.items():
  for df_name2, columns_2 in dfs_dict.items():
    if df_name != df_name2:
      differents_A = columns_1.difference(columns_2)
      differents_B = columns_2.difference(columns_1)
      positions = [columns_1.get_loc(column) for column in differents_A]
      rows.append({"Dataframe_A": df_name,
               "Dataframe_B": df_name2,
               "Positions": positions,
               "Col_df_A": list(differents_A),
               "Col_df_B": list(differents_B)})
  
df_diff = pd.DataFrame(rows)
print(df_diff)

Output:

    Dataframe_A Dataframe_B Positions   Col_df_A                Col_df_B
0           df1         df2 [0, 3]      [Name, Telephone]       [Nombre, Telefono]
1           df1         df3 [1, 0]      [Age, Name]             [A., Email, N.]
2           df1         df4 []          []                      []
3           df2         df1 [0, 3]      [Nombre, Telefono]      [Name, Telephone]
4           df2         df3 [1, 0, 3]   [Age, Nombre, Telefono] [A., Email, N., Telephone]
5           df2         df4 [0, 3]      [Nombre, Telefono]      [Name, Telephone]
6           df3         df1 [1, 4, 0]   [A., Email, N.]         [Age, Name]
7           df3         df2 [1, 4, 0, 3][A., Email, N., Telephone]  [Age, Nombre, Telefono]
8           df3         df4 [1, 4, 0]   [A., Email, N.]         [Age, Name]
9           df4         df1 []          []                      []
10          df4         df2 [0, 3]      [Name, Telephone]       [Nombre, Telefono]
11          df4         df3 [1, 0]      [Age, Name]             [A., Email, N.]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.