
What is the easiest way to remove duplicate columns from a dataframe?

I am reading a text file that has duplicate columns via:

import pandas as pd

df=pd.read_table(fname)

The column names are:

Time, Time Relative, N2, Time, Time Relative, H2, etc...

All the Time and Time Relative columns contain the same data. I want:

Time, Time Relative, N2, H2

All my attempts at dropping, deleting, etc., such as:

df=df.T.drop_duplicates().T

Result in uniquely valued index errors:

Reindexing only valid with uniquely valued index objects

Sorry for being a Pandas noob. Any suggestions would be appreciated.


Additional Details

Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)

data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):

Time    Time Relative [s]    N2[%]    Time    Time Relative [s]    H2[ppm]
2/12/2013 9:20:55 AM    6.177    9.99268e+001    2/12/2013 9:20:55 AM    6.177    3.216293e-005    
2/12/2013 9:21:06 AM    17.689    9.99296e+001    2/12/2013 9:21:06 AM    17.689    3.841667e-005    
2/12/2013 9:21:18 AM    29.186    9.992954e+001    2/12/2013 9:21:18 AM    29.186    3.880365e-005    
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756e+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005
2/12/2013 2:12:55 PM    17526.769    9.991754e+001    2/12/2013 2:12:55 PM    17526.769    2.880386e-005
2/12/2013 2:13:07 PM    17538.273    9.991797e+001    2/12/2013 2:13:07 PM    17538.273    3.131447e-005

18 Answers


Here's a one-line solution to remove columns based on duplicate column names:

df = df.loc[:,~df.columns.duplicated()].copy()

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or False for each column. If the value is False, the column name is unique up to that point; if it is True, the column name duplicates one that appeared earlier. For example, with the columns above, the returned value would be [False,False,True].

Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False,False,True])

Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.

Note: the above only checks columns names, not column values.
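
For concreteness, here is a minimal sketch (the toy frame is hypothetical) that walks through those steps on columns ['alpha','beta','alpha']:

import pandas as pd

# toy frame with a repeated column name
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])

print(df.columns.duplicated())               # [False False  True]
print(~df.columns.duplicated())              # [ True  True False]
print(df.loc[:, ~df.columns.duplicated()])   # keeps the first 'alpha' and 'beta'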

To remove duplicated indexes

Since it is similar enough, do the same thing on the index:

df = df.loc[~df.index.duplicated(),:].copy()

To remove duplicates by checking values without transposing

Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.

df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()

This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:

import numpy as np
import pandas as pd

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))


#to see size in gigs
#ldf.memory_usage().sum()/1e9 #it's about 3 gigs

# duplicate a column
ldf.loc[:,'dup'] = ldf.loc[:,101]

# take out duplicated columns by values
ldf = ldf.loc[:,~ldf.apply(lambda x: x.duplicated(),axis=1).all()].copy()
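
To see why the caveat matters, here is a minimal reproduction (a sketch) of DrWhat's counter-example from the comments: no two columns are identical, yet col3 is dropped.

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0], 'col2': [1, 0, 1], 'col3': [1, 1, 0]})

mask = df.apply(lambda x: x.duplicated(), axis=1).all()
print(mask)              # col1: False, col2: False, col3: True
print(df.loc[:, ~mask])  # col3 is dropped even though it equals neither column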

9 Comments

An ideal answer would also work for duplicated values, not just names.
@GrimSqueaker: If you want to consider whether the values are duplicated, you want something like df.T.drop_duplicates().T.
@JohnZwinck: this only works for small dataframes, as there is a limit to the number of columns you can have. For me it failed for a dataframe with 100,000 rows for instance, as this yields 100,000 columns after transposing, which is not possible
NOTE: the 'remove duplicates by checking values without transposing' is cool, but it does not check each column against another single column, but against all other columns. I think this example is correct: if col1 is [0,1,0], col2 [1,0,1] and col3 [1,1,0], then col3 will be deleted.
@drwhat that's a good point and may arise in practice, especially when data are discrete or binary. I'll add the caveat for now until a better alternative can be devised

It sounds like you already know the unique column names. If that's the case, then df = df['Time', 'Time Relative', 'N2'] would work.

If not, your solution should work:

In [101]: vals = np.random.randint(0,20, (4,3))
          vals
Out[101]:
array([[ 3, 13,  0],
       [ 1, 15, 14],
       [14, 19, 14],
       [19,  5,  1]])

In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
          df
Out[106]:
   Time  H1  N2  Time Relative  N2  Time
0     3  13   0              3  13     0
1     1  15  14              1  15    14
2    14  19  14             14  19    14
3    19   5   1             19   5     1

In [107]: df.T.drop_duplicates().T
Out[107]:
   Time  H1  N2
0     3  13   0
1     1  15  14
2    14  19  14
3    19   5   1

You probably have something specific to your data that's messing it up. We could give more help if you share more details about the data.

Edit: Like Andy said, the problem is probably with the duplicate column titles.

For a sample table file 'dummy.csv' I made up:

Time    H1  N2  Time    N2  Time Relative
3   13  13  3   13  0
1   15  15  1   15  14
14  19  19  14  19  14
19  5   5   19  5   1

using read_table gives unique columns and works properly:

In [151]: df2 = pd.read_table('dummy.csv')
          df2
Out[151]:
         Time  H1  N2  Time.1  N2.1  Time Relative
      0     3  13  13       3    13              0
      1     1  15  15       1    15             14
      2    14  19  19      14    19             14
      3    19   5   5      19     5              1
In [152]: df2.T.drop_duplicates().T
Out[152]:
             Time  H1  Time Relative
          0     3  13              0
          1     1  15             14
          2    14  19             14
          3    19   5              1  

If your version doesn't let you, you can hack together a solution to make the names unique:

In [169]: df2 = pd.read_table('dummy.csv', header=None)
          df2
Out[169]:
              0   1   2     3   4              5
        0  Time  H1  N2  Time  N2  Time Relative
        1     3  13  13     3  13              0
        2     1  15  15     1  15             14
        3    14  19  19    14  19             14
        4    19   5   5    19   5              1
In [171]: from collections import defaultdict
          col_counts = defaultdict(int)
          col_ix = df2.first_valid_index()
In [172]: cols = []
          for col in df2.ix[col_ix]:
              cnt = col_counts[col]
              col_counts[col] += 1
              suf = '_' + str(cnt) if cnt else ''
              cols.append(col + suf)
          cols
Out[172]:
          ['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
          df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
          Time  H1  N2 Time_1 N2_1 Time Relative
        1    3  13  13      3   13             0
        2    1  15  15      1   15            14
        3   14  19  19     14   19            14
        4   19   5   5     19    5             1
In [178]: df2.T.drop_duplicates().T
Out[178]:
          Time  H1 Time Relative
        1    3  13             0
        2    1  15            14
        3   14  19            14
        4   19   5             1 

5 Comments

Unfortunately df['Time'] selects all Time columns (i.e. returns a DataFrame), and df['Time', ..] will return the entire DataFrame.
Yeah, it's pretty tedious...hopefully it's just a version difference.
Using double transposes could have unintended side effects like converting numeric types to objects in the case that you have a df with mixed types. See: stackoverflow.com/questions/24682396/…
This solution gives me problems on large dataframes: RecursionError: maximum recursion depth exceeded
Transposing a large data frame will be a slow process

Transposing is inefficient for large DataFrames. Here is an alternative:

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        # materialize the dict views so they can be indexed (needed on Python 3)
        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)

        for i in range(lvs):
            for j in range(i+1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break

    return dups

Use it like this:

dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)

Edit

A memory-efficient version that treats NaNs like any other value:

from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:,i].values
            for j in range(i+1, lcs):
                ja = vs.iloc[:,j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups

1 Comment

array_equivalent is deprecated. Would love an update for this code.

If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution, and with fewer lines than @kalu's function, keeping the first of any similarly named columns.

Cols = list(df.columns)
for i,item in enumerate(df.columns):
    if item in df.columns[:i]: Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP",1)

2 Comments

Your solution doesn't work in my case, it shows me: "ValueError: labels ['toDROP'] not contained in axis" after executing the last line
It worked for me, but it came with this warning: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only. df1 = df1.drop("toDROP",1)

It looks like you were on the right path. Here is the one-liner you were looking for:

df.reset_index().T.drop_duplicates().T

But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:

original_index = df.index.names
df.reset_index().T.drop_duplicates().reset_index(original_index).T

Comments


Note that Gene Burinsky's answer (the accepted answer at the time of writing) keeps the first of each duplicated column. To keep the last:

df=df.loc[:, ~df.columns[::-1].duplicated()[::-1]]

1 Comment

You don't have to do [::-1]; check out the Pandas documentation on duplicated. Instead, just use df.columns.duplicated(keep='last') for the simpler solution df = df.loc[:,~df.columns.duplicated(keep='last')]

An update on @kalu's answer, which uses the latest pandas:

def find_duplicated_columns(df):
    dupes = []

    columns = df.columns

    for i in range(len(columns)):
        col1 = df.iloc[:, i]
        for j in range(i + 1, len(columns)):
            col2 = df.iloc[:, j]
            # break early if dtypes aren't the same (helps deal with
            # categorical dtypes)
            if col1.dtype is not col2.dtype:
                break
            # otherwise compare values
            if col1.equals(col2):
                dupes.append(columns[i])
                break

    return dupes

2 Comments

This is around 10 times slower than @kalu's answer
Slow is a feature of Python, not a bug. ;-)

Although @Gene Burinsky's answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df. This means that subsequent assignments like df['newcol'] = 1 generate a SettingWithCopy warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:

duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

1 Comment

In practice, I found using drop is a pain on larger data frames because it is slow. To avoid the warning, just append the command with .copy() like so df.loc[:,~df.columns.duplicated()].copy()

I ran into this problem where the one-liner provided by the first answer worked well. However, I had the extra complication where the second copy of the column had all of the data and the first copy did not.

The solution was to split the one data frame into two by toggling the negation operator on the duplicate mask. Once I had the two data frames, I ran a join statement using the lsuffix. This way, I could then reference and delete the column without the data.
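
For illustration, here is a minimal sketch of that split-and-join approach; the frame and column names below are hypothetical, not taken from the original post.

import pandas as pd

# hypothetical frame with two 'Time' columns; only the second copy holds data
df = pd.DataFrame([[99.9, None, 6.177],
                   [99.8, None, 17.689]],
                  columns=['N2', 'Time', 'Time'])

dup = df.columns.duplicated()

# split by toggling the negation operator
first_copies = df.loc[:, ~dup]   # first occurrence of each name (the empty 'Time')
later_copies = df.loc[:, dup]    # later occurrences (the 'Time' that holds data)

# join with lsuffix so the empty first copy gets a distinguishable name, then drop it
joined = first_copies.join(later_copies, lsuffix='_empty')
cleaned = joined.drop(columns=['Time_empty'])
print(cleaned)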

- E

March 2021 update

The subsequent post by @CircArgs may have provided a succinct one-liner to accomplish what I described here.

Comments


First step: read the first row only (i.e. all the column names) and remove the duplicate columns.

Second step: read the file again, using only those columns.

cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)

Comments


The snippet below identifies duplicate column names, which helps in reviewing what went wrong when the dataframe was originally built.

dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]

Comments


Just in case somebody is still looking for a way to check for duplicated values in columns of a Pandas DataFrame in Python, I came up with this solution:

def get_dup_columns(m):
    '''
    This will check every column in the data frame
    and verify whether you have duplicated columns.
    It can help when you are cleaning big data sets of 50+ columns.
    The result is a list of tuples showing which columns are duplicates,
    for example (column A, column C) means that column A is duplicated
    with column C.
    More info: https://wanatux.com
    '''
    headers_list = [x for x in m.columns]
    duplicate_col2 = []
    while len(headers_list) > 1:
        # compare the current first column against every remaining column
        for x in range(1, len(headers_list)):
            if m[headers_list[0]].equals(m[headers_list[x]]):
                duplicate_col2.append((headers_list[0], headers_list[x]))
        headers_list.pop(0)
    return duplicate_col2

And you can call the function like this:

duplicate_col = get_dup_columns(pd_excel)

It will show a result like the following:

 [('column a', 'column k'),
 ('column a', 'column r'),
 ('column h', 'column m'),
 ('column k', 'column r')]

Comments


I am not sure why Gene Burinsky's answer did not work for me; I was still getting the original dataframe with duplicated columns. My workaround was to force the selection over the underlying ndarray and build the dataframe back from it.

df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])

Comments


A simple column-wise comparison is an efficient way (in terms of memory and time) to check for duplicated columns by value. Here is an example:

import numpy as np
import pandas as pd
from itertools import combinations as combi

df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy()  # column 'a' is equal to column 'd'

# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]

# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
            
df = df.drop(columns=dupli_cols)

Comments


In case you want to check for duplicate columns, this code can be useful

columns_to_drop= []

for cname in sorted(list(df)):
    for cname2 in sorted(list(df))[::-1]:
        if df[cname].equals(df[cname2]) and cname!=cname2 and cname not in columns_to_drop:
            columns_to_drop.append(cname2)
            print(cname,cname2,'Are equal')

df = df.drop(columns_to_drop, axis=1)

Comments


If you're sure that the values in the columns are the same and you only want to filter based on the column name, you can try this:

# generate a list of unique column names
l_h = list(set(df.columns.tolist()))
# determine list length to be used for filtering the data frame
index = len(l_h)
# filter dataframe
df[l_h].iloc[:,-index:]

Comments


Maybe out of scope, but if we're allowed to use numpy, here's a simple solution.

import numpy as np
import pandas as pd

# indices of the first occurrence of each unique column name
colsToKeep = np.unique(df.columns, return_index=True)[1]
df = df.iloc[:, colsToKeep]

Comments


Fast and easy way to drop the duplicated columns by their values:

df = df.T.drop_duplicates().T

More info: the Pandas DataFrame drop_duplicates documentation.

Comments
