
Using Python and Pandas I want to find all columns with duplicate rows in a data frame and move them to another data frame. For example I might have:

cats, tigers, 3.5, 1, cars, 2, 5
cats, tigers, 3.5, 6, 7.2, 22.6, 5
cats, tigers, 3.5, test, 2.6, 99, 52.3

And I want cats, tigers, 3.5 in one data frame

cats, tigers, 3.5

and in another data frame I want

   1, cars, 2, 5
   6, 7.2, 22.6, 5
   test, 2.6, 99, 52.3

The code should check every column for repeated values and move only those columns in which the same value repeats in every row.

  1. In some cases none of the columns have repeats.
  2. Sometimes more than just the first three columns have repeats; the code should check all of the columns, because repeats can occur in any column.

How could I do this?


2 Answers


Method 1:
Use nunique with dropna=False; a column with exactly one unique value consists entirely of repeats:

m = df.nunique(dropna=False).eq(1)

df_dup = df.iloc[[0], m.values]

Out[121]:
      0       1    2
0  cats  tigers  3.5

df_notdup = df.loc[:, ~m]

Out[123]:
      3     4     5     6
0     1  cars   2.0   5.0
1     6   7.2  22.6   5.0
2  test   2.6  99.0  52.3
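Put together, a minimal runnable sketch of Method 1, assuming the question's sample data loaded with default integer column labels:

```python
import pandas as pd

# Sample frame built from the question's rows; columns get default labels 0..6
df = pd.DataFrame([
    ["cats", "tigers", 3.5, 1, "cars", 2, 5],
    ["cats", "tigers", 3.5, 6, 7.2, 22.6, 5],
    ["cats", "tigers", 3.5, "test", 2.6, 99, 52.3],
])

# A column is "all repeats" when it holds exactly one distinct value
m = df.nunique(dropna=False).eq(1)

df_dup = df.iloc[[0], m.values]   # one row of the constant columns
df_notdup = df.loc[:, ~m]         # all rows of the remaining columns

print(df_dup)      # columns 0-2, single row: cats, tigers, 3.5
print(df_notdup)   # columns 3-6, all three rows
```

Note that nunique counts distinct values, so a column stays in df_notdup even if only a single row differs (e.g. column 6 here, where 5 appears twice but 52.3 breaks the run).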

Method 2:
Use a list comprehension: on each column, call duplicated with keep=False and check all:

m = np.array([df[x].duplicated(keep=False).all() for x in df])

df_dup = df.loc[:, m]

Out[65]:
      0       1    2
0  cats  tigers  3.5
1  cats  tigers  3.5
2  cats  tigers  3.5

As @moys mentions, if you want only one row in df_dup, you may use drop_duplicates, or simply .head(1) or iloc:

df_dup = df.loc[:, m].head(1)

or

df_dup = df.iloc[[0], m]

Out[91]:
      0       1    2
0  cats  tigers  3.5

For not dup rows:

df_notdup = df.loc[:, ~m]

Out[75]:
      3     4     5     6
0     1  cars   2.0   5.0
1     6   7.2  22.6   5.0
2  test   2.6  99.0  52.3
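For completeness, a runnable sketch of Method 2 on the same sample data; keep=False marks every member of a duplicate group, so .all() is True only when every value in the column is repeated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ["cats", "tigers", 3.5, 1, "cars", 2, 5],
    ["cats", "tigers", 3.5, 6, 7.2, 22.6, 5],
    ["cats", "tigers", 3.5, "test", 2.6, 99, 52.3],
])

# True for a column only when every value belongs to a duplicate group
m = np.array([df[x].duplicated(keep=False).all() for x in df])

df_dup = df.iloc[[0], m]    # first row only, per the desired output
df_notdup = df.loc[:, ~m]
```

Unlike Method 1, this also accepts a column with more than one distinct value as long as each value repeats (e.g. 1, 1, 2, 2 over four rows), which may or may not match the intent.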

2 Comments

Good solution, up-voted. I think you may just want to add .drop_duplicates() to df_dup = df.loc[:, m] to match the OP's desired result.
@moys: I added it to the answer, though I prefer head(1) for brevity. I up-voted your solution too :)

You can use

df1 = pd.DataFrame(df.val.str.extract('([a-zA-Z ]+)', expand=False).str.strip().drop_duplicates()) #'val' is the column in which you have these values
print(df1)

Output

     val
0   ABCD

and

df2 = pd.DataFrame(df.val.str.extract('([0-9]+)', expand=False).str.strip().drop_duplicates()) #'val' is the column in which you have these values
print(df2)

Output

     val
0   1234
1   6578
2   4432
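This answer assumes the data sits in a single string column named val rather than the comma-separated columns shown in the question. A sketch with hypothetical data (not from the question) that reproduces the outputs above:

```python
import pandas as pd

# Hypothetical 'val' column of mixed letter/number strings
df = pd.DataFrame({"val": ["ABCD 1234", "ABCD 6578", "ABCD 4432"]})

# Pull out the letter part and the digit part separately, dropping repeats
df1 = pd.DataFrame(
    df.val.str.extract(r"([a-zA-Z ]+)", expand=False).str.strip().drop_duplicates()
)
df2 = pd.DataFrame(
    df.val.str.extract(r"([0-9]+)", expand=False).str.strip().drop_duplicates()
)
```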

2 Comments

Please provide data that represents your actual data; without it, it is difficult to tell what you have and what you want.
So, you want the data up to the first three commas as one dataframe and the rest as another?
