
Using Python and Pandas I want to find all columns with duplicate rows in a data frame and move them to another data frame. For example I might have:

cats, tigers, 3.5, 1, cars, 2, 5
cats, tigers, 3.5, 6, 7.2, 22.6, 5
cats, tigers, 3.5, test, 2.6, 99, 52.3

And I want cats, tigers, 3.5 in one data frame

cats, tigers, 3.5

and in another data frame I want

   1, cars, 2, 5
   6, 7.2, 22.6, 5
   test, 2.6, 99, 52.3

The code should check every column for repeated values and move only those columns in which the same value repeats in every row.

  1. In some cases none of the columns have repeats.
  2. Sometimes more than just the first three columns have repeats; the code should check all of the columns, because repeats can occur in any column.

How could I do this?


2 Answers


Method 1:
Use nunique with dropna=False; a column with exactly one unique value consists entirely of repeats:

m = df.nunique(dropna=False).eq(1)

df_dup = df.iloc[[0], m.values]

Out[121]:
      0       1    2
0  cats  tigers  3.5

df_notdup = df.loc[:, ~m]

Out[123]:
      3     4     5     6
0     1  cars   2.0   5.0
1     6   7.2  22.6   5.0
2  test   2.6  99.0  52.3
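Put together, a minimal runnable sketch of Method 1, assuming the question's sample data loaded with default integer column labels:

```python
import pandas as pd

# Sample frame built from the question's rows; columns get default labels 0..6
df = pd.DataFrame([
    ["cats", "tigers", 3.5, 1, "cars", 2, 5],
    ["cats", "tigers", 3.5, 6, 7.2, 22.6, 5],
    ["cats", "tigers", 3.5, "test", 2.6, 99, 52.3],
])

# A column is "all repeats" when it holds exactly one distinct value
m = df.nunique(dropna=False).eq(1)

df_dup = df.iloc[[0], m.values]   # one row of the constant columns
df_notdup = df.loc[:, ~m]         # all rows of the remaining columns

print(df_dup)      # columns 0-2, single row: cats, tigers, 3.5
print(df_notdup)   # columns 3-6, all three rows
```

Note that nunique counts distinct values, so a column stays in df_notdup even if only a single row differs (e.g. column 6 here, where 5 appears twice but 52.3 breaks the run).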

Method 2:
Use a list comprehension: on each column, call duplicated with keep=False and check all:

m = np.array([df[x].duplicated(keep=False).all() for x in df])

df_dup = df.loc[:, m]

Out[65]:
      0       1    2
0  cats  tigers  3.5
1  cats  tigers  3.5
2  cats  tigers  3.5

As @moys mentions, if you want only one row in df_dup, you may use drop_duplicates, or simply .head(1) or iloc:

df_dup = df.loc[:, m].head(1)

or

df_dup = df.iloc[[0], m]

Out[91]:
      0       1    2
0  cats  tigers  3.5

For not dup rows:

df_notdup = df.loc[:, ~m]

Out[75]:
      3     4     5     6
0     1  cars   2.0   5.0
1     6   7.2  22.6   5.0
2  test   2.6  99.0  52.3
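For completeness, a runnable sketch of Method 2 on the same sample data; keep=False marks every member of a duplicate group, so .all() is True only when every value in the column is repeated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ["cats", "tigers", 3.5, 1, "cars", 2, 5],
    ["cats", "tigers", 3.5, 6, 7.2, 22.6, 5],
    ["cats", "tigers", 3.5, "test", 2.6, 99, 52.3],
])

# True for a column only when every value belongs to a duplicate group
m = np.array([df[x].duplicated(keep=False).all() for x in df])

df_dup = df.iloc[[0], m]    # first row only, per the desired output
df_notdup = df.loc[:, ~m]
```

Unlike Method 1, this also accepts a column with more than one distinct value as long as each value repeats (e.g. 1, 1, 2, 2 over four rows), which may or may not match the intent.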

2 Comments

Good solution, up-voted. I think you may just want to add .drop_duplicates() to df_dup = df.loc[:, m] to match the OP's desired result.
@moys: I added it to the answer, though I prefer head(1) for brevity. I up-voted your solution too :)

You can use

df1 = pd.DataFrame(df.val.str.extract('([a-zA-Z ]+)', expand=False).str.strip().drop_duplicates()) #'val' is the column in which you have these values
print(df1)

Output

     val
0   ABCD

and

df2 = pd.DataFrame(df.val.str.extract('([0-9]+)', expand=False).str.strip().drop_duplicates()) #'val' is the column in which you have these values
print(df2)

Output

     val
0   1234
1   6578
2   4432
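This answer assumes the data sits in a single string column named val rather than the comma-separated columns shown in the question. A sketch with hypothetical data (not from the question) that reproduces the outputs above:

```python
import pandas as pd

# Hypothetical 'val' column of mixed letter/number strings
df = pd.DataFrame({"val": ["ABCD 1234", "ABCD 6578", "ABCD 4432"]})

# Pull out the letter part and the digit part separately, dropping repeats
df1 = pd.DataFrame(
    df.val.str.extract(r"([a-zA-Z ]+)", expand=False).str.strip().drop_duplicates()
)
df2 = pd.DataFrame(
    df.val.str.extract(r"([0-9]+)", expand=False).str.strip().drop_duplicates()
)
```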

2 Comments

Please provide data that represents your actual data; without it, it is difficult to tell what you have and what you want.
So, you want the data up to the first three commas as one dataframe and the rest as another?
