
Sorry for the dumb question, I am new to Python and pandas.

Imagine I've got a CSV file with some data in every row, for example:

data1, data2, data3, data4

There are no headings, just data, and I need to remove some rows from the file: if

row1.data3 == row2.data3 and row1.data4 == row2.data4

the entire row should be removed.

How can I achieve that?

I tried to use drop_duplicates, but without headings I don't know how to do it.

Cheers

  • Just to make sure, you're reassigning the dataframe after drop_duplicates, right? drop_duplicates does not work in place unless you ask it to. Headings wouldn't matter much here. If a row is a duplicate of another row and they have the same data type, drop_duplicates should remove it. Commented May 7, 2017 at 0:19
  • Show us the code you have so far. Commented May 7, 2017 at 0:20

1 Answer


Let's say you have a DataFrame read from a file without a header:

import pandas as pd

df = pd.read_csv("./try.csv", header=None)
df
# The top row shows the integer labels pandas assigns in place of the missing column names
    0   1   2
0   1   1   1
1   1   1   1
2   2   1   3
3   2   1   3
4   3   2   3
5   3   3   3

Then, you can drop_duplicates on subsets of columns:

df.drop_duplicates([0])
    0   1   2
0   1   1   1
2   2   1   3
4   3   2   3

or

df.drop_duplicates([0,1])

    0   1   2
0   1   1   1
2   2   1   3
4   3   2   3
5   3   3   3

Do not forget to assign the result to a new variable or pass inplace=True, since drop_duplicates returns a new DataFrame by default.
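Applied to the question's four-field layout, the same idea might look like the sketch below. The sample rows are made up for illustration, and it assumes data3 and data4 are the third and fourth fields, which get the integer labels 2 and 3 when the file is read with header=None. An in-memory string stands in for the real CSV file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the headerless CSV from the question.
csv_text = """a,b,x,y
c,d,x,y
e,f,x,z
g,h,w,z
"""

# header=None makes pandas label the columns 0, 1, 2, 3.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Keep the first row of each (data3, data4) pair and drop later repeats.
deduped = df.drop_duplicates(subset=[2, 3])

# keep=False instead drops every row whose (data3, data4) pair occurs more than once.
strict = df.drop_duplicates(subset=[2, 3], keep=False)

# Serialize without the integer labels or the index, matching the input layout.
out_text = deduped.to_csv(header=False, index=False)
print(out_text)
```

Whether keep="first" (the default) or keep=False is the right choice depends on whether one copy of each duplicated pair should survive. To write the result to disk, pass a file path as the first argument to to_csv.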


1 Comment

@user1583007 Why not accept the answer if it worked for you?
