I will be combining a number of CSV files. What I am trying to do is to:
1) Remove duplicate rows from the file; however, I need to check multiple columns as the criteria for what constitutes a duplicate. How do I do that?
2) It would be nice to then create a second output file showing what was removed, in case something was removed that should not have been.
3) Create a list of items as an input file to use as a filter: if a row contains one of these words in a particular column, then remove the entire row.
If someone could help me with the commands to do this, that would be great! Please let me know if I need to clarify.
Here is a sample of what the data looks like (an example, as suggested). I have a CSV file like this:
column1 column2
john kerry
adam stephenson
ashley hudson
john kerry
etc.
For question 1, I want to remove the duplicates from this file to get:
column1 column2
john kerry
adam stephenson
ashley hudson
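One way to do questions 1 and 2 together is a single awk pass that keeps the first occurrence of each key and diverts later duplicates to a review file. This is a sketch, assuming the columns are whitespace-separated as in the sample (use awk -F',' for truly comma-separated data), and the filenames combined.csv, deduped.csv, and removed.csv are just placeholders:

```shell
# Sample data as in the question (space-separated columns).
printf '%s\n' 'column1 column2' 'john kerry' 'adam stephenson' \
              'ashley hudson' 'john kerry' > combined.csv

# Keep the first occurrence of each (column1, column2) pair; send
# later duplicates to removed.csv so they can be reviewed (question 2).
awk 'NR == 1 { print; next }                    # always keep the header
     seen[$1, $2]++ { print > "removed.csv"; next }
     { print }' combined.csv > deduped.csv

cat deduped.csv   # header plus the three unique rows
cat removed.csv   # the second "john kerry" row
```

Adding or removing columns from the seen[$1, $2] key changes which columns count toward a duplicate, which covers the "multiple columns" requirement without sorting the file first.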
For question 3, I want to take the second list (that is, the output of step 1) and scrub it further. I want a file like input.txt that contains:
adam
Then, the final output will be:
column1 column2
john kerry
ashley hudson
So, the input.txt file in the example contains the word adam (this way I can make a long list of words to check in input.txt). For #3, I need a code snippet that will check column 1 of every line of the CSV against all the words in the input file, then remove any matching rows from the CSV.
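For #3, awk can read the word list first and then filter the CSV in a second pass. Again a sketch under the same assumptions (whitespace-separated columns; deduped.csv, input.txt, final.csv, and scrubbed-out.csv are hypothetical names):

```shell
# Deduplicated data from step 1 and the word list.
printf '%s\n' 'column1 column2' 'john kerry' 'adam stephenson' \
              'ashley hudson' > deduped.csv
printf 'adam\n' > input.txt

# First file (NR == FNR) loads the words into an array; for the second
# file, rows whose column 1 is in the list go to scrubbed-out.csv for
# review, and everything else goes to final.csv.
awk 'NR == FNR { bad[$1] = 1; next }    # load the word list
     FNR == 1  { print; next }          # keep the header
     $1 in bad { print > "scrubbed-out.csv"; next }
     { print }' input.txt deduped.csv > final.csv

cat final.csv   # header, "john kerry", "ashley hudson"
```

Because the words are held in an array, input.txt can grow to any length without changing the command. To match a different column, change $1 in the `$1 in bad` test.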
(Note: plain uniq does not work here, since it only removes adjacent duplicate lines and compares whole lines rather than selected columns.)