
I'm trying to find duplicates of a string ID across files. Each of these IDs is unique and should be used in only one file. I want to verify that each ID is used only once, and the script should tell me which ID is duplicated and in which files.

This is an example of the set.csv file:

"Read-only",,"T","ID6776","3.1.1","Text","?"
"Read-only",,"T","ID4294","3.1.1.1","Text","?"
"Read-only","ID","T","ID7294","a )","Text","?"
"Read-only","ID","F","ID8641","b )","Text","?"
"Read-only","ID","F","ID8642","c )","Text","?"
"Read-only","ID","T","ID9209","d )","Text","?"
"Read-only","ID","F","ID3759","3.1.1.2","Text","?"
"Read-only",,"F","ID2156","3.1.1.3","

This is the very inefficient code I wrote:

for ID in $(grep 'ID","[TF]' set.csv | cut -c 23-28); do   # chars 23-28 hold the ID#### token
  for FILE1 in *.txt; do
    for FILE2 in *.txt; do
      # -nt orders each pair so a file is never compared with itself
      if [[ $FILE1 -nt $FILE2 ]] && grep -q "$ID" "$FILE1" && grep -q "$ID" "$FILE2"; then
        echo "$ID + $FILE1 + $FILE2"
      fi
    done
  done
done

Essentially I'm only interested in ID#s that are marked as "ID" in the CSV, which would be 7294, 8641, 8642, 9209, and 3759, but not the others. If File1 and File2 both contain the same ID from this set, then it should print out the duplicated ID and each file it is found in.
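For illustration (these file names are hypothetical), the intended output would look something like:

ID7294 + fileA.txt + fileB.txt
ID8641 + fileA.txt + fileC.txt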

There might be thousands of IDs and files, so my brute-force approach, which is quadratic in the number of files, isn't at all preferred. If Bash isn't up to it I'll move to sets, hashmaps and a logarithmic search algorithm in another language... but if the shell can do it I'd like to know how.

Thanks!

Edit: A bonus would be to find which IDs from set.csv aren't used at all. Pseudocode in another language might be: create a set of all the IDs in the csv, then make another set and add to it the IDs found in the files, then compare the sets. Can bash accomplish something like this?
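To make the idea concrete, here is a rough bash-4 sketch of that set logic (untested; the names csv_ids and seen_in are placeholders, it assumes set.csv keeps the layout above, and it still calls grep once per file per ID, so it's about the sets, not the speed):

#!/usr/bin/env bash
declare -A csv_ids seen_in

# Set 1: every keyed ID from set.csv (field 2 is "ID", field 3 is "T" or "F")
while IFS=, read -r _ key tf id _; do
  if [[ $key == '"ID"' && ( $tf == '"T"' || $tf == '"F"' ) ]]; then
    csv_ids[${id//\"/}]=1          # strip the quotes: "ID7294" -> ID7294
  fi
done < set.csv

# Set 2: the first file each ID was seen in; report any repeat sighting
for f in *.txt; do
  for id in "${!csv_ids[@]}"; do
    if grep -qF "$id" "$f"; then
      if [[ -n ${seen_in[$id]} ]]; then
        echo "$id is in ${seen_in[$id]} and $f"
      else
        seen_in[$id]=$f
      fi
    fi
  done
done

# Bonus: IDs from the csv that never turned up in any file
for id in "${!csv_ids[@]}"; do
  [[ -z ${seen_in[$id]} ]] && echo "$id is unused"
done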

  • Might be better suited for codereview.stackexchange.com since you have something that's working? Commented Jul 26, 2017 at 19:49
  • Ah, I may check that out, thanks! I'll probably wait a couple of days, then accept someone's answer here (whether I'm informed that an efficient version of this task is possible or not). Commented Jul 26, 2017 at 19:53

1 Answer


A linear option would be to use awk to store discovered identifiers with their corresponding filename, then report when an identifier is found again, assuming the *.txt files follow the same comma-separated layout as set.csv:

awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
  id=substr($4,4,4)                  # "ID7294" -> 7294
  if(ids[id]) {
    print id " is in " ids[id] " and " FILENAME;
  } else {
    ids[id]=FILENAME;                # first sighting: remember the file
  }
}' *.txt

The awk script reads every *.txt file, splitting fields on commas (-F,). If field 2 is "ID" and field 3 is "T" or "F", it extracts the numeric ID from field 4. If that ID has been seen before, it reports the file where it first appeared along with the current filename; otherwise, it records the ID against the current filename.
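For the bonus question (IDs from set.csv that never appear anywhere), a hypothetical two-pass variant of the same idea should work: load the keyed IDs from set.csv into one array, mark each one that turns up in a *.txt file, and print whatever was never marked. The array names wanted and seen are invented, and index() searches the whole line, so the .txt files don't have to share the CSV layout:

awk -F, '
  FILENAME == "set.csv" {                    # key file: collect the IDs
    if ($2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\""))
      wanted[substr($4,4,4)] = 1             # "ID7294" -> 7294
    next
  }
  {                                          # every *.txt line
    for (id in wanted)                       # substring test; anchor it if
      if (index($0, "ID" id)) seen[id] = 1   # one ID can prefix another
  }
  END {
    for (id in wanted)
      if (!(id in seen)) print id " is never used"
  }
' set.csv *.txt

The per-line loop over wanted costs O(IDs) per line, which is fine for a sketch; a production version might instead extract candidate ID tokens from each line and test array membership.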


5 Comments

I'm not clear where the condition on T or F came from, but apart from that, this is ideal territory for awk with its field splitting and comparison work.
I took the T/F logic from the grep example.
Ah…yes, I see. Strictly, the sample checks for 'field starts with double quote and either T or F', but that's probably a mistake. In which case, your work looks good. It certainly should be doable with a single pass over the data, which awk does, compared with the multiple passes with grep in the question.
Yes, the [TF] was just part of the regex I was using to verify it was the "ID" string denoting a relevant ID, and not part of the ID###. I guess it wasn't strictly necessary unless the "Text" contained the string "ID"; linear is a great improvement and I will take GibralterTop's suggestion to go to codereview.stackexchange.com for further efficiency-related questions. Accepting the awk answer.
This answer helps me go in the right direction, though the .csv is supposed to be a key with which to compare two .txt files that may have the IDs anywhere, not necessarily in the .csv's format.
