
I'm trying to find duplicates of a string ID across files. Each of these IDs is unique and should be used in only one file. I want to verify that each ID is used only once, and the script should tell me which ID is duplicated and in which files.

This is an example of the set.csv file:

"Read-only",,"T","ID6776","3.1.1","Text","?"
"Read-only",,"T","ID4294","3.1.1.1","Text","?"
"Read-only","ID","T","ID7294","a )","Text","?"
"Read-only","ID","F","ID8641","b )","Text","?"
"Read-only","ID","F","ID8642","c )","Text","?"
"Read-only","ID","T","ID9209","d )","Text","?"
"Read-only","ID","F","ID3759","3.1.1.2","Text","?"
"Read-only",,"F","ID2156","3.1.1.3","

This is the very inefficient code I wrote:

for ID in $(grep 'ID","[TF]' set.csv | cut -c 23-28); do   # chars 23-28 hold the ID#### token
  for FILE1 in *.txt; do
    for FILE2 in *.txt; do
      # -nt orders each pair so a file is never compared with itself
      if [[ $FILE1 -nt $FILE2 ]] && grep -q "$ID" "$FILE1" && grep -q "$ID" "$FILE2"; then
        echo "$ID + $FILE1 + $FILE2"
      fi
    done
  done
done

Essentially I'm only interested in ID#s that are marked as "ID" in the CSV, which would be 7294, 8641, 8642, 9209, and 3759, but not the others. If File1 and File2 both contain the same ID from this set, then it should print out the duplicated ID and each file it is found in.
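For illustration (these file names are hypothetical), the intended output would look something like:

ID7294 + fileA.txt + fileB.txt
ID8641 + fileA.txt + fileC.txt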

There might be thousands of IDs and files, so my brute-force approach, which is quadratic in the number of files, isn't at all preferred. If Bash isn't up to it I'll move to sets, hashmaps and a logarithmic search algorithm in another language... but if the shell can do it I'd like to know how.

Thanks!

Edit: A bonus would be to find which IDs from set.csv aren't used at all. Pseudocode in another language might be: create a set of all the IDs in the csv, then make another set and add to it the IDs found in the files, then compare the sets. Can bash accomplish something like this?
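To make the idea concrete, here is a rough bash-4 sketch of that set logic (untested; the names csv_ids and seen_in are placeholders, it assumes set.csv keeps the layout above, and it still calls grep once per file per ID, so it's about the sets, not the speed):

#!/usr/bin/env bash
declare -A csv_ids seen_in

# Set 1: every keyed ID from set.csv (field 2 is "ID", field 3 is "T" or "F")
while IFS=, read -r _ key tf id _; do
  if [[ $key == '"ID"' && ( $tf == '"T"' || $tf == '"F"' ) ]]; then
    csv_ids[${id//\"/}]=1          # strip the quotes: "ID7294" -> ID7294
  fi
done < set.csv

# Set 2: the first file each ID was seen in; report any repeat sighting
for f in *.txt; do
  for id in "${!csv_ids[@]}"; do
    if grep -qF "$id" "$f"; then
      if [[ -n ${seen_in[$id]} ]]; then
        echo "$id is in ${seen_in[$id]} and $f"
      else
        seen_in[$id]=$f
      fi
    fi
  done
done

# Bonus: IDs from the csv that never turned up in any file
for id in "${!csv_ids[@]}"; do
  [[ -z ${seen_in[$id]} ]] && echo "$id is unused"
done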

  • Might be better suited for codereview.stackexchange.com since you have something that's working? Commented Jul 26, 2017 at 19:49
  • Ah, I may check that out, thanks! I'll probably wait a couple of days, then accept someone's answer here (whether I'm informed that an efficient version of this task is possible or not). Commented Jul 26, 2017 at 19:53

1 Answer


A linear option would be to use awk to store discovered identifiers with their corresponding filename, then report when an identifier is found again, assuming the *.txt files follow the same comma-separated layout as set.csv:

awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
  id=substr($4,4,4)                  # "ID7294" -> 7294
  if(ids[id]) {
    print id " is in " ids[id] " and " FILENAME;
  } else {
    ids[id]=FILENAME;                # first sighting: remember the file
  }
}' *.txt

The awk script reads every *.txt file, splitting fields on commas (-F,). If field 2 is "ID" and field 3 is "T" or "F", it extracts the numeric ID from field 4. If that ID has been seen before, it reports the file where it first appeared along with the current filename; otherwise, it records the ID against the current filename.
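For the bonus question (IDs from set.csv that never appear anywhere), a hypothetical two-pass variant of the same idea should work: load the keyed IDs from set.csv into one array, mark each one that turns up in a *.txt file, and print whatever was never marked. The array names wanted and seen are invented, and index() searches the whole line, so the .txt files don't have to share the CSV layout:

awk -F, '
  FILENAME == "set.csv" {                    # key file: collect the IDs
    if ($2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\""))
      wanted[substr($4,4,4)] = 1             # "ID7294" -> 7294
    next
  }
  {                                          # every *.txt line
    for (id in wanted)                       # substring test; anchor it if
      if (index($0, "ID" id)) seen[id] = 1   # one ID can prefix another
  }
  END {
    for (id in wanted)
      if (!(id in seen)) print id " is never used"
  }
' set.csv *.txt

The per-line loop over wanted costs O(IDs) per line, which is fine for a sketch; a production version might instead extract candidate ID tokens from each line and test array membership.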


5 Comments

I'm not clear where the condition on T or F came from, but apart from that, this is ideal territory for awk with its field splitting and comparison work.
I took the T/F logic from the grep example.
Ah…yes, I see. Strictly, the sample checks for 'field starts with double quote and either T or F', but that's probably a mistake. In which case, your work looks good. It certainly should be doable with a single pass over the data, which awk does, compared with the multiple passes with grep in the question.
Yes, the [TF] was just part of the regex I was using to verify it was the "ID" string denoting a relevant ID, and not part of the ID###. I guess it wasn't strictly necessary unless the "Text" contained the string "ID"; linear is a great improvement and I will take GibralterTop's suggestion to go to codereview.stackexchange.com for further efficiency-related questions. Accepting the awk answer.
This answer helps me go in the right direction, though the .csv is supposed to be a key with which to compare two .txt files that may have the IDs anywhere, not necessarily in the .csv's format.
