8

I have a list of names like so:

dog_bone
dog_collar
dragon
cool_dragon
lion
lion_trainer
dog

I need to extract the names that appear inside other names, like so:

dragon
lion
dog

I looked through the uniq man page, but it seems to compare entire lines, not substrings. Is there a way to do this with a bash function?

2
  • 1
    If dog, dog_bone, and dog_bones all appear in the file, what should be printed out? Commented May 7, 2014 at 17:43
  • @MarkPlotnick, then both dog and dog_bone would be printed out. Commented May 8, 2014 at 3:39

5 Answers

5
file=/the/file.txt
while IFS= read -r string; do
  grep -Fe "$string" < "$file" | grep -qvxFe "$string" &&
    printf '%s\n' "$string"
done < "$file"

That runs one read, two grep, and sometimes one printf command per line of the file, and each of those greps scans the whole file again, so it is not going to be very efficient.
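To see what the pipeline tests for a single name, it can be pulled out into a helper. This is only a sketch: the function name is_contained is made up for illustration, and the two grep calls are the same as in the loop above.

```shell
# is_contained NAME FILE (hypothetical helper name):
# succeed if NAME occurs inside some line of FILE other than NAME itself.
is_contained() {
  # First grep: keep the lines that contain NAME as a fixed string (-F).
  # Second grep: succeed quietly (-q) as soon as one of those lines is
  # not (-v) exactly (-x) NAME.
  grep -Fe "$1" < "$2" | grep -qvxFe "$1"
}
```

With the list from the question saved in a file, is_contained dog FILE succeeds (dog_bone contains dog), while is_contained dog_bone FILE fails because the only line containing dog_bone is dog_bone itself.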

You can do the whole thing in one awk invocation:

awk '{l[NR]=$0}
     END {
       for (i=1; i<=NR; i++)
         for (j=1; j<=NR; j++)
           if (j!=i && index(l[j], l[i])) {
             print l[i]
             break
           }
     }' < "$file"

though that means the whole file is stored in memory.
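Since the question asks for a function, the awk program can be wrapped in one. A minimal sketch, assuming the names are plain non-empty lines; contained_names is a hypothetical name, and the awk body is unchanged from the one above.

```shell
# contained_names [FILE...] (hypothetical name): print every line that
# occurs inside some other line. Reads standard input if no FILE is given.
contained_names() {
  awk '{ l[NR] = $0 }                      # remember every line
       END {
         for (i = 1; i <= NR; i++)         # candidate substring
           for (j = 1; j <= NR; j++)       # line to search in
             if (j != i && index(l[j], l[i])) {
               print l[i]
               break                       # report each line once
             }
       }' "$@"
}

# Usage:
#   contained_names /the/file.txt
```

On the list from the question this prints dragon, lion and dog, in file order. Note that the comparison is by line number, so a name that is listed twice is reported as well: each copy is found inside the other.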

3
  • Exactly what I need. Excellent stuff :) Commented May 8, 2014 at 3:41
  • @stephane It would be better if you explained the awk command a little bit. Commented May 8, 2014 at 4:19
  • 1
    @AvinashRaj Probably only what index does? "index(in, find) This searches the string in for the first occurrence of the string find, and returns the position in characters where that occurrence begins in the string in." Commented May 8, 2014 at 6:19
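
A quick way to try index from the shell. This is just a sketch; the helper name pos is invented here. Since awk -v processes backslash escapes in the assigned value, it is only meant for plain names like the ones in the question.

```shell
# pos STRING SUBSTRING (hypothetical helper name): print the 1-based
# position where SUBSTRING first occurs in STRING, or 0 if it does not occur.
pos() {
  awk -v s="$1" -v t="$2" 'BEGIN { print index(s, t) }'
}

pos cool_dragon dragon   # prints 6
pos dog_bone lion        # prints 0
```

Any non-zero result counts as true in the if test of the awk answer, which is why index(l[j], l[i]) can be used directly as the condition.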
5

bash

names=(
  dog_bone
  dog_collar
  dragon
  cool_dragon
  lion
  lion_trainer
  dog
)

declare -A contained                 # an associative array
for (( i=0; i < ${#names[@]}; i++ )); do 
    for (( j=0; j < ${#names[@]}; j++ )); do 
        if (( i != j )) && [[ ${names[i]} == *"${names[j]}"* ]]; then
            contained["${names[j]}"]=1
        fi 
    done
done
printf "%s\n" "${!contained[@]}"    # print the array keys

Output (bash does not guarantee the order of associative array keys):

dog
dragon
lion
3

Here's a Perl approach. This also needs to load the file into memory:

perl -le '@f=<>; foreach $l1 (@f){ 
                    chomp($l1); 
                    foreach $l2 (@f){ 
                        chomp($l2); 
                        next if $l1 eq $l2; 
                        $k{$l1}++ if index($l2, $l1) >= 0;  # literal substring test, not a regex
                    }
                } print join "\n", keys %k' file
3

A hacky way to do what you want. I'm not sure whether all your examples will include an underscore, but you could key off of that: use the lines of the file that have no underscore as a list of fixed strings for grep (via the -F switch), then use sort | uniq -d to list the strings that are matched more than once within the file.

Example

$ grep -oFf <(grep -v _ file.txt) file.txt |
    LC_ALL=C sort | LC_ALL=C uniq -d    
dog
dragon
lion

The above works as follows.

  1. <(grep -v _ file.txt) will produce a list of the contents of file.txt omitting the lines that contain an underscore (_).

    $ grep -v _ file.txt 
    dragon
    lion
    dog
    
  2. grep -oFf <(..) file.txt will use the results of #1 as a list of fixed strings (-F) that grep will find within the file file.txt, printing only the matched parts (-o).

    $ grep -oFf <(grep -v _ file.txt) file.txt
    dog
    dog
    dragon
    dragon
    lion
    lion
    dog
    
  3. The results of this command are then run through the sort & uniq -d commands which will list the entries that occur more than once amongst the results that grep -oFf has produced.
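
The last step can be tried on its own. A small sketch; repeated_lines is a hypothetical name, and the input is simply the output of step 2 typed in by hand.

```shell
# repeated_lines (hypothetical name): print one copy of every input line
# that occurs more than once. uniq -d only compares adjacent lines, so the
# input has to be sorted first.
repeated_lines() {
  LC_ALL=C sort | LC_ALL=C uniq -d
}

printf '%s\n' dog dog dragon dragon lion lion dog | repeated_lines
```

This prints dog, dragon and lion, one per line. A string that grep matched only once, that is, a name that matches nothing but its own line, is dropped at this point.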

NOTE: If you'd like to understand why you need LC_ALL=C for the sort and uniq calls, take a look at @Stephane's fine answer here: What does "LC_ALL=C" do?

6
  • That's wrong as it is equivalent to grep -v _ file.txt. Using LC_ALL=C sort | LC_ALL=C uniq -d instead of sort -u would work Commented May 7, 2014 at 19:23
  • @StephaneChazelas - thanks for the feedback. Can you explain what's wrong? I don't understand what your suggestion is going to change. Commented May 7, 2014 at 19:55
  • grep -of <(grep -v _ file.txt) file.txt will always return the lines that don't contain underscores because they match themselves (you're also missing some -F, but that's another issue). Commented May 7, 2014 at 22:01
  • @StephaneChazelas - OK I finally understand what LC_ALL=C is doing in all your examples now. I finally stumbled across your A to that Q, funny I'd never seen that one until today. Thanks! Commented May 8, 2014 at 2:00
  • Your answer assumes that one wants to consider whether foo is within foo_bar, but not whether a_b is within a_b_c. It also won't work if there's a foo, and foobar. Commented May 8, 2014 at 6:30
3

Here is a bash version 4.x solution:

#!/bin/bash

declare -A output
readarray -t input < '/path/to/file'   # -t strips the trailing newline from each element

for i in "${input[@]}"; do
  for j in "${input[@]}"; do
    [[ $j = "$i" ]] && continue
    if [ -z "${i##*"$j"*}" ]; then
      if [[ ! ${output[$j]} ]]; then
        printf "%s\n" "$j"
        output[$j]=1
      fi
    fi
  done
done
3
  • I have added your solution over here. Please change if needed. :) Commented May 7, 2014 at 15:50
  • @Ramesh: No, this question is different with yours. Commented May 7, 2014 at 15:51
  • oops. Sorry, I misunderstood the question initially :) Commented May 7, 2014 at 15:53
