List out strings which are substrings of other strings in the list

Question

I have a list of names like so:

dog_bone
dog_collar
dragon
cool_dragon
lion
lion_trainer
dog

I need to extract out names that appear in other names like so:

dragon
lion
dog

I looked through the uniq man page, but it seems to compare entire lines, not strings. Is there a way to do this with a bash function?

If dog, dog_bone, and dog_bones all appear in the file, what should be printed out?
– Mark Plotnick, Commented May 7, 2014 at 17:43

@MarkPlotnick, then both dog and dog_bone would be printed out.
– Question Overflow, Commented May 8, 2014 at 3:39

Stéphane Chazelas · Accepted Answer · 2014-05-07 14:26:50Z

5

file=/the/file.txt
while IFS= read -r string; do
  grep -Fe "$string" < "$file" | grep -qvxFe "$string" &&
    printf '%s\n' "$string"
done < "$file"

That runs one read, two grep invocations and sometimes one printf per line of the file, so it is not going to be very efficient.
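
To see what the two grep stages do, it may help to run them for a single string. This is only a minimal sketch: the helper name is_contained is made up for illustration, and the temporary file simply holds the sample names from the question.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if "$1" occurs inside some *other* line of file "$2".
is_contained() {
  # Stage 1: keep every line that contains the string (-F: fixed string, not a regex).
  # Stage 2: -v -x discards lines that are exactly the string itself;
  #          -q prints nothing and only reports through the exit status.
  grep -Fe "$1" < "$2" | grep -qvxFe "$1"
}

tmp=$(mktemp)
printf '%s\n' dog_bone dog_collar dragon cool_dragon lion lion_trainer dog > "$tmp"

is_contained dog "$tmp" && echo "dog is contained in another name"
is_contained dog_bone "$tmp" || echo "dog_bone is not contained in another name"
rm -f "$tmp"
```

If anything survives the second grep, some line other than the string itself contains it, which is exactly the condition the loop prints on.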

You can do the whole thing in one awk invocation:

awk '{l[NR]=$0}
     END {
       for (i=1; i<=NR; i++)
         for (j=1; j<=NR; j++)
           if (j!=i && index(l[j], l[i])) {
             print l[i]
             break
           }
     }' < "$file"

though that means the whole file is stored in memory.
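
The script stores every line in the array l, then in the END block compares each line against every other line with index(). A small sketch of what index() returns on its own; the wrapper name pos is hypothetical and only there for the demonstration:

```shell
#!/bin/bash
# index(haystack, needle) returns the 1-based position of needle in haystack,
# or 0 when needle does not occur, so it can be used directly as a condition.
pos() {
  # Only a BEGIN block is present, so awk reads no input and the
  # two arguments are available as ARGV[1] and ARGV[2].
  awk 'BEGIN { print index(ARGV[1], ARGV[2]) }' "$1" "$2"
}

pos cool_dragon dragon   # prints 6
pos lion dragon          # prints 0
```

In the answer's script, a non-zero result for any j other than i means line i is a substring of line j, so it is printed once and the inner loop stops.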

edited May 7, 2014 at 14:26

answered May 7, 2014 at 14:10

Stéphane Chazelas


Exactly what I need. Excellent stuff :)
– Question Overflow, Commented May 8, 2014 at 3:41

@stephane It would be better if you explained the awk command a little bit.
– Avinash Raj, Commented May 8, 2014 at 4:19

@AvinashRaj Probably only what index does? "index(in, find) This searches the string in for the first occurrence of the string find, and returns the position in characters where that occurrence begins in the string in."
– Bernhard, Commented May 8, 2014 at 6:19

glenn jackman · Accepted Answer · 2014-05-07 15:01:17Z

5

bash

names=(
  dog_bone
  dog_collar
  dragon
  cool_dragon
  lion
  lion_trainer
  dog
)

declare -A contained                 # an associative array
for (( i=0; i < ${#names[@]}; i++ )); do 
    for (( j=0; j < ${#names[@]}; j++ )); do 
        if (( i != j )) && [[ ${names[i]} == *"${names[j]}"* ]]; then
            contained["${names[j]}"]=1
        fi 
    done
done
printf "%s\n" "${!contained[@]}"    # print the array keys

dog
dragon
lion
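
If the names live in a file rather than in the script, the same nested loop can be fed with mapfile. A minimal sketch, assuming bash 4 or later; the function name contained_names is made up, and skipping empty lines is an addition of this sketch rather than part of the code above:

```shell
#!/bin/bash
# Hypothetical wrapper around the same nested loop, reading names from file "$1".
# Usage: contained_names /the/file.txt
contained_names() {
  local -a names
  local -A contained               # associative array, needs bash 4+
  local i j
  mapfile -t names < "$1"          # -t strips the trailing newline of each line
  for i in "${!names[@]}"; do
    for j in "${!names[@]}"; do
      [[ -n ${names[j]} ]] || continue   # an empty string cannot be an array key
      if (( i != j )) && [[ ${names[i]} == *"${names[j]}"* ]]; then
        contained["${names[j]}"]=1
      fi
    done
  done
  # Keys of an associative array come back in no particular order, so sort them.
  if (( ${#contained[@]} > 0 )); then
    printf '%s\n' "${!contained[@]}" | sort
  fi
}
```

Because the keys are unordered, the output order of the original snippet is not guaranteed either; piping through sort makes it stable.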

answered May 7, 2014 at 15:01

glenn jackman


terdon · Accepted Answer · 2014-05-07 14:48:10Z

3

Here's a Perl approach. This also needs to load the file into memory:

perl -le '@f=<>; chomp(@f);
          foreach $l1 (@f){
              foreach $l2 (@f){
                  next if $l1 eq $l2;
                  # index() does a literal search, so regex metacharacters in names are harmless
                  $k{$l1}++ if index($l2, $l1) >= 0;
              }
          } print join "\n", keys %k' file

answered May 7, 2014 at 14:48

terdon♦


Community · Accepted Answer · 2017-04-13 12:36:37Z

3

A hacky way to do what you want. I'm not sure whether all your examples will include an underscore, but you could key off that and use sort | uniq -d to list the substrings that appear more than once in the file, using the file itself as a list of fixed strings for grep via the -F switch.

Example

$ grep -oFf <(grep -v _ file.txt) file.txt |
    LC_ALL=C sort | LC_ALL=C uniq -d    
dog
dragon
lion

The above works as follows.

1. <(grep -v _ file.txt) produces the contents of file.txt, omitting the lines that contain an underscore (_).
```
$ grep -v _ file.txt
dragon
lion
dog
```
2. grep -oFf <(..) file.txt uses the results of #1 as a list of fixed strings that grep looks for within file.txt.
```
$ grep -oFf <(grep -v _ file.txt) file.txt
dog
dog
dragon
dragon
lion
lion
dog
```
3. The results of this command are then run through sort and uniq -d, which list the entries that occur more than once in the output of grep -oFf.

NOTE: If you'd like to understand why you need to enlist the use of LC_ALL=C when performing the sort and uniq calls then take a look at @Stephane's fine answer to this here: What does "LC_ALL=C" do?.

edited Apr 13, 2017 at 12:36

CommunityBot

1

answered May 7, 2014 at 18:10

slm♦


That's wrong as it is equivalent to grep -v _ file.txt. Using LC_ALL=C sort | LC_ALL=C uniq -d instead of sort -u would work.
– Stéphane Chazelas, Commented May 7, 2014 at 19:23

@StephaneChazelas - thanks for the feedback. Can you explain what's wrong? I don't understand what your suggestion is going to change.
– slm ♦, Commented May 7, 2014 at 19:55

grep -of <(grep -v _ file.txt) file.txt will always return the lines that don't contain underscores because they match themselves (you're also missing some -F, but that's another issue).
– Stéphane Chazelas, Commented May 7, 2014 at 22:01

@StephaneChazelas - OK I finally understand what LC_ALL=C is doing in all your examples now. I finally stumbled across your A to that Q, funny I'd never seen that one until today. Thanks!
– slm ♦, Commented May 8, 2014 at 2:00

Your answer assumes that one wants to consider whether foo is within foo_bar, but not whether a_b is within a_b_c. It also won't work if there's a foo, and foobar.
– Stéphane Chazelas, Commented May 8, 2014 at 6:30

cuonglm · Accepted Answer · 2014-12-16 10:49:01Z

3

Here is a bash version 4.x solution:

#!/bin/bash

declare -A output
readarray -t input < '/path/to/file'   # -t strips each line's trailing newline

for i in "${input[@]}"; do
  for j in "${input[@]}"; do
    [[ $j = "$i" ]] && continue
    if [ -z "${i##*"$j"*}" ]; then
      if [[ ! ${output[$j]} ]]; then
        printf "%s\n" "$j"
        output[$j]=1
      fi
    fi
  done
done
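
The substring test in the inner loop is done with a parameter expansion instead of [[ ... == *...* ]]. A minimal sketch of that test in isolation; the function name contains is hypothetical, and a non-empty first argument is assumed:

```shell
#!/bin/bash
# ${var##pattern} removes the longest prefix of $var that matches pattern.
# With the pattern *"$needle"* a successful match swallows the whole string,
# so an empty result means $needle occurs somewhere in $var.
# Assumes a non-empty haystack: an empty one also expands to nothing.
contains() {
  local haystack=$1 needle=$2
  [ -z "${haystack##*"$needle"*}" ]
}

contains dog_bone dog && echo "dog_bone contains dog"
contains dog_bone lion || echo "dog_bone does not contain lion"
```

Quoting "$needle" inside the pattern keeps characters such as * or ? in a name from being treated as glob operators.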

edited Dec 16, 2014 at 10:49

answered May 7, 2014 at 14:57

cuonglm


I have added your solution over here. Please change if needed. :)
– Ramesh, Commented May 7, 2014 at 15:50

@Ramesh: No, this question is different from yours.
– cuonglm, Commented May 7, 2014 at 15:51

Oops. Sorry, I misunderstood the question initially :)
– Ramesh, Commented May 7, 2014 at 15:53
