185

I am trying to use grep to test whether any of the strings in one vector are present in another vector, and to output the values that are present (the matching patterns).

I have a data frame like this:

FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6

I have a vector of string patterns to look for in the "Letter" column, for example: c("A1", "A9", "A6").

I would like to check whether any of the strings in the pattern vector are present in the "Letter" column. If they are, I would like the unique matching values as output.

The problem is, I don't know how to use grep with multiple patterns. I tried:

matches <- unique (
    grep("A1| A9 | A6", myfile$Letter, value=TRUE, fixed=TRUE)
)

But it gives me 0 matches, which is wrong. Any suggestions?

2
  • 3
    You can't use fixed=TRUE because your pattern is a true regular expression. Commented Oct 5, 2011 at 15:27
  • 6
    Using match or %in% or even == is the only correct way to compare exact matches. regex is very dangerous for such a task and can lead to unexpected results. Commented Sep 12, 2016 at 5:34
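
For exact matches, the %in% approach from the comment above can be sketched as follows. This is a minimal illustration, assuming the question's data lives in a data frame called myfile (rebuilt here so the snippet runs on its own):

```r
# Sample data from the question, rebuilt here for illustration
myfile <- data.frame(
  FirstName = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  Letter    = c("A1", "A6", "A7", "A1", "A9", "A6"),
  stringsAsFactors = FALSE
)
toMatch <- c("A1", "A9", "A6")

# Exact comparison: keep only values that equal one of the patterns
matches <- unique(myfile$Letter[myfile$Letter %in% toMatch])
matches
#> [1] "A1" "A6" "A9"

# Equivalent set operation; the result follows the order of toMatch
intersect(toMatch, myfile$Letter)
#> [1] "A1" "A9" "A6"
```

Because no regular expression is involved, a value such as "A10" would not be picked up by the pattern "A1".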

11 Answers 11

346

In addition to @Marek's comment about not including fixed=TRUE, you also need to remove the spaces from your regular expression. It should be "A1|A9|A6".

You also mention that there are lots of patterns. Assuming that they are in a vector

toMatch <- c("A1", "A9", "A6")

Then you can create your regular expression directly using paste and collapse = "|".

matches <- unique (grep(paste(toMatch,collapse="|"), 
                        myfile$Letter, value=TRUE))

6 Comments

Any way to do this when your list of strings includes regex operators as punctuation?
@user1987097 It should work the same way, with or without any other regex operators. Did you have a specific example this didn't work for?
@user1987097 use 2 backslashes before a dot or bracket. The first backslash escapes the second, which is the one that disables the operator.
Using regex for exact matches seems dangerous to me and can have unexpected results. Why not just toMatch %in% myfile$Letter ?
@user4050 No specific reason. The version in the question had it and I probably just carried it through without thinking about whether it was necessary.
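
If the strings to match contain regex operators and should be treated literally, one alternative to escaping them by hand is to run grepl with fixed = TRUE once per string and combine the results. This is only a sketch: grepl_any_fixed is a hypothetical helper name, not something from the thread or from base R, and the sample vector is made up.

```r
# Hypothetical helper (not from the thread): TRUE where x contains any of
# the literal strings in `patterns`, with no regex interpretation.
grepl_any_fixed <- function(patterns, x) {
  Reduce(`|`, lapply(patterns, grepl, x = x, fixed = TRUE))
}

x <- c("A1", "A.1", "AB1", "A9")
patterns <- c("A.1", "A9")

# As a regex, "." matches any character, so "AB1" is matched too
grep(paste(patterns, collapse = "|"), x, value = TRUE)
#> [1] "A.1" "AB1" "A9"

# Literal matching keeps only the strings that contain "A.1" or "A9"
x[grepl_any_fixed(patterns, x)]
#> [1] "A.1" "A9"
```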
47

Good answers, however don't forget about filter() from dplyr:

patterns <- c("A1", "A9", "A6")
>your_df
  FirstName Letter
1      Alex     A1
2      Alex     A6
3      Alex     A7
4       Bob     A1
5     Chris     A9
6     Chris     A6

result <- filter(your_df, grepl(paste(patterns, collapse="|"), Letter))

>result
  FirstName Letter
1      Alex     A1
2      Alex     A6
3       Bob     A1
4     Chris     A9
5     Chris     A6

3 Comments

I think that grepl works with one pattern at a time (it needs a vector of length 1). We have 3 patterns (a vector of length 3), so we can combine them into one using a separator that grepl understands, |. Try your luck with others :)
oh I get it now. So it's a compressed way to output something like A1 | A2, so if one wanted all conditions then the collapse would be with an & sign, cool thanks.
Hi, using )|( to separate patterns might make this more robust: paste0("(", paste(patterns, collapse=")|("),")"). Unfortunately it also becomes slightly less elegant. This results in the pattern (A1)|(A9)|(A6).
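
One caveat on the & idea: regular expressions have no "and" operator, so collapsing with "&" searches for a literal ampersand. A minimal sketch of requiring every pattern to be present, assuming one grepl call per pattern combined with logical AND (the sample vector is made up for illustration):

```r
patterns <- c("A1", "A9")
x <- c("A1", "A1,A9", "A9,A1", "A6")

# Collapsing with "&" looks for the literal text "A1&A9"
grepl(paste(patterns, collapse = "&"), x)
#> [1] FALSE FALSE FALSE FALSE

# One grepl() per pattern, combined element-wise with logical AND
all_match <- Reduce(`&`, lapply(patterns, grepl, x = x))
x[all_match]
#> [1] "A1,A9" "A9,A1"
```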
43

This should work:

grep(pattern = 'A1|A9|A6', x = myfile$Letter)

Or even more simply:

library(data.table)
myfile$Letter %like% 'A1|A9|A6'

2 Comments

%like% isn't in base R, so you should mention what package(s) are needed to use it.
For others looking at this answer, %like% is part of the data.table package. Also similar in data.table are like(...), %ilike%, and %flike%.
10

Based on Brian Diggs's post, here are two helpful functions for filtering lists:

#Returns all items in a list that are not contained in toMatch
#toMatch can be a single item or a list of items
exclude <- function (theList, toMatch){
  return(setdiff(theList,include(theList,toMatch)))
}

#Returns all items in a list that ARE contained in toMatch
#toMatch can be a single item or a list of items
include <- function (theList, toMatch){
  matches <- unique (grep(paste(toMatch,collapse="|"), 
                          theList, value=TRUE))
  return(matches)
}


6

Have you tried the match() or charmatch() functions?

Example use:

match(c("A1", "A9", "A6"), myfile$Letter)

1 Comment

One thing to note with match is that it is not using patterns, it is expecting an exact match.
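
To make the exact-match behaviour concrete, here is a small sketch using the question's Letter values; the vector name letters_col is made up for illustration:

```r
# The question's Letter column, rebuilt as a plain vector
letters_col <- c("A1", "A6", "A7", "A1", "A9", "A6")

# match() returns the position of the first exact match, or NA if none
match(c("A1", "A9", "A6", "A8"), letters_col)
#> [1]  1  5  2 NA

# No partial matching: "A" does not match "A1"
match("A", letters_col)
#> [1] NA

# charmatch() allows partial matches, but returns 0 when they are ambiguous
charmatch("A", letters_col)
#> [1] 0
```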
5

To add to Brian Diggs's answer:

Another way, using grepl, returns the matching rows of the data frame rather than just the values.

toMatch <- c("A1", "A9", "A6")

matches <- myfile[grepl(paste(toMatch, collapse="|"), myfile$Letter), ]

matches

Letter Firstname
1     A1      Alex 
2     A6      Alex 
4     A1       Bob 
5     A9     Chris 
6     A6     Chris

Maybe a bit cleaner... maybe?


5

Not sure whether this answer has already appeared...

For the particular pattern in the question, you can just do it with a single grep() call,

grep("A[169]", myfile$Letter)


2

Using sapply:

patterns <- c("A1", "A9", "A6")
df <- data.frame(name=c("A","Ale","Al","lex","x"),Letters=c("A1","A2","A9","A1","A9"))



   name Letters
1    A      A1
2  Ale      A2
3   Al      A9
4  lex      A1
5    x      A9


 df[unlist(sapply(patterns, grep, df$Letters, USE.NAMES = F)), ]
  name Letters
1    A      A1
4  lex      A1
3   Al      A9
5    x      A9


2

Take away the spaces, and drop fixed=TRUE, since | only acts as alternation when the pattern is read as a regular expression. So do:

matches <- unique(grep("A1|A9|A6", myfile$Letter, value=TRUE))


0

Another option is to use a pattern like '\\b(A1|A9|A6)\\b'. Here \\b is the regular-expression word boundary, which comes in handy if, for example, a row holds several letters such as "A7,A1": with that syntax you can still extract the row. Here is a reproducible example for both options:

df <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex     A7
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df, df[grep('\\b(A1|A9|A6)\\b', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

df2 <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7,A1
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df2
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df2, df2[grep('A1|A9|A6', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

Created on 2022-07-16 by the reprex package (v2.0.1)

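Please note: in an ordinary R string literal you must write \\b, because "\b" is the backspace character. If you are using R v4.0.0+, you can instead use a raw string such as r"(\b(A1|A9|A6)\b)", which takes single backslashes.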
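
As a quick check of what the word boundary changes, here is a small sketch on a made-up vector; the last line needs R 4.0.0 or later because it uses a raw string:

```r
x <- c("A1", "A10", "A7,A1", "BA6")

# Without boundaries, "A10" and "BA6" match because they contain "A1" / "A6"
grepl("A1|A9|A6", x)
#> [1] TRUE TRUE TRUE TRUE

# With \\b, only whole tokens match
grepl("\\b(A1|A9|A6)\\b", x)
#> [1]  TRUE FALSE  TRUE FALSE

# R >= 4.0.0 raw string: the same pattern written with single backslashes
identical(r"(\b(A1|A9|A6)\b)", "\\b(A1|A9|A6)\\b")
#> [1] TRUE
```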


-1

I suggest writing a little script and doing multiple searches with Grep. I've never found a way to search for multiple patterns, and believe me, I've looked!

Like so, your shell file, with an embedded string:

 #!/bin/bash
 # grep searches files or stdin, so pass the string in with a here-string
 s="Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"
 grep -o 'A6' <<< "$s"
 grep -o 'A7' <<< "$s"
 grep -o 'A8' <<< "$s"

Then make it executable with chmod +x myshell.sh and run it by typing ./myshell.sh.

If you want to be able to pass in the string on the command line, do it like this, with a shell argument--this is bash notation btw:

 #!/bin/bash
 # An assignment takes no leading $ and no spaces around =
 stringtomatch="${1}"
 grep -o 'A6' <<< "${stringtomatch}"
 grep -o 'A7' <<< "${stringtomatch}"
 grep -o 'A8' <<< "${stringtomatch}"

And so forth.

If there are a lot of patterns to match, you can put it in a for loop.

2 Comments

Thank you ChrisBean. The patterns are lots actually, and maybe it would be better to use a file then. I am new to BASH, but maybe something like this should work… #!/bin/bash for i in 'pattern.txt' do echo $i j='grep -c "${i}" myfile.txt' echo $j if [$j -eq o ] then echo $i >> matches.txt fi done
doesn't work…the error message is '[grep: command not found'…I have grep in the /bin folder, and /bin is on my $PATH…Not sure what is happening…Can you please help?
