grep using a character vector with multiple patterns

Question

I am trying to use grep to test whether the strings in one vector are present in another vector, and to output the values that are present (the matching patterns).

I have a data frame like this:

FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6

I have a vector of string patterns to find in the "Letter" column, for example: c("A1", "A9", "A6").

I would like to check whether any of the strings in the pattern vector are present in the "Letter" column. If they are, I would like the unique matching values as output.

The problem is, I don't know how to use grep with multiple patterns. I tried:

matches <- unique (
    grep("A1| A9 | A6", myfile$Letter, value=TRUE, fixed=TRUE)
)

But it gives me 0 matches, which is not true. Any suggestions?

You can't use fixed=TRUE because your pattern is a true regular expression.
– Marek, Commented Oct 5, 2011 at 15:27
Using match or %in% or even == is the only correct way to compare exact matches. Regex is very dangerous for such a task and can lead to unexpected results.
– David Arenburg, Commented Sep 12, 2016 at 5:34

Henrik · Accepted Answer · 2020-06-08 14:03:24Z

346

In addition to @Marek's comment about not including fixed=TRUE, you also need to remove the spaces from your regular expression. It should be "A1|A9|A6".

You also mention that there are lots of patterns. Assuming that they are in a vector

toMatch <- c("A1", "A9", "A6")

Then you can create your regular expression directly using paste and collapse = "|".

matches <- unique (grep(paste(toMatch,collapse="|"), 
                        myfile$Letter, value=TRUE))
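
For reference, here is a minimal, self-contained sketch of the approach above. The data frame myfile is rebuilt by hand from the table in the question, so its construction is an assumption rather than the asker's actual file.

```r
# Rebuild the example data from the question (assumed layout).
myfile <- data.frame(
  FirstName = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  Letter    = c("A1", "A6", "A7", "A1", "A9", "A6"),
  stringsAsFactors = FALSE
)

toMatch <- c("A1", "A9", "A6")

# Join the patterns into a single alternation: "A1|A9|A6".
pattern <- paste(toMatch, collapse = "|")

# value = TRUE returns the matching strings instead of their positions.
matches <- unique(grep(pattern, myfile$Letter, value = TRUE))
matches
#> [1] "A1" "A6" "A9"
```

Note that the result follows the order in which values appear in myfile$Letter, not the order of toMatch.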

answered Oct 5, 2011 at 16:35 by Brian Diggs; edited Jun 8, 2020 at 14:03 by Henrik

6 Comments

user124123 Over a year ago

Any way to do this when your list of strings includes regex operators as punctuation?

Brian Diggs Over a year ago

@user1987097 It should work the same way, with or without any other regex operators. Did you have a specific example this didn't work for?

mbh86 Over a year ago

@user1987097 Use 2 backslashes before a dot or bracket. The first backslash escapes the second inside the R string; the second is the one that disables the regex operator.
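
If the strings to match contain regex operators and should be taken literally, one alternative to escaping by hand is to test each string with fixed = TRUE and combine the results. This is only a sketch; the vectors x and toMatch here are made-up illustration data, not from the question.

```r
# Hypothetical data: two of the strings contain regex metacharacters.
x       <- c("a.b", "axb", "c+d", "cd")
toMatch <- c("a.b", "c+d")

# Treated as a regex, "." matches any character and "+" means "one or more".
regex_hits <- grepl(paste(toMatch, collapse = "|"), x)
regex_hits
#> [1]  TRUE  TRUE FALSE  TRUE

# Treated literally: one grepl() call per string, combined with logical OR.
literal_hits <- Reduce(`|`, lapply(toMatch, function(p) grepl(p, x, fixed = TRUE)))
literal_hits
#> [1]  TRUE FALSE  TRUE FALSE
```

The literal version makes one pass per pattern, so it trades some speed for not having to escape anything.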

David Arenburg Over a year ago

Using regex for exact matches seems dangerous to me and can have unexpected results. Why not just toMatch %in% myfile$Letter?

Brian Diggs Over a year ago

@user4050 No specific reason. The version in the question had it and I probably just carried it through without thinking about whether it was necessary.
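
The exact-matching suggestion from the comments can be sketched as follows. The extra values "A10" and "A8" are assumptions added only to show where the regex and exact approaches differ.

```r
# Question data plus one extra value ("A10") that merely contains "A1".
letter_col <- c("A1", "A6", "A7", "A1", "A9", "A6", "A10")
toMatch    <- c("A1", "A9", "A6", "A8")

# Which of the wanted strings occur exactly in the column?
present <- toMatch[toMatch %in% letter_col]
present
#> [1] "A1" "A9" "A6"

# Which column values are exactly one of the wanted strings?
exact_hits <- unique(letter_col[letter_col %in% toMatch])
exact_hits
#> [1] "A1" "A6" "A9"

# The regex version also picks up "A10", because "A1" matches inside it.
regex_hits <- unique(grep(paste(toMatch, collapse = "|"), letter_col, value = TRUE))
regex_hits
#> [1] "A1"  "A6"  "A9"  "A10"
```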

Adamm · Accepted Answer · 2017-05-12 08:42:29Z

47

Good answers, however don't forget about filter() from dplyr:

patterns <- c("A1", "A9", "A6")
>your_df
  FirstName Letter
1      Alex     A1
2      Alex     A6
3      Alex     A7
4       Bob     A1
5     Chris     A9
6     Chris     A6

result <- filter(your_df, grepl(paste(patterns, collapse="|"), Letter))

>result
  FirstName Letter
1      Alex     A1
2      Alex     A6
3       Bob     A1
4     Chris     A9
5     Chris     A6

answered May 12, 2017 at 8:42 by Adamm

3 Comments

Adamm Over a year ago

I think that grepl works with one pattern at a time (it needs a vector of length 1). We have 3 patterns (a vector of length 3), so we can combine them into one using a separator that grepl understands: |. Try your luck with others :)

Ahdee Over a year ago

Oh, I get it now. So it's a compressed way to output something like A1 | A2, so if one wanted all conditions then the collapse would be with an & sign. Cool, thanks.

fabern Over a year ago

Hi, using )|( to separate patterns might make this more robust: paste0("(", paste(patterns, collapse=")|("),")"). Unfortunately it also becomes slightly less elegant. This results in the pattern (A1)|(A9)|(A6).
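
On requiring all patterns rather than any of them: as far as I know, neither of R's regex engines treats & as an operator, so collapsing with "&" searches for a literal ampersand. A sketch of one alternative is to run one grepl() per pattern and combine the results with a logical AND. The vector x is made-up illustration data.

```r
patterns <- c("A1", "A9")
x <- c("A1", "A7,A1", "A9,A1", "A1,A6,A9")

# Any pattern present: alternation with "|".
any_hit <- grepl(paste(patterns, collapse = "|"), x)
any_hit
#> [1] TRUE TRUE TRUE TRUE

# Collapsing with "&" looks for the literal text "A1&A9", which never occurs.
amp_hit <- grepl(paste(patterns, collapse = "&"), x)
amp_hit
#> [1] FALSE FALSE FALSE FALSE

# All patterns present: one logical vector per pattern, combined with AND.
all_hit <- Reduce(`&`, lapply(patterns, function(p) grepl(p, x)))
all_hit
#> [1] FALSE FALSE  TRUE  TRUE
```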

petermeissner · Accepted Answer · 2020-07-21 07:35:09Z

43

This should work:

grep(pattern = 'A1|A9|A6', x = myfile$Letter)

Or even more simply:

library(data.table)
myfile$Letter %like% 'A1|A9|A6'
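
For comparison, here is a base R sketch of what the different calls return. Without value = TRUE, grep() gives positions rather than strings, and grepl() gives one logical per element, which is, as far as I understand, also what %like% produces. The vector letter_col stands in for myfile$Letter.

```r
letter_col <- c("A1", "A6", "A7", "A1", "A9", "A6")

# Positions of the matching elements.
idx <- grep("A1|A9|A6", letter_col)
idx
#> [1] 1 2 4 5 6

# The matching strings themselves.
vals <- grep("A1|A9|A6", letter_col, value = TRUE)
vals
#> [1] "A1" "A6" "A1" "A9" "A6"

# One TRUE/FALSE per element, convenient for subsetting rows.
flags <- grepl("A1|A9|A6", letter_col)
flags
#> [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
```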

answered Nov 1, 2018 at 15:15 by BOC; edited Jul 21, 2020 at 7:35 by petermeissner

2 Comments

Gregor Thomas Over a year ago

%like% isn't in base R, so you should mention what package(s) are needed to use it.

steveb Over a year ago

For others looking at this answer, %like% is part of the data.table package. Also similar in data.table are like(...), %ilike%, and %flike%.

Austin · Accepted Answer · 2015-10-31 21:35:08Z

10

Based on Brian Diggs's post, here are two helpful functions for filtering lists:

#Returns all items in a list that are not contained in toMatch
#toMatch can be a single item or a list of items
exclude <- function (theList, toMatch){
  return(setdiff(theList,include(theList,toMatch)))
}

#Returns all items in a list that ARE contained in toMatch
#toMatch can be a single item or a list of items
include <- function (theList, toMatch){
  matches <- unique (grep(paste(toMatch,collapse="|"), 
                          theList, value=TRUE))
  return(matches)
}
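
A short usage sketch for the two helpers, with the functions repeated so the block runs on its own. The input vector is taken from the question's Letter column. The last lines show that grep() also has an invert argument, which gives the same exclusion without setdiff().

```r
include <- function(theList, toMatch) {
  unique(grep(paste(toMatch, collapse = "|"), theList, value = TRUE))
}

exclude <- function(theList, toMatch) {
  setdiff(theList, include(theList, toMatch))
}

letter_col <- c("A1", "A6", "A7", "A1", "A9", "A6")
toMatch    <- c("A1", "A9", "A6")

kept <- include(letter_col, toMatch)
kept
#> [1] "A1" "A6" "A9"

dropped <- exclude(letter_col, toMatch)
dropped
#> [1] "A7"

# Same exclusion using grep's own invert argument.
dropped_invert <- unique(grep(paste(toMatch, collapse = "|"), letter_col,
                              value = TRUE, invert = TRUE))
dropped_invert
#> [1] "A7"
```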

answered Aug 22, 2015 at 20:15 by Austin; edited Oct 31, 2015 at 21:35

dwitvliet · Accepted Answer · 2014-07-25 13:17:25Z

6

Have you tried the match() or charmatch() functions?

Example use:

match(c("A1", "A9", "A6"), myfile$Letter)
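
To make the return value concrete, here is a small sketch. match() gives, for each wanted string, the position of its first exact occurrence, or NA if there is none. The value "A8" is an assumption added to show the NA case.

```r
letter_col <- c("A1", "A6", "A7", "A1", "A9", "A6")
toMatch    <- c("A1", "A9", "A6", "A8")

# Position of the first exact occurrence of each wanted string.
pos <- match(toMatch, letter_col)
pos
#> [1]  1  5  2 NA

# Keep only the wanted strings that were actually found.
found <- toMatch[!is.na(pos)]
found
#> [1] "A1" "A9" "A6"
```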

answered Jul 25, 2014 at 13:16 by user3877096; edited Jul 25, 2014 at 13:17 by dwitvliet

1 Comment

steveb Over a year ago

One thing to note with match is that it is not using patterns, it is expecting an exact match.

DryLabRebel · Accepted Answer · 2017-01-23 00:25:23Z

5

To add to Brian Diggs's answer:

Another way, using grepl, returns a data frame containing all the matching rows.

toMatch <- c("A1", "A9", "A6")

matches <- myfile[grepl(paste(toMatch, collapse="|"), myfile$Letter), ]

matches

Letter Firstname
1     A1      Alex 
2     A6      Alex 
4     A1       Bob 
5     A9     Chris 
6     A6     Chris

Maybe a bit cleaner... maybe?

answered Jan 23, 2017 at 0:14 by DryLabRebel; edited Jan 23, 2017 at 0:25

BenBarnes · Accepted Answer · 2017-05-12 08:48:18Z

5

Not sure whether this answer has already appeared...

For the particular pattern in the question, you can just do it with a single grep() call,

grep("A[169]", myfile$Letter)
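
One caveat worth sketching: a character class matches anywhere in the string, just like the alternation does. If only whole values should match, the pattern can be anchored with ^ and $. The values "A10" and "BA6" are assumptions added for illustration.

```r
x <- c("A1", "A6", "A7", "A9", "A10", "BA6")

# Unanchored: "A10" and "BA6" match because they contain A1 and A6.
loose <- grepl("A[169]", x)
loose
#> [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

# Anchored: the whole string must be A followed by 1, 6 or 9.
strict <- grepl("^A[169]$", x)
strict
#> [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
```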

answered Apr 19, 2017 at 16:00 by Assaf; edited May 12, 2017 at 8:48 by BenBarnes

dondapati · Accepted Answer · 2018-02-09 07:56:49Z

2

Using sapply:

patterns <- c("A1", "A9", "A6")
df <- data.frame(name = c("A", "Ale", "Al", "lex", "x"),
                 Letters = c("A1", "A2", "A9", "A1", "A9"))
df
   name Letters
1    A      A1
2  Ale      A2
3   Al      A9
4  lex      A1
5    x      A9

df[unlist(sapply(patterns, grep, df$Letters, USE.NAMES = FALSE)), ]
  name Letters
1    A      A1
4  lex      A1
3   Al      A9
5    x      A9
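
Two things to watch with the per-pattern approach, sketched here under the assumption that a single value may contain more than one pattern: the rows come back grouped by pattern rather than in their original order, and a row that matches two patterns is returned twice. The sketch uses lapply rather than sapply so the result is always a list, even when every pattern matches the same number of rows.

```r
patterns <- c("A1", "A9")
letters_vec <- c("A1", "A2", "A9", "A1,A9")  # made-up data; element 4 matches both

# One vector of positions per pattern, flattened in pattern order.
idx <- unlist(lapply(patterns, grep, x = letters_vec))
idx
#> [1] 1 4 3 4

# Drop duplicates and restore the original row order before subsetting.
rows <- sort(unique(idx))
rows
#> [1] 1 3 4
```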

answered Feb 9, 2018 at 7:56 by dondapati

Saurabh Chauhan · Accepted Answer · 2018-08-01 10:54:40Z

2

Take away the spaces, and drop fixed=TRUE as well, since | only acts as "or" when the pattern is treated as a regular expression. So do:

matches <- unique(grep("A1|A9|A6", myfile$Letter, value=TRUE))

answered May 4, 2018 at 22:26 by user9325029; edited Aug 1, 2018 at 10:54 by Saurabh Chauhan

Quinten · Accepted Answer · 2022-07-16 18:01:21Z

Another option would be to use a pattern like '\\b(A1|A9|A6)\\b'. Here \\b is the regular-expression word boundary, which comes in handy if, for example, a row holds several letters such as "A7,A1": with that syntax you can still extract the row. Here is a reproducible example for both options:

df <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex     A7
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df, df[grep('\\b(A1|A9|A6)\\b', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

df2 <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7,A1
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df2
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df2, df2[grep('A1|A9|A6', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

Created on 2022-07-16 by the reprex package (v2.0.1)

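Please note: in an ordinary R string the backslash has to be doubled, so write \\b. From R 4.0.0 you can use a raw string instead, for example r"(\b(A1|A9|A6)\b)", where a single \b is enough.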
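
To show where the word boundary makes a difference, here is a small sketch with two extra values, "A10" and "BA6", which are assumptions added for illustration. Without \\b they match because they contain A1 and A6; with \\b they do not, while "A7,A1" still does.

```r
x <- c("A1", "A7,A1", "A10", "BA6")

# Plain alternation matches anywhere inside the string.
plain <- grepl("A1|A9|A6", x)
plain
#> [1] TRUE TRUE TRUE TRUE

# With word boundaries, the code must stand on its own as a "word".
bounded <- grepl("\\b(A1|A9|A6)\\b", x)
bounded
#> [1]  TRUE  TRUE FALSE FALSE
```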

Jaap · Accepted Answer · 2017-02-08 11:07:33Z

-1

I suggest writing a little script and doing multiple searches with Grep. I've never found a way to search for multiple patterns, and believe me, I've looked!

Like so, your shell file, with an embedded string:

 #!/bin/bash
 # grep reads files or standard input, so pass the string in with a here-string.
 text="Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"
 grep 'A6' <<< "${text}"
 grep 'A7' <<< "${text}"
 grep 'A8' <<< "${text}"

Then run by typing myshell.sh.

If you want to be able to pass in the string on the command line, do it like this, with a shell argument--this is bash notation btw:

 #!/bin/bash
 # In bash, assignments take no leading $ and no spaces around =.
 stringtomatch="${1}"
 grep 'A6' <<< "${stringtomatch}"
 grep 'A7' <<< "${stringtomatch}"
 grep 'A8' <<< "${stringtomatch}"

And so forth.

If there are a lot of patterns to match, you can put it in a for loop.

answered Sep 29, 2011 at 13:00 by ChrisBean; edited Feb 8, 2017 at 11:07 by Jaap

2 Comments

user971102 Over a year ago

Thank you ChrisBean. The patterns are lots actually, and maybe it would be better to use a file then. I am new to BASH, but maybe something like this should work… #!/bin/bash for i in 'pattern.txt' do echo $i j='grep -c "${i}" myfile.txt' echo $j if [$j -eq o ] then echo $i >> matches.txt fi done

user971102 Over a year ago

doesn't work…the error message is '[grep: command not found'…I have grep in the /bin folder, and /bin is on my $PATH…Not sure what is happening…Can you please help?
