1

Okay - maybe this is a better example. I am looking for guidance/references on how to reference a variable within a regex - not how to build a regex for this data.

How can you use a value from a variable to regex the next variable?

library(plyr)    
library(tm)
library(stringr)
library(gsubfn)

Dataset of velocities

d1 <- data.frame(sub = c("LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:", "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 60-70% (0-49)LESS THAN 50% COMMON:", "LEFT CAROTID STENOSIS: (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES (40-50)LESS THAN 50% COMMON:"), stringsAsFactors = FALSE)

d1$sub
[1] "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:"                        
[2] "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 60-70% (0-49)LESS THAN 50% COMMON:"                        
[3] "LEFT CAROTID STENOSIS: (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES (40-50)LESS THAN 50% COMMON:"

extract sub1

d1$sub1 <- as.character(lapply((strapply(d1$sub,"((?<=LEFT CAROTID STENOSIS:).{5,}?(?=(\\(|COMMON)))", perl=TRUE)), unique))
d1$sub1
[1] " (50-69)APPROXIMATELY 50-55% "                       
[2] " (50-69)APPROXIMATELY 60-70% "                       
[3] " (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES "

Now reference sub1 to get sub2 from the data

Want to return "(0-49)LESS THAN 50%", "(0-49)LESS THAN 50%", and "(40-50)LESS THAN 50%"

d1$sub2 <- as.character(lapply((strapply(d1$sub,"((?<=\\d1$sub1).*?(?=COMMON))", perl=TRUE)), unique))
d1$sub2
[1] "NULL" "NULL" "NULL"

** Original Post Below **

I am extracting medical information from text reports, and am attempting to use one variable ($sub1) as part of a regex to find the next variable ($sub2).

How can you use a value from a variable to regex the next variable?

library(plyr)
library(tm)
library(stringr)
library(gsubfn)

#Dataset of velocities
d1 <- c("CCA: 135 cm/sec ICA: 50 cm/sec", "CCA: 150 cm/sec ICA: 75 cm/sec")
d1
[1] "CCA: 135 cm/sec ICA: 50 cm/sec" "CCA: 150 cm/sec ICA: 75 cm/sec"

#Lookahead to get sub1
d1$sub1 <- as.character(lapply((strapply(d1,"(.*?(?=ICA:))", perl=TRUE)), unique))
Warning message:
In d1$sub1 <- as.character(lapply((strapply(d1, "(.*?(?=ICA:))",  :
 Coercing LHS to a list
d1
[[1]]
[1] "CCA: 135 cm/sec ICA: 50 cm/sec"

[[2]]
[1] "CCA: 150 cm/sec ICA: 75 cm/sec"

$sub1
[1] "CCA: 135 cm/sec " "CCA: 150 cm/sec "

#Now reference sub1 to get sub2 - does not work
#Want to return "ICA: 50 cm/sec" and "ICA: 75 cm/sec"
#Used paste(d1$sub1) to try to get the $sub1 variable into the regex, but it doesn't work
d1$sub2 <- as.character(lapply((strapply(d1,"((?<=paste(d1$sub1)).*?)", perl=TRUE)), unique))
d1$sub2
[1] "NULL" "NULL" "NULL"

The text has structure, but is very variable in terms of length, content, etc. Defining the first variable ($sub1) is easy, but using it to define the second variable will be the most precise.

Maybe I should have emphasized that the text is very variable - so a simple regex based on the text pattern will not work. I need to use the first variable to locate the second within the text. It is medical information so I can't post actual data.

1
  • Not sure I fully understand your question - the desired final output is 2 variables ($sub1 = CCA: 135 cm/sec, $sub2 = ICA: 50 cm/sec). I can generate the variables but am struggling with how to reference the first to get a locator for the second. Commented Aug 21, 2013 at 15:39

3 Answers

6

Try the paste0() function. It concatenates the value of your variable with whatever regex fragments you need into a single pattern string.

grep(paste0("^.*", variable, ".*$"), d1)

You can also pass collapse = "" to paste0() if your variable could have more than one element.
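
As a rough sketch of how this could apply to the question's data: the pattern inside the quotes is never evaluated as R code, so the variable's value has to be pasted into the pattern string before it reaches the regex engine. The example below assumes base R only and that sub1 has already been extracted. Wrapping the value in \Q...\E (with perl = TRUE) is my addition rather than part of the answer above; it makes PCRE read the value literally, and it would break if the value itself contained \E.

```r
# Example data taken from the original post.
d1   <- c("CCA: 135 cm/sec ICA: 50 cm/sec", "CCA: 150 cm/sec ICA: 75 cm/sec")
sub1 <- c("CCA: 135 cm/sec ", "CCA: 150 cm/sec ")

# Build one pattern per element. \Q...\E tells PCRE to treat the pasted
# value as literal text, so characters such as ( ) . are not read as regex syntax.
patterns <- paste0("\\Q", sub1, "\\E(.*)$")

# sub() is not vectorised over the pattern argument, so pair each string
# with its own pattern using mapply().
sub2 <- unname(mapply(function(p, x) sub(p, "\\1", x, perl = TRUE),
                      patterns, d1))
sub2
# [1] "ICA: 50 cm/sec" "ICA: 75 cm/sec"
```

The same idea should carry over to strapply(): build the pattern with paste0() first and pass the resulting string, rather than writing the variable name inside the quotes.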


2

Try this:

> d1 <- c("CCA: 135 cm/sec ICA: 50 cm/sec", "CCA: 150 cm/sec ICA: 75 cm/sec")
> t(strapplyc(d1, "\\w+: \\S+ \\S+", simplify = TRUE))
     [,1]              [,2]            
[1,] "CCA: 135 cm/sec" "ICA: 50 cm/sec"
[2,] "CCA: 150 cm/sec" "ICA: 75 cm/sec"


0

You'll need to escape various characters to use variables in regex, but why not do the simpler thing?

sub('(.*)ICA.*', '\\1', d1)
#[1] "CCA: 135 cm/sec " "CCA: 150 cm/sec "
sub('.*(ICA.*)', '\\1', d1)
#[1] "ICA: 50 cm/sec" "ICA: 75 cm/sec"
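
If you do want the variable inside the pattern, here is a minimal sketch of the escaping step mentioned above. escape_regex() is a hypothetical helper written for this illustration, not a function from base R or gsubfn, and the example uses the first stenosis string from the edited question with its sub1 value typed in by hand.

```r
# Hypothetical helper: put a backslash in front of every regex metacharacter
# so the value is matched literally (assumes perl = TRUE, i.e. PCRE syntax).
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

x    <- "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:"
sub1 <- " (50-69)APPROXIMATELY 50-55% "

# Paste the escaped value into a lookbehind, then read lazily up to " COMMON".
pattern <- paste0("(?<=", escape_regex(sub1), ").*?(?= COMMON)")

sub2 <- regmatches(x, regexpr(pattern, x, perl = TRUE))
sub2
# [1] "(0-49)LESS THAN 50%"
```

The lookbehind works here because an escaped literal has a fixed length, which PCRE requires. For a whole column you would still need one pattern per row, for example via mapply().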

