How to select strings to read from file or data.frame by partial string match or regex in R?

Question

Here's the file sample:

PG32 -13475.111367   9609.545216 -20675.190735   -194.319140                    
PG04 -15764.275182  19616.036013  -8378.361758     -9.567460                    
PG08 -23862.812721   9840.809904  -4415.011886     18.783955                    
PG10  25009.053940   9106.541565   2672.535304   -168.226094                    
PG14 -14188.519147  -9647.162991 -20079.808927     76.323202                    
PG13  12541.368512 -14252.727697  18475.956052    -99.144840                    
PG28  22638.858335  13831.226799   2650.716670    427.905209                    
PG21 -10609.714398 -12191.750707  21782.583544   -429.224611                    
PG11  -8677.979931  23944.136240  -7811.280190   -566.272355                    
PG22 -24991.333186  -9073.717145  -1692.043749    331.646741                    
PG20  25603.243214   5007.836647   5172.462172    302.625348                    
PG18 -19417.534666 -15923.466357   9597.721199    388.425996

The real file is many times bigger. The first column is the satellite's "name" (e.g. "PG32"). I have a character vector of satellite ids:

>[1] "PG05" "PG07" "PG09" "PG10" "PG13" "PG16" "PG19" "PG20" "PG27"  "PG28" "PG30"

So I need to extract only the lines with those ids, either from a data.frame or directly from a file using read.pattern from the gsubfn package. I'm trying to learn regular expressions but don't have a complete understanding of the subject yet.

Try yourdf[grep(paste(v1, collapse='|'), yourdf$firstcolumn),]
– akrun, Commented Dec 27, 2015 at 13:04
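
A minimal sketch of the comment's suggestion, assuming a small stand-in data frame: `yourdf`, `v1` and the column names `sat` and `x` are placeholders, not names from the real data. An unanchored alternation such as `PG10|PG13` also matches ids that merely contain those strings, so the pattern here is wrapped in `^(...)$`. For exact ids, `%in%` gives the same rows without any regex.

```r
# Stand-in data frame (a few rows taken from the sample in the question).
yourdf <- data.frame(
  sat = c("PG32", "PG04", "PG10", "PG13", "PG28"),
  x   = c(-13475.111367, -15764.275182, 25009.053940, 12541.368512, 22638.858335),
  stringsAsFactors = FALSE
)
v1 <- c("PG05", "PG07", "PG09", "PG10", "PG13", "PG16",
        "PG19", "PG20", "PG27", "PG28", "PG30")

# Regex route: "^(PG05|PG07|...)$" matches whole ids only.
pattern  <- paste0("^(", paste(v1, collapse = "|"), ")$")
by_regex <- yourdf[grepl(pattern, yourdf$sat), ]

# Exact-match route, no regex needed.
by_in <- yourdf[yourdf$sat %in% v1, ]
```

Both `by_regex` and `by_in` keep only the rows whose id appears in `v1`.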
Thanks, that seems good for a data.frame. I'd like to know how to get the same result without loading the entire file into a data.frame. It seems read.pattern can read lines from a file based on a regexp, and that's what I want to do here. But I can't figure out the appropriate regexp.
– ephemeris, Commented Dec 27, 2015 at 14:31
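
One way to build such a regexp, sketched in base R rather than with read.pattern: anchor an alternation of the ids at the start of the line, filter the raw lines with it, and parse only the survivors. The `lines` vector stands in for the result of `readLines()` on the real file, and the column names are illustrative. Whether read.pattern skips non-matching lines in the same way should be checked against the gsubfn documentation.

```r
# Sample lines standing in for readLines() on the real file.
lines <- c(
  "PG32 -13475.111367   9609.545216 -20675.190735   -194.319140",
  "PG04 -15764.275182  19616.036013  -8378.361758     -9.567460",
  "PG10  25009.053940   9106.541565   2672.535304   -168.226094",
  "PG13  12541.368512 -14252.727697  18475.956052    -99.144840"
)
ids <- c("PG05", "PG07", "PG09", "PG10", "PG13", "PG16",
         "PG19", "PG20", "PG27", "PG28", "PG30")

# "^(PG05|PG07|...)\\s": the line must start with one of the ids,
# followed by whitespace.
pattern <- paste0("^(", paste(ids, collapse = "|"), ")\\s")
keep    <- grepl(pattern, lines)

# Parse only the matching lines; column names are illustrative.
sats <- read.table(text = lines[keep],
                   col.names = c("sat", "data1", "data2", "data3", "data4"),
                   stringsAsFactors = FALSE)
```

Note that this still holds every raw line in memory as text; only the parsing into a data frame is restricted to the wanted satellites.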

Parfait · Accepted Answer · 2015-12-27 21:39:35Z

Consider scanning the file line by line with scan, checking on each iteration whether the first column is in the satellite list:

## INITIAL VARS
file <- "C:\\Path\\To\\File.txt"
flines <- 12   # number of lines to read (12 in the sample above)

satnames <- c("PG05", "PG07", "PG09", "PG10", "PG13", "PG16", 
              "PG19", "PG20", "PG27", "PG28", "PG30", "PG32")

## OPEN CONNECTION
con <- file(description=file, open="r")

## LOOP OVER CONNECTION
dfList <- c()
for(i in 1:flines) {
  tmp <- scan(file=con, nlines=1, what = list("","","","",""), quiet=TRUE)
  names(tmp) <- c('sat', 'data1', 'data2', 'data3', 'data4')  

  # APPEND TO DFLIST ONLY IF IN SATNAMES LIST
  if (tmp$sat %in% satnames) {
    dfList <- c(dfList, list(tmp))   
  }      
}

## CLOSE CONNECTION
close(con)

## MIGRATE LIST TO DATA FRAME, CONVERTING DATA TYPES
# Build one row per kept line so every column is an atomic vector, not a list
df <- do.call(rbind, lapply(dfList, as.data.frame, stringsAsFactors = FALSE))
df[2:5] <- lapply(df[2:5], as.numeric)

rm(con, dfList, tmp)
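
A variant of the same idea, as a rough sketch: read the connection in chunks with readLines until it is exhausted, so the number of lines does not have to be known in advance. The temporary file, `chunk_size`, and the assumption that the id occupies the first four characters of each line are illustrative choices, not part of the original answer.

```r
# Write a few sample lines to a temporary file so the sketch is runnable.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "PG32 -13475.111367   9609.545216 -20675.190735   -194.319140",
  "PG04 -15764.275182  19616.036013  -8378.361758     -9.567460",
  "PG10  25009.053940   9106.541565   2672.535304   -168.226094",
  "PG20  25603.243214   5007.836647   5172.462172    302.625348"
), path)

satnames   <- c("PG10", "PG13", "PG20", "PG28")
chunk_size <- 2   # hypothetical; a real file would use thousands of lines

con  <- file(path, open = "r")
kept <- character(0)
repeat {
  chunk <- readLines(con, n = chunk_size)
  if (length(chunk) == 0) break          # end of file reached
  ids  <- substr(chunk, 1, 4)            # assumes a 4-character id at line start
  kept <- c(kept, chunk[ids %in% satnames])
}
close(con)

# Parse only the kept lines; column names follow the answer above.
df <- read.table(text = kept,
                 col.names = c("sat", "data1", "data2", "data3", "data4"),
                 stringsAsFactors = FALSE)
unlink(path)   # remove the temporary file
```

Only one chunk of raw text plus the kept lines are held in memory at any time, and read.table handles the numeric conversion.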
