3

Edit: Thanks to those who have responded so far; I'm very much a beginner in R and have just taken on a large project for my MSc dissertation so am a bit overwhelmed with the initial processing. The data I'm using is as follows (from WMO publically available rainfall data):


120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0
1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03
1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03
1901 0.03 0.03 0.03 0.03 0.03 17.03 23.03 17.03 0.03 8.03 0.03 0.03
(...)
120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0
1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03
1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03
1922 0.03 0.03 0.03 0.00 0.03 0.00 87.50 102.30 10.40 15.20 0.01 0.03
(...)

There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".

I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique. Thanks again for the help!

(Original question:) I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more managable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year. The following is a simplified version of my code so far:

a <- array(1,dim=c(10,12))
for (i in 1:5) {

  all data:
  assign(paste("station_",i,sep=""), a)

  #march - june data:
  assign(paste("station_",i,"_mamj",sep=""), a[,4:7])
}

So this gives me station_(i)__mamj_ which contains the data for the months I'm interested in for each station. Now I want to sum each row of this array and enter it in a new array called station_(i)_mamj_tot. Simple enough in theory, but I can't work out how to reference station_(i)_mamj so that it varies the value of i with each iteration. Any help much appreciated!

8
  • 2
    This is not a reproducible example as you haven't provided any data. May I suggest you check out this LINK on making a reproducible example. Commented May 14, 2012 at 17:18
  • As mentioned you should probably really be using lists to solve this problem. If we knew what your data looked like we could probably help you even more. For example from what I can gather what you want to do would probably be more cleanly done using split and lapply. Commented May 14, 2012 at 18:05
  • If you can make a data frame in a way that each column represents a month of a particular year, then you can just get the sum with the summary(data.frame) Commented May 14, 2012 at 18:22
  • @Subs, he wants to aggregate by year, but only calculate mamj totals. This is a job for Split-Apply-Combine! See my ddply one-liner below! Commented May 14, 2012 at 22:43
  • OP, avoid using loops whenever you can vectorize, that's the power of R. With your permission I would like to retag from 'variables','loops','concatenation' to 'vectorization','loops','plyr'? Commented May 14, 2012 at 22:46

3 Answers 3

4

This is totally begging for a dataframe, then it's just this one-liner with power-tools like ddply (amazingly powerful):

tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))

giving your aggregate of total for M/A/M/J, by year:

   year station_1 station_2 station_3 station_4 station_5 ...
1  1972  8.618960  5.697739 10.083192  9.264512 11.152378 ...
2  1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3  1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4  1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...

Below is perfectly working code. We create a dataframe whose col.names are 'station_n'; also extra columns for year and month (factor, or else integer if you're lazy, see the footnote). Now you can do arbitrary analysis by month or year (using plyr's split-apply-combine paradigm):

require(plyr) # for d*ply, summarise
#require(reshape) # for melt

# Parameterize everything here, it's crucial for testing/debugging
all_years <- c(1970:2011)
nYears <- length(all_years)  
nStations <- 101
# We want station names as vector of chr (as opposed to simple indices)
station_names <- paste ('station_', 1:nStations, sep='')

rain <- data.frame(cbind(
  year=rep(c(1970:2011),12),
  month=1:12
))
# Fill in NAs for all data
rain[,station_names] <- as.numeric(NA)
# Make 'month' a factor, to prevent any numerical funny stuff e.g accidentally 'aggregating' it
rain$month <- factor(rain$month)

# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)

# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj,station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)

# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol = -2
tot_mamj <- ddply(rain[rain$month %in% 3:6, excludeMonthCol], 'year', colwise(sum))

# voila!!
#    year station_1 station_2 station_3 station_4 station_5
# 1  1972  8.618960  5.697739 10.083192  9.264512 11.152378
# 2  1973 18.571748 18.903280 11.832462 18.262272 10.509621
# 3  1974 22.415201 22.670821 32.850745 31.634717 20.523778
# 4  1975 16.773286 17.683704 18.259066 14.996550 19.007762

As a footnote, before I converted month from numeric to factor, it was getting silently 'aggregated' (until I put in the '-2': exclude column reference). However, better still is when you make it a factor, it will refuse point-blank to be aggregate'd, and throw an error (which is desirable for debugging):

 ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) : 
  sum not meaningful for factors
Sign up to request clarification or add additional context in comments.

Comments

2

For your original question, use get():

i <- 10
var <- paste("test", i, sep="_")
assign(10, var)
get(var)

As David said, this is probably not the best path to be taking, but it can be useful at times (and IMO the assign/get construct is far better than eval(parse))

4 Comments

True with regard to get. However, when all you're doing is saving one variable per index, it is very silly not to use the built in types such as lists and matrices.
Thus "As David said, this is probably not the best path to be taking" :)
data.frame has the advantage over array that we can use heterogeneous columns, so we can have columns for both 'year' and 'month' as factors... then we can arbitrarily split-apply-combine by year, month, or any arbitrary subset thereof. Whereas array is really pretty limited for data analysis.
The thing is that he (originally) asked how to get values out of arbitrary variable names, not how he should be structuring his question. As I stated in my post, it is pretty clear that he wasn't going about things the right way (See Lumley's fortune() quip about eval(parse())). Answering the (original) Q with data.frames and such does not actually answer the (original) Q.
1

Why are you using assign to create variables like station1, station2, station_3_mamj and so on? It would be much easier and more intuitive to store them in a list, like stations[[1]], stations[[2]], stations_mamj[[3]], and such. Then each could be accessed using their index.

Since it looks like each piece of per-station data you're working with is a matrix of the same size, you could even deal with them as a three-dimensional matrix.

ETA: Incidentally, if you really want to solve the problem this way, you would do:

eval(parse(text=paste("station", i, "mamj", sep="_")))

But don't- using eval is almost always bad practices, and will make it difficult to do even simple operations on your data.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.