3

I have two data frames (df1, df2), and I would like to replace the letters in columns P1-P10 of df2 with the corresponding values of df1$V2, while keeping the first two columns of df2.

df1 = data.frame(V1=LETTERS, V2=rnorm(26))

df2 <- data.frame(Name=sample(LETTERS, 6), bd=sample(1:6), P1=sample(LETTERS,6), P2=sample(LETTERS, 6), P3=sample(LETTERS, 6), P4=sample(LETTERS, 6), P5=sample(LETTERS, 6), P6=sample(LETTERS, 6), P7=sample(LETTERS, 6), P8=sample(LETTERS, 6), P9=sample(LETTERS, 6), P10=sample(LETTERS, 6))

My approach is the following:

df3 <- matrix(setNames(df1[,2], df1[,1])[as.character(unlist(df2[,3:12]))], nrow=6, ncol=10)
df4 <- data.frame(cbind(df2[,1:2], df3))

This gives me my desired output, but my real data has 10,000 columns. Is there any way to avoid the cbind step or otherwise make the process faster?

> df4
Name bd         X1          X2         X3         X4         X5         X6        X7         X8         X9        X10
1    V  6 -1.8991102  0.40269050 -0.1517500 -2.5297829  1.5315622  1.4897071  1.364071 -1.2443708 -1.3197276 -0.4917057
2    T  1 -2.5297829 -0.44614123 -0.1894970 -0.6693774 -0.1517500 -1.0650962 -0.151750 -0.4461412 -0.6693774 -1.1351770
3    R  5 -0.6693774  0.09059365 -2.5297829  0.3233827 -0.9383348 -0.4461412  1.281797  1.5315622  1.4897071 -0.4461412
4    B  4 -0.4461412 -0.93833476 -1.2443708 -0.4461412 -0.1894970 -0.9383348 -1.135177 -1.8991102 -0.1894970  0.4026905
5    K  2 -1.0180271 -1.06509624 -0.1939600 -0.1894970  1.4897071 -0.6693774 -1.899110 -1.3197276  1.5315622 -0.1517500
6    Y  3  1.5315622 -0.19396005 -0.4917057 -0.4664239 -1.8991102  0.4026905 -1.065096 -0.9383348 -1.2443708 -0.4664239

Thanks

  • In your example P1-P10 are factors. Is it really so in your dataset? Commented Dec 11, 2013 at 20:33
  • Yes, they are factors in my dataset. Sorry for my late reply. Commented Dec 11, 2013 at 22:57
  • OK, my answer now works with factors as well as characters. Commented Dec 12, 2013 at 7:34

2 Answers 2

3

You can match the values of df2[3:12] against df1[[1]]. The resulting row numbers are then used to extract the corresponding values from the second column of df1.

df2[3:12] <- df1[match(as.character(unlist(df2[3:12])), 
                       as.character(df1[[1]])), 2]

The result (df2):

  Name bd         P1         P2         P3         P4         P5         P6         P7         P8         P9        P10
1    H  5  0.1199355  0.3752010 -0.3926061 -1.1039548 -0.1107821  0.9867373 -0.3360094 -0.7488000 -0.3926061  2.0667704
2    U  4  0.1168599  0.1168599  0.9867373  1.3521418  0.9867373 -0.3360094 -0.7724007 -0.3926061 -0.3360094 -1.2543480
3    R  3 -1.2337890 -0.1107821 -0.7724007  2.0667704  0.3752010  0.4645504  0.9867373  0.1168599 -0.0981773 -0.3926061
4    G  2 -0.3926061  0.3199261 -0.0981773 -0.1107821  2.0667704 -1.1039548 -1.2337890  0.3199261 -1.2337890 -2.1534678
5    C  6 -2.1534678 -1.1039548 -1.1039548 -0.7488000  0.4645504  0.3199261 -2.1534678 -0.3360094  0.9867373  0.8771467
6    I  1  0.6171634  0.6224091  1.8011711  0.7292998  0.8771467  2.0667704  0.3752010  0.4645504 -2.1534678 -0.7724007
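To make the mechanics of the match() step easier to follow, here is a minimal sketch on toy data. The names keys, values, x, pos and d are made up for illustration and are not part of the question's data. It relies on the fact that assigning a plain vector to several data frame columns fills them column by column, which is the same order unlist() produces.

```r
# Toy lookup table: one value per key (illustrative names only)
keys   <- c("A", "B", "C")
values <- c(10, 20, 30)

# match() returns, for each element of x, its position in keys
x   <- c("C", "A", "C", "B")
pos <- match(x, keys)   # 3 1 3 2
values[pos]             # 30 10 30 20

# The same idea applied to two columns of a small data frame
d <- data.frame(id = 1:2, p1 = c("A", "B"), p2 = c("C", "A"),
                stringsAsFactors = FALSE)
d[2:3] <- values[match(unlist(d[2:3]), keys)]
d
```

Any value without a partner in keys would come back as NA, so checking anyNA() on the result is a cheap sanity test.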

If you don't want to replace the values inside df2, you can create a new data frame df4 with

df4 <- "[<-"(df2, 3:12, value = df1[match(as.character(unlist(df2[3:12])), 
                                          as.character(df1[[1]])), 2])
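In case the functional "[<-" call looks unfamiliar: as far as I can tell it behaves like copying the data frame and then assigning into the copy. A small sketch with made-up objects (df_a, df_b, df_c and new_vals are hypothetical names):

```r
df_a <- data.frame(id = 1:2, p1 = c("A", "B"), stringsAsFactors = FALSE)
new_vals <- c(1.5, 2.5)

# Functional form: returns a modified copy, df_a itself is left as it was
df_b <- "[<-"(df_a, 2, value = new_vals)

# Two-step form: copy first, then replace the column in the copy
df_c <- df_a
df_c[2] <- new_vals

identical(df_b, df_c)   # both forms give the same data frame
is.character(df_a$p1)   # the original still holds the letters
```

The two-step form is arguably easier to read; the functional form just saves a line.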

10 Comments

Hello again @Sven Hohenstein, something weird is happening with my data. I have converted my data.frame to a matrix to speed up the calculations (7 rows, 10,000 columns). When I apply your solution, the values are replaced only up to a certain column; for example, from column 1220 onward the values are not replaced and keep their "character" value. The dput of my data is too big to post here, but I'd like to ask for your help. Many thanks
@user2380782 In my code, 3:12 represents the numbers of the columns of interest. Maybe you should try with 3:10000 instead.
@user2380782 Did you replace both 3:12 with 3:dim(df)[2]?
Yes, I did. It is a little weird... I'll try to figure out what is happening
Hello @Sven Hohenstein, the problem was that when I converted the data.frame to a matrix I forgot that indexing works differently there, so I should have written df2[,3:12] with a comma: df1[match(as.character(unlist(df2[,3:12]). A , makes a big difference...
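For readers who hit the same matrix issue, here is a rough sketch of the difference, assuming a character matrix laid out like df2 (two id columns followed by letter columns). The objects lookup, m and vals are hypothetical. On a matrix, m[3:12] picks ten single elements, while m[, 3:12] picks ten columns. A character matrix also cannot hold numbers without coercing them to text, so the sketch writes the looked-up values into a separate numeric matrix.

```r
set.seed(1)
lookup <- setNames(rnorm(26), LETTERS)   # named vector: letter -> value

# Hypothetical character matrix: 2 id columns, then 10 letter columns
m <- cbind(Name = sample(LETTERS, 6), bd = as.character(1:6),
           matrix(sample(LETTERS, 60, replace = TRUE), nrow = 6,
                  dimnames = list(NULL, paste0("P", 1:10))))

length(m[3:12])   # linear indexing: 10 single elements
dim(m[, 3:12])    # column indexing: a 6 x 10 block

# Look up every letter at once and restore the 6 x 10 shape
vals <- matrix(lookup[as.vector(m[, 3:12])], nrow = nrow(m),
               dimnames = list(NULL, colnames(m)[3:12]))
```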
0

Try some *pply magic:

lookup<-tapply(df1$V2, df1$V1, unique) #Creates a lookup table
lookup.function<-function(x) as.numeric(lookup[as.character(x)]) #The function
df4<-data.frame(df2[,1:2], apply(df2[,3:12], 2,lookup.function )) #Builds the output
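A possible variation on the same lookup idea, sketched under the assumption that every letter occurs exactly once in df1$V1: build the lookup with setNames() and replace the columns with lapply(), which keeps the original column names (P1, P2, ...) instead of rebuilding the data frame. The data here is a cut-down version of the question's example with only two letter columns.

```r
set.seed(42)
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
df2 <- data.frame(Name = sample(LETTERS, 6), bd = sample(1:6),
                  P1 = sample(LETTERS, 6), P2 = sample(LETTERS, 6),
                  stringsAsFactors = TRUE)   # factors, as in the question

# Named vector used as lookup table: letter -> value
lookup <- setNames(df1$V2, df1$V1)

# Replace the letter columns in a copy; as.character() handles factors
df4 <- df2
df4[3:4] <- lapply(df2[3:4], function(x) unname(lookup[as.character(x)]))
str(df4)
```

For the full example the column range would be 3:12 rather than 3:4.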

Update:

The *pply family is much faster than the match-based replacement from the other answer (labelled "merge" below), by at least an order of magnitude in my tests. Check this out:

num<-1000
df1 = data.frame(V1=LETTERS, V2=rnorm(26))
df2<-data.frame(cbind(first=1:num,second=1:num, matrix(sample(LETTERS, num^2, replace=TRUE), nrow=num, ncol=num)))


start<-Sys.time()
lookup<-tapply(df1$V2, df1$V1, unique)
lookup.function<-function(x) as.numeric(lookup[as.character(x)])
df4<-data.frame(cbind(df2[,1:2], data.frame(apply(df2[,3:(num+2)], 2, lookup.function ))))
(difftime(Sys.time(),start))


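start<-Sys.time()
# Use the same column range, 3:(num+2), as the *pply version above
df4.merge <- "[<-"(df2, 3:(num+2), value = df1[match(as.character(unlist(df2[3:(num+2)])), as.character(df1[[1]])), 2])
(difftime(Sys.time(),start))

# Both approaches should give identical values in the replaced columns
all(df4[,3:(num+2)]==df4.merge[,3:(num+2)])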

For 3000 columns and rows the *pply combination needs 4.3s whereas merge needs about 22s on my slow Intel. And it scales nicely. For 4000 columns and rows the respective times are 7.4 sec and 118 sec.
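Timings like these depend heavily on the machine and the R version, so it may be worth re-running the comparison locally. Below is a smaller sketch using system.time(); the size num = 200 and the names cols, res_apply and res_match are my own choices, not from the benchmark above, and the relative speed you see may differ from the figures quoted.

```r
set.seed(1)
num <- 200   # kept small so the sketch runs quickly
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
letters_mat <- matrix(sample(LETTERS, num^2, replace = TRUE), nrow = num)
df2 <- data.frame(first = 1:num, second = 1:num, letters_mat,
                  stringsAsFactors = FALSE)
cols <- 3:(num + 2)                  # the letter columns
lookup <- setNames(df1$V2, df1$V1)   # letter -> value

# Lookup-vector approach, applied column by column
t_apply <- system.time(
  res_apply <- data.frame(df2[1:2],
                          apply(df2[cols], 2, function(x) unname(lookup[x])))
)

# match()-based replacement on all letter columns at once
t_match <- system.time(
  res_match <- "[<-"(df2, cols,
                     value = df1$V2[match(unlist(df2[cols]), df1$V1)])
)

rbind(apply = t_apply, match = t_match)[, "elapsed", drop = FALSE]
all(res_apply[cols] == res_match[cols])   # same values either way
```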

6 Comments

If I converted the data.frames to a matrix class to speed up the calculations (I now have a list of data.frames that I would like to match against a single data.frame), could I adapt your approach? Thanks @ECII
Why do you have to convert anything? What's wrong with my approach? You have to give us reproducible examples.
I've converted the data.frame to a matrix because it comes from a previous step involving sampling in my list of data.frames, and every iteration with lapply is slow on data.frames compared to matrices. I did not post my problem including the list of data.frames because I thought it would be easier to understand if I summarised the list of data.frames as a single data.frame, but that was a mistake. There is nothing wrong with your approach; I am just asking if I could implement it on matrices. Thanks for your quick reply
You say you have data frames of 6 rows and 10,000 columns. My answer should take less than a second to match. What more do you need?
Hi @ECII, your approach is really fast, and it worked really well. However, I am trying to apply it to a list of matrices against a data.frame and it takes very long. Could that be because of the lapply? Thanks
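For the list-of-matrices case raised in the last comment, one option is to build the lookup vector once, outside the loop, and do a single vectorised lookup per matrix. This is only a sketch under assumed inputs: mats is a hypothetical list of character matrices containing letters, and res is the resulting list of numeric matrices.

```r
set.seed(7)
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
lookup <- setNames(df1$V2, df1$V1)   # built once, outside the loop

# Hypothetical input: a list of 7-row character matrices of letters
mats <- replicate(3,
                  matrix(sample(LETTERS, 7 * 50, replace = TRUE), nrow = 7),
                  simplify = FALSE)

# One vectorised lookup per matrix; the shape is restored afterwards
res <- lapply(mats, function(m) {
  out <- unname(lookup[as.vector(m)])
  dim(out) <- dim(m)
  out
})
```

If this is still slow, the time is probably being spent elsewhere in the pipeline, since the lookup itself touches each element only once.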