replace some column values from a data.frame based on another data.frame

Question

I have two data.frames (df1, df2). I would like to replace the letters in columns P1-P10 of df2 with the corresponding values of df1$V2, while keeping the first two columns of df2.

df1 = data.frame(V1=LETTERS, V2=rnorm(26))

df2 <- data.frame(Name=sample(LETTERS, 6), bd=sample(1:6), P1=sample(LETTERS,6), P2=sample(LETTERS, 6), P3=sample(LETTERS, 6), P4=sample(LETTERS, 6), P5=sample(LETTERS, 6), P6=sample(LETTERS, 6), P7=sample(LETTERS, 6), P8=sample(LETTERS, 6), P9=sample(LETTERS, 6), P10=sample(LETTERS, 6))

My approach is the following:

df3 <- matrix(setNames(df1[,2], df1[,1])[as.character(unlist(df2[,3:12]))], nrow=6, ncol=10)
df4 <- data.frame(cbind(df2[,1:2], df3))

This gives me my desired output. My real data has 10,000 columns; is there any way to avoid the cbind step or make the process faster?

> df4
  Name bd         X1          X2         X3         X4         X5         X6        X7         X8         X9        X10
1    V  6 -1.8991102  0.40269050 -0.1517500 -2.5297829  1.5315622  1.4897071  1.364071 -1.2443708 -1.3197276 -0.4917057
2    T  1 -2.5297829 -0.44614123 -0.1894970 -0.6693774 -0.1517500 -1.0650962 -0.151750 -0.4461412 -0.6693774 -1.1351770
3    R  5 -0.6693774  0.09059365 -2.5297829  0.3233827 -0.9383348 -0.4461412  1.281797  1.5315622  1.4897071 -0.4461412
4    B  4 -0.4461412 -0.93833476 -1.2443708 -0.4461412 -0.1894970 -0.9383348 -1.135177 -1.8991102 -0.1894970  0.4026905
5    K  2 -1.0180271 -1.06509624 -0.1939600 -0.1894970  1.4897071 -0.6693774 -1.899110 -1.3197276  1.5315622 -0.1517500
6    Y  3  1.5315622 -0.19396005 -0.4917057 -0.4664239 -1.8991102  0.4026905 -1.065096 -0.9383348 -1.2443708 -0.4664239

Thanks

In your example P1-P10 are factors. Is it really so in your dataset?
– ECII, Commented Dec 11, 2013 at 20:33
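
Why the factor question matters can be shown with a small sketch. The data below is a hypothetical toy example, not the asker's data. Indexing a named vector with a factor uses the factor's integer codes rather than its labels, which is why both answers wrap the columns in as.character(). Note that since R 4.0.0, data.frame() no longer converts strings to factors by default, so on a current R the P columns would be character vectors.

```r
# Hypothetical toy data, only to illustrate the pitfall
lookup <- c(A = 10, B = 20, C = 30)             # named numeric vector
f <- factor(c("C", "A"), levels = c("C", "A"))  # integer codes: C = 1, A = 2

by_code <- unname(lookup[f])                # uses codes 1 and 2, i.e. positions
by_name <- unname(lookup[as.character(f)])  # uses labels "C" and "A"

by_code  # 10 20 (wrong for a name lookup)
by_name  # 30 10 (intended)
```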

Sven Hohenstein · Accepted Answer · 2013-12-11 19:54:57Z

3

You can match the values of df2[3:12] against df1[[1]]. The resulting row numbers are then used to extract the corresponding values from the second column of df1.

df2[3:12] <- df1[match(as.character(unlist(df2[3:12])), 
                       as.character(df1[[1]])), 2]

The result (df2):

  Name bd         P1         P2         P3         P4         P5         P6         P7         P8         P9        P10
1    H  5  0.1199355  0.3752010 -0.3926061 -1.1039548 -0.1107821  0.9867373 -0.3360094 -0.7488000 -0.3926061  2.0667704
2    U  4  0.1168599  0.1168599  0.9867373  1.3521418  0.9867373 -0.3360094 -0.7724007 -0.3926061 -0.3360094 -1.2543480
3    R  3 -1.2337890 -0.1107821 -0.7724007  2.0667704  0.3752010  0.4645504  0.9867373  0.1168599 -0.0981773 -0.3926061
4    G  2 -0.3926061  0.3199261 -0.0981773 -0.1107821  2.0667704 -1.1039548 -1.2337890  0.3199261 -1.2337890 -2.1534678
5    C  6 -2.1534678 -1.1039548 -1.1039548 -0.7488000  0.4645504  0.3199261 -2.1534678 -0.3360094  0.9867373  0.8771467
6    I  1  0.6171634  0.6224091  1.8011711  0.7292998  0.8771467  2.0667704  0.3752010  0.4645504 -2.1534678 -0.7724007
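
To make the mechanics concrete, here is a minimal sketch on hypothetical toy data (the names keys, values and toy are made up for illustration). match() returns, for each query value, its position in the key vector; unlist() flattens the selected columns column by column, and the assignment fills them back in the same order.

```r
# Hypothetical toy data
keys   <- c("A", "B", "C")
values <- c(1.5, -0.2, 3.0)

toy <- data.frame(id = 1:2,
                  P1 = c("A", "C"),
                  P2 = c("B", "A"),
                  stringsAsFactors = FALSE)

flat <- as.character(unlist(toy[2:3]))  # column-major: "A" "C" "B" "A"
idx  <- match(flat, keys)               # positions in keys: 1 3 2 1

# Assigning a vector to several columns fills them column by column
toy[2:3] <- values[idx]
toy$P1  # 1.5 3.0
toy$P2  # -0.2 1.5
```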

If you don't want to replace the values inside df2, you can create a new data frame df4 with

df4 <- "[<-"(df2, 3:12, value = df1[match(as.character(unlist(df2[3:12])), 
                                          as.character(df1[[1]])), 2])
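
The functional form of the assignment can look cryptic, so here is a small sketch on hypothetical toy data (orig, copy and copy2 are illustrative names). Calling `[<-` as an ordinary function returns the modified data frame instead of assigning it back, which should be equivalent to copying first and then assigning.

```r
# Hypothetical toy data
orig <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

# Functional form: returns a modified copy, orig is left as it was
copy <- `[<-`(orig, 2, value = c(10, 20, 30))

# The more familiar two-step equivalent
copy2 <- orig
copy2[2] <- c(10, 20, 30)

orig$b  # "x" "y" "z"
copy$b  # 10 20 30
```

The two-step version is arguably easier to read; the functional form mainly saves a line.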

edited Dec 11, 2013 at 19:54

answered Dec 11, 2013 at 19:48

Sven Hohenstein


10 Comments

user2380782 Over a year ago

Hello again @Sven Hohenstein, something weird is happening with my data. I have converted my data.frame to a matrix to speed up the calculations (7 rows, 10,000 columns). When I apply your solution, the values are replaced only up to a certain column; for example, from column 1220 onwards the values are not replaced and keep their "character" value. The dput of my data is too big to post here, but I'd like to ask for your help. Many thanks

Sven Hohenstein Over a year ago

@user2380782 In my code, 3:12 represents the numbers of the columns of interest. Maybe you should try with 3:10000 instead.

Sven Hohenstein Over a year ago

@user2380782 Did you replace both 3:12 with 3:dim(df)[2]?

user2380782 Over a year ago

Yes, I did. It is a little weird... I'll try to figure out what is happening

user2380782 Over a year ago

Hello @Sven Hohenstein, the problem was that when I converted the data.frame to a matrix I forgot that the indexing is different, so I should have written df1[match(as.character(unlist(df2[,3:12]). A comma makes a big difference...
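
The difference described in the last comment can be reproduced with a small sketch on hypothetical toy data. With a single index, a data.frame selects columns, whereas a matrix selects individual elements in column-major order; selecting matrix columns needs the comma.

```r
# Hypothetical toy data
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
m  <- as.matrix(df)

df_cols <- unlist(df[2:3], use.names = FALSE)  # columns b and c: 3 4 5 6
m_elems <- m[2:3]                              # 2nd and 3rd elements only: 2 3
m_cols  <- as.vector(m[, 2:3])                 # columns b and c: 3 4 5 6
```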


ECII · Answer · 2013-12-12 07:34:02Z

0

Try some *pply magic:

lookup <- tapply(df1$V2, df1$V1, unique)  # Creates a lookup table named by letter
lookup.function <- function(x) as.numeric(lookup[as.character(x)])  # Looks up one column
df4 <- data.frame(df2[, 1:2], apply(df2[, 3:12], 2, lookup.function))  # Builds the output
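
As a small illustration of what the lookup table contains, here is a sketch on hypothetical toy data (keys and vals are made-up names). When the keys are unique, tapply(..., unique) yields a one-dimensional array named by key, which for lookup purposes behaves like the setNames() vector used in the question.

```r
# Hypothetical toy data with unique keys
keys <- c("B", "A", "C")
vals <- c(2, 1, 3)

lookup_tapply <- tapply(vals, keys, unique)  # named by sorted key: A, B, C
lookup_named  <- setNames(vals, keys)        # named in input order: B, A, C

from_tapply <- as.numeric(lookup_tapply[c("C", "A")])
from_named  <- unname(lookup_named[c("C", "A")])
from_tapply  # 3 1
from_named   # 3 1
```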

Update:

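The *pply family is much faster than the match-based replacement from the other answer (called "merge" below, although it does not use base R's merge()): about 5 times faster at 3000 x 3000 and about 16 times faster at 4000 x 4000 in my timings. Check this out: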

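num <- 1000
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
# df2 has num rows and num + 2 columns: two ID columns plus num letter columns
df2 <- data.frame(cbind(first = 1:num, second = 1:num,
                        matrix(sample(LETTERS, num^2, replace = TRUE), nrow = num, ncol = num)))
cols <- 3:(num + 2)  # the letter columns; the original 3:num skipped the last two

# Approach 1: named lookup table + apply over columns
start <- Sys.time()
lookup <- tapply(df1$V2, df1$V1, unique)
lookup.function <- function(x) as.numeric(lookup[as.character(x)])
df4 <- data.frame(cbind(df2[, 1:2], data.frame(apply(df2[, cols], 2, lookup.function))))
(difftime(Sys.time(), start))

# Approach 2 ("merge"): match-based replacement from the other answer
start <- Sys.time()
df4.merge <- "[<-"(df2, cols, value = df1[match(as.character(unlist(df2[cols])), as.character(df1[[1]])), 2])
(difftime(Sys.time(), start))

# Both approaches should agree on every replaced value
all(df4[cols] == df4.merge[cols])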

For 3000 columns and rows the *pply combination needs 4.3 s whereas the match-based version (df4.merge) needs about 22 s on my slow Intel. And it scales nicely: for 4000 columns and rows the respective times are 7.4 s and 118 s.

edited Dec 12, 2013 at 7:34

answered Dec 11, 2013 at 20:50

ECII


6 Comments

user2380782 Over a year ago

If I convert the data.frames to matrices to speed up the calculations (I now have a list of data.frames that I would like to match against a single data.frame), could I adapt your approach? Thanks @ECII

ECII Over a year ago

Why do you have to convert anything? What's wrong with my approach? You have to give us reproducible examples.

user2380782 Over a year ago

I've converted the data.frame to a matrix because it comes from a previous step involving sampling in my list of data.frames and every iteration with lapply is slow in data.frames compared to matrices. I have not posted my problem including the list of data.frames because I thought it would be more understandable just summarising a list of data.frames into a data.frame but it was a mistake. But there is nothing wrong with your approach, just asking if I could implement it on matrices. Thanks for your quick reply

ECII Over a year ago

You say you have data frames of 6 rows and 10,000 columns. My answer should take less than a second to match. What more do you need?

user2380782 Over a year ago

Hi @ECII, your approach is really fast and it worked really well. However, I am trying to apply it to a list of matrices against a data.frame and it takes very long. Could that be because of the lapply? Thanks
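
For the list-of-matrices case raised in the comments, one possible adaptation is sketched below. This is an assumption-laden sketch rather than anything posted in the thread: the names lookup, mats and replace_letters are hypothetical, and it assumes each matrix holds only letters that appear in the lookup table. Because a matrix is a single vector with dimensions, the whole matrix can be looked up in one vectorised step, with no apply() over columns. Whether this is faster on the real data has not been measured here.

```r
# Hypothetical toy data: a named lookup vector and a list of character matrices
lookup <- c(A = 1.5, B = -0.2, C = 3.0)
mats <- list(
  matrix(c("A", "B", "C", "A"), nrow = 2),
  matrix(c("C", "C", "B", "A"), nrow = 2)
)

# Look up every cell at once, then restore the matrix shape
replace_letters <- function(m) {
  matrix(unname(lookup[as.vector(m)]), nrow = nrow(m), ncol = ncol(m))
}

result <- lapply(mats, replace_letters)
result[[1]]
```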
