For loop to mutate multiple columns

Question

I have a tibble songs which is too big to share here. Also, it doesn't matter; the problem applies for any tibble that only has dbl values.

The idea is that I have one row I selected before. It can be any one of them, without any previous knowledge. The first thing I did was to filter it out:

songs2 <- songs %>%
  anti_join(choice)

This works.

By the way, choice has a single row.

Now, I create a second tibble (third, but second in this post) called dist, which only has dbl values (and therefore shares columns with choice). I want to subtract the values in choice from each row in dist.

I tried writing this:

for (i in seq_along(distUseful)) {
  dist <- dist %>%
    mutate_(distUseful[i] = (.data[[i]] - choice[[i]]))
}

But it doesn't work:

> for (i in seq_along(distUseful)) {
+   dist <- dist %>%
+     mutate_(distUseful[i] = (.data[[i]] - choice[[i]]))
Error: unexpected '=' in:
"  dist <- dist %>%
    mutate_(distUseful[i] ="
> }
Error: unexpected '}' in "}"

EDIT: This is the first 10 rows in songs2, as requested in the comments.

structure(list(acousticness = c(0.991, 0.643, 0.993, 0.000173, 
0.295, 0.996, 0.992, 0.996, 0.996, 0.00682), artists = c("['Mamie Smith']", 
"[\"Screamin' Jay Hawkins\"]", "['Mamie Smith']", "['Oscar Velazquez']", 
"['Mixe']", "['Mamie Smith & Her Jazz Hounds']", "['Mamie Smith']", 
"['Mamie Smith & Her Jazz Hounds']", "['Francisco Canaro']", 
"['Meetya']"), danceability = c(0.598, 0.852, 0.647, 0.73, 0.704, 
0.424, 0.782, 0.474, 0.469, 0.571), duration_ms = c(168333, 150200, 
163827, 422087, 165224, 198627, 195200, 186173, 146840, 476304
), energy = c(0.224, 0.517, 0.186, 0.798, 0.707, 0.245, 0.0573, 
0.239, 0.238, 0.753), explicit = c(FALSE, FALSE, FALSE, FALSE, 
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), id = c("0cS0A1fUEUd1EW3FcF8AEI", 
"0hbkKFIJm7Z05H8Zl9w30f", "11m7laMUgmOKqI3oYzuhne", "19Lc5SfJJ5O1oaxY0fpwfh", 
"2hJjbsLCytGsnAHfdsLejp", "3HnrHGLE9u2MjHtdobfWl9", "5DlCyqLyX2AOVDTjjkDZ8x", 
"02FzJbHtqElixxCmrpSCUa", "02i59gYdjlhBmbbWhf8YuK", "06NUxS2XL3efRh0bloxkHm"
), instrumentalness = c(0.000522, 0.0264, 1.76e-05, 0.801, 0.000246, 
0.799, 1.61e-06, 0.186, 0.96, 0.873), key = c(5, 5, 0, 2, 10, 
5, 5, 9, 8, 8), liveness = c(0.379, 0.0809, 0.519, 0.128, 0.402, 
0.235, 0.176, 0.195, 0.149, 0.092), loudness = c(-12.628, -7.261, 
-12.098, -7.311, -6.036, -11.47, -12.453, -9.712, -18.717, -6.943
), mode = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1), name = c("Keep A Song In Your Soul", 
"I Put A Spell On You", "Golfing Papa", "True House Music - Xavier Santos & Carlos Gomix Remix", 
"Xuniverxe", "Crazy Blues - 78rpm Version", "Don't You Advertise Your Man", 
"Arkansas Blues", "La Chacarera - Remasterizado", "Broken Puppet - Original Mix"
), popularity = c(12, 7, 4, 17, 2, 9, 5, 0, 0, 0), release_date = c("1920", 
"1920-01-05", "1920", "1920-01-01", "1920-10-01", "1920", "1920", 
"1920", "1920-07-08", "1920-01-01"), speechiness = c(0.0936, 
0.0534, 0.174, 0.0425, 0.0768, 0.0397, 0.0592, 0.0289, 0.0741, 
0.0446), tempo = c(149.976, 86.889, 97.6, 127.997, 122.076, 103.87, 
85.652, 78.784, 130.06, 126.993), valence = c(0.634, 0.95, 0.689, 
0.0422, 0.299, 0.477, 0.487, 0.366, 0.621, 0.119), year = c(1920, 
1920, 1920, 1920, 1920, 1920, 1920, 1920, 1920, 1920)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

This is choice:

structure(list(acousticness = 0.511, danceability = 0.403, duration_ms = 117395, 
    instrumentalness = 0.896, liveness = 0.108, loudness = -8.126, 
    popularity = 65, speechiness = 0.0514, tempo = 135.047, valence = 0.192), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

And finally:

distUseful <- c("acousticness", "danceability", "duration_ms", "duration_ms", "instrumentalness", "liveness", "loudness", "popularity", "speechiness", "tempo", "valence")

EDIT 2: Just an afterthought: if you take the loop I cited earlier and run a single iteration by hand (with a variable of your choice), it works. So the problem lies in the first part, distUseful[i] =, as the error messages say and as I found by playing with the code.
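The parse error is consistent with that diagnosis: inside a function call, the left-hand side of = has to be a plain name, so an expression such as distUseful[i] is rejected before any code runs. Below is a minimal sketch of one way to keep the loop, assuming dplyr 0.7 or later (where := and !! are available through rlang). The three-row dist and the cols vector are made-up stand-ins for the real data, not the original objects.

```r
library(dplyr)

# Toy stand-ins for the real data (illustrative values only)
dist <- tibble(acousticness = c(0.991, 0.643, 0.993),
               danceability = c(0.598, 0.852, 0.647))
choice <- tibble(acousticness = 0.511, danceability = 0.403)
cols <- c("acousticness", "danceability")  # hypothetical, de-duplicated column list

for (col in cols) {
  # `!!col :=` lets a string stored in a variable name the output column
  dist <- dist %>%
    mutate(!!col := .data[[col]] - choice[[col]])
}
dist
```

Looping over the column names rather than positions also avoids depending on dist and choice having their columns in the same order.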

EDIT 3: As an example, here's what happens if this is done only to the first column (so the first one is correct and the rest didn't change):

> dist %>%
+     mutate(acousticness = (dist[[1]] - choice[[1]]))
# A tibble: 174,388 x 10
   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness tempo valence
          <dbl>        <dbl>       <dbl>            <dbl>    <dbl>    <dbl>      <dbl>       <dbl> <dbl>   <dbl>
 1        0.48         0.598      168333       0.000522     0.379    -12.6          12      0.0936 150.   0.634 
 2        0.132        0.852      150200       0.0264       0.0809    -7.26          7      0.0534  86.9  0.95  
 3        0.482        0.647      163827       0.0000176    0.519    -12.1           4      0.174   97.6  0.689 
 4       -0.511        0.73       422087       0.801        0.128     -7.31         17      0.0425 128.   0.0422
 5       -0.216        0.704      165224       0.000246     0.402     -6.04          2      0.0768 122.   0.299 
 6        0.485        0.424      198627       0.799        0.235    -11.5           9      0.0397 104.   0.477 
 7        0.481        0.782      195200       0.00000161   0.176    -12.5           5      0.0592  85.7  0.487 
 8        0.485        0.474      186173       0.186        0.195     -9.71          0      0.0289  78.8  0.366 
 9        0.485        0.469      146840       0.96         0.149    -18.7           0      0.0741 130.   0.621 
10       -0.504        0.571      476304       0.873        0.092     -6.94          0      0.0446 127.   0.119

You should share some data so we could reproduce your problem. Even though you believe we should abstract the whole thing and approach your issue conceptually, it would be much easier to help you with some data. It doesn't have to be your whole dataframe. Something like pasting dput(head(data_frame_name, 10)) would help.
– GuedesBF, Commented Apr 2, 2021 at 12:19
Yes, precisely. I think I'll add an "expected output" in there.
– Érico Patto, Commented Apr 2, 2021 at 12:58
You most certainly would do better with dplyr than with this for loop approach here.
– GuedesBF, Commented Apr 2, 2021 at 13:15
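For what the dplyr route hinted at in that last comment might look like, here is a sketch assuming dplyr 1.0.0 or later (for across() and cur_column()). The small dist and choice tibbles are invented for illustration and only mimic the shape of the real ones.

```r
library(dplyr)

# Toy stand-ins for the real data (illustrative values only)
dist <- tibble(acousticness = c(0.991, 0.643, 0.993),
               danceability = c(0.598, 0.852, 0.647))
choice <- tibble(acousticness = 0.511, danceability = 0.403)

# For every column named in choice, subtract the matching choice value
result <- dist %>%
  mutate(across(all_of(names(choice)), ~ .x - choice[[cur_column()]]))
result
```

Because the lookup is by column name, no explicit loop or index bookkeeping is needed.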

Necroticka · Accepted Answer · 2021-04-02 13:14:38Z

Assuming that dist is a tibble and choice is a vector of values (whose length is equal to the number of columns in dist), I would try something like this:

amend_row <- function(amend_vals, ...) {
   ... - amend_vals
}

purrr::pmap(dist, ~ amend_row(amend_vals = choice, .)) %>%
   do.call(what = rbind, args = .) %>%
   as_tibble() %>% 
   purrr::set_names(nm = colnames(dist))

Érico Patto Over a year ago

Actually, choice is a tibble, but you're right, I didn't think to use purrr...

Érico Patto Over a year ago

Question: what does set_names do here?

Érico Patto Over a year ago

And ironically, this seems to work perfectly for the first column, but the rest gives me crazy numbers... Maybe it's because choice is a tibble and not a vector?

Érico Patto Over a year ago

Even if choice is a vector, it still doesn't work. I tested it by turning choice into a vector with as.numeric()...
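A possible explanation for the "first column is right, the rest are crazy" behaviour, offered as a guess rather than a confirmed diagnosis: in a purrr formula, `.` stands for the first argument only, so each row contributes just its first value, which is then recycled against all of choice. The sketch below uses a made-up two-column dist and a named vector choice to show the difference, and uses c(...) to gather the whole row instead.

```r
library(purrr)
library(tibble)

# Toy data (illustrative values only)
dist <- tibble(a = c(1, 2), b = c(10, 20))
choice <- c(a = 1, b = 10)

# `.` is only the first argument (column a), so each result is a_i - choice
first_only <- pmap(dist, ~ . - choice)

# c(...) collects every value in the row before subtracting
whole_row <- pmap(dist, ~ c(...) - choice)
fixed <- as_tibble(do.call(rbind, whole_row))
fixed
```

With c(...), the row-wise results keep their column names, so the later set_names() step should no longer be necessary.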

GuedesBF · Accepted Answer · 2021-04-10 15:53:59Z

I had some difficulties because I think names(choice) and distUseful do not match entirely.

I reassigned distUseful to names(choice) before the loop:

distUseful <- names(choice)
dist <- df[distUseful]  # df holds the sample rows posted in the question

Then, solution with a for loop

for (i in 1:nrow(dist)){
        for (j in seq_along(distUseful)){
                dist[i,j]<-dist[i,j]-choice[1,j]
        }
}

This subtracted the values as requested.

dist
# A tibble: 10 x 10
   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness  tempo valence
          <dbl>        <dbl>       <dbl>            <dbl>    <dbl>    <dbl>      <dbl>       <dbl>  <dbl>   <dbl>
 1        0.48        0.195        50938          -0.895    0.271    -4.50         -53     0.0422   14.9    0.442
 2        0.132       0.449        32805          -0.870   -0.0271    0.865        -58     0.002   -48.2    0.758
 3        0.482       0.244        46432          -0.896    0.411    -3.97         -61     0.123   -37.4    0.497
 4       -0.511       0.327       304692          -0.0950   0.02      0.815        -48    -0.00890  -7.05  -0.150
 5       -0.216       0.301        47829          -0.896    0.294     2.09         -63     0.0254  -13.0    0.107
 6        0.485       0.0210       81232          -0.0970   0.127    -3.34         -56    -0.0117  -31.2    0.285
 7        0.481       0.379        77805          -0.896    0.0680   -4.33         -60     0.0078  -49.4    0.295
 8        0.485       0.0710       68778          -0.71     0.087    -1.59         -65    -0.0225  -56.3    0.174
 9        0.485       0.0660       29445           0.0640   0.0410  -10.6          -65     0.0227   -4.99   0.429
10       -0.504       0.168       358909          -0.023   -0.016     1.18         -65    -0.0068   -8.05  -0.073
>

For loops can be slow. In this case we have nested for loops, which could be a problem with large data frames. A dplyr, *apply(), or data.table solution could be faster.

A faster one-line solution with mapply() (which only loops over columns, with vectorized subtraction of x and y):

data.frame(mapply(function(x,y)x-y, dist, choice))

   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness   tempo valence
1      0.480000        0.195       50938       -0.8954780   0.2710   -4.502        -53      0.0422  14.929  0.4420
2      0.132000        0.449       32805       -0.8696000  -0.0271    0.865        -58      0.0020 -48.158  0.7580
3      0.482000        0.244       46432       -0.8959824   0.4110   -3.972        -61      0.1226 -37.447  0.4970
4     -0.510827        0.327      304692       -0.0950000   0.0200    0.815        -48     -0.0089  -7.050 -0.1498
5     -0.216000        0.301       47829       -0.8957540   0.2940    2.090        -63      0.0254 -12.971  0.1070
6      0.485000        0.021       81232       -0.0970000   0.1270   -3.344        -56     -0.0117 -31.177  0.2850
7      0.481000        0.379       77805       -0.8959984   0.0680   -4.327        -60      0.0078 -49.395  0.2950
8      0.485000        0.071       68778       -0.7100000   0.0870   -1.586        -65     -0.0225 -56.263  0.1740
9      0.485000        0.066       29445        0.0640000   0.0410  -10.591        -65      0.0227  -4.987  0.4290
10    -0.504180        0.168      358909       -0.0230000  -0.0160    1.183        -65     -0.0068  -8.054 -0.0730
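One thing to keep in mind with the mapply() one-liner is that it pairs columns by position, so it relies on dist and choice having the same columns in the same order. A base-R sketch of a name-matched variant using sweep() follows; the two-row data frames are invented for illustration, and choice is deliberately given in a different column order.

```r
# Toy data (illustrative values only); choice is deliberately out of order
dist <- data.frame(acousticness = c(0.991, 0.643),
                   tempo = c(149.976, 86.889))
choice <- data.frame(tempo = 135.047, acousticness = 0.511)

# Reorder choice to match dist, then flatten the single row to a named vector
stats <- unlist(choice[names(dist)])

# sweep() subtracts stats[j] from column j (its default FUN is "-")
result <- as.data.frame(sweep(as.matrix(dist), 2, stats))
result
```

This is still a fully vectorized column-wise subtraction, so it should scale in roughly the same way as the mapply() version, though that has not been timed here.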

edited Apr 10, 2021 at 15:53

answered Apr 2, 2021 at 14:16

Érico Patto Over a year ago

Hey, you're right! I put two copies of "duration_ms" in distUseful... Oops

Érico Patto Over a year ago

The problem is that seq_along() counts the number of elements in a vector. nrow() is just a number. Just change it to 1:nrow(dist) and choice[1,j].

Érico Patto Over a year ago

Ok, that worked, but it's also incredibly slow... Maybe there's a faster way? (Keep in mind I'm doing this with a 174,388-row tibble, as is available here)

GuedesBF Over a year ago

I found a quite simple one-liner with mapply, @ÉricoPatto. Edited my answer. This must be faster than the nested for loops.

Érico Patto Over a year ago

Hooray! That's the fastest one yet! Just took 0.027s elapsed for the entire tibble! (I just changed the data.frame to as_tibble) Hooray!

Érico Patto · Accepted Answer · 2021-04-02 15:08:58Z

Playing around with everyone's suggestions, I came up with many ideas. Only one of them worked.

I used a modified version of @Johny's function (and corrected my vector distUseful as @GuedesBF mentioned), used the suggestion not to go for a loop and came up with apply:

amend_row <- function(data) {
  data - as.numeric(choice)
}

dist %>%
  apply(X = ., FUN = amend_row, MARGIN = 1) %>%
  t() %>%
  as_tibble()

This gives me:

> dist %>%
+   apply(X = ., FUN = amend_row, MARGIN = 1) %>%
+   t() %>%
+   as_tibble()
# A tibble: 174,388 x 10
   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness  tempo valence
          <dbl>        <dbl>       <dbl>            <dbl>    <dbl>    <dbl>      <dbl>       <dbl>  <dbl>   <dbl>
 1        0.48         0.195       50938          -0.895    0.271    -4.50         -53     0.0422   14.9    0.442
 2        0.132        0.449       32805          -0.870   -0.0271    0.865        -58     0.002   -48.2    0.758
 3        0.482        0.244       46432          -0.896    0.411    -3.97         -61     0.123   -37.4    0.497
 4       -0.511        0.327      304692          -0.0950   0.0200    0.815        -48    -0.00890  -7.05  -0.150
 5       -0.216        0.301       47829          -0.896    0.294     2.09         -63     0.0254  -13.0    0.107
 6        0.485        0.021       81232          -0.0970   0.127    -3.34         -56    -0.0117  -31.2    0.285
 7        0.481        0.379       77805          -0.896    0.068    -4.33         -60     0.0078  -49.4    0.295
 8        0.485        0.071       68778          -0.710    0.087    -1.59         -65    -0.0225  -56.3    0.174
 9        0.485        0.066       29445           0.064    0.0410  -10.6          -65     0.0227   -4.99   0.429
10       -0.504        0.168      358909          -0.0230  -0.016     1.18         -65    -0.0068   -8.05  -0.073
# … with 174,378 more rows

In a ridiculously short amount of time.

EDIT: Here is the time difference using only the first 1000 rows:

# MY SOLUTION (subtraction is presumably the amend_row function above under another name)
> dist <- songs2 %>%
+   select(all_of(distUseful)) %>%
+   head(1000)
> system.time(dist %>%
+               apply(X = ., FUN = subtraction, MARGIN = 1) %>%
+               t() %>%
+               as_tibble())
   user  system elapsed 
  0.006   0.000   0.006 
# THE FUNCTION SOLUTION – DIDN'T WORK PROPERLY (last I checked)
> amend_row <- function(amend_vals, ...) {
+   ... - amend_vals
+ }
> system.time(purrr::pmap(dist, ~ amend_row(amend_vals = choice, .)) %>%
+               do.call(what = rbind, args = .) %>%
+               as_tibble() %>% 
+               purrr::set_names(nm = colnames(dist)))
   user  system elapsed 
  1.222   0.016   1.261 
# NOT A LOT OF TIDYVERSE SOLUTION – SLOOOOOWWWWWW
> system.time(for (i in 1:nrow(dist)){
+   for (j in seq_along(distUseful)){
+     dist[i,j]<-dist[i,j]-choice[1,j]
+   }
+ })
   user  system elapsed 
  7.359   0.046   7.482

Excellent. It would be nice if you could show the differences in time you got with system.time()
The two other solutions didn't even finish. I know it's faster because it's faster when I run it with head() only and because... well, it actually finished. But I'll put the difference with the head(), good idea.
