1

I have a tibble, songs, which is too big to share here. It shouldn't matter, though: the problem applies to any tibble that has only dbl values.

The idea is that I have one row I selected before. It can be any one of them, without any previous knowledge. The first thing I did was to filter it out:

songs2 <- songs %>%
  anti_join(choice)

This works.

By the way, choice has a single row.

Now, I create a second tibble (third, but second in this post) called dist, which only has dbl values (and therefore shares columns with choice). I want to subtract the values in choice from each row in dist.

I tried writing this:

for (i in seq_along(distUseful)) {
  dist <- dist %>%
    mutate_(distUseful[i] = (.data[[i]] - choice[[i]]))
}

But it doesn't work:

> for (i in seq_along(distUseful)) {
+   dist <- dist %>%
+     mutate_(distUseful[i] = (.data[[i]] - choice[[i]]))
Error: unexpected '=' in:
"  dist <- dist %>%
    mutate_(distUseful[i] ="
> }
Error: unexpected '}' in "}"

EDIT: This is the first 10 rows in songs2, as requested in the comments.

structure(list(acousticness = c(0.991, 0.643, 0.993, 0.000173, 
0.295, 0.996, 0.992, 0.996, 0.996, 0.00682), artists = c("['Mamie Smith']", 
"[\"Screamin' Jay Hawkins\"]", "['Mamie Smith']", "['Oscar Velazquez']", 
"['Mixe']", "['Mamie Smith & Her Jazz Hounds']", "['Mamie Smith']", 
"['Mamie Smith & Her Jazz Hounds']", "['Francisco Canaro']", 
"['Meetya']"), danceability = c(0.598, 0.852, 0.647, 0.73, 0.704, 
0.424, 0.782, 0.474, 0.469, 0.571), duration_ms = c(168333, 150200, 
163827, 422087, 165224, 198627, 195200, 186173, 146840, 476304
), energy = c(0.224, 0.517, 0.186, 0.798, 0.707, 0.245, 0.0573, 
0.239, 0.238, 0.753), explicit = c(FALSE, FALSE, FALSE, FALSE, 
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), id = c("0cS0A1fUEUd1EW3FcF8AEI", 
"0hbkKFIJm7Z05H8Zl9w30f", "11m7laMUgmOKqI3oYzuhne", "19Lc5SfJJ5O1oaxY0fpwfh", 
"2hJjbsLCytGsnAHfdsLejp", "3HnrHGLE9u2MjHtdobfWl9", "5DlCyqLyX2AOVDTjjkDZ8x", 
"02FzJbHtqElixxCmrpSCUa", "02i59gYdjlhBmbbWhf8YuK", "06NUxS2XL3efRh0bloxkHm"
), instrumentalness = c(0.000522, 0.0264, 1.76e-05, 0.801, 0.000246, 
0.799, 1.61e-06, 0.186, 0.96, 0.873), key = c(5, 5, 0, 2, 10, 
5, 5, 9, 8, 8), liveness = c(0.379, 0.0809, 0.519, 0.128, 0.402, 
0.235, 0.176, 0.195, 0.149, 0.092), loudness = c(-12.628, -7.261, 
-12.098, -7.311, -6.036, -11.47, -12.453, -9.712, -18.717, -6.943
), mode = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1), name = c("Keep A Song In Your Soul", 
"I Put A Spell On You", "Golfing Papa", "True House Music - Xavier Santos & Carlos Gomix Remix", 
"Xuniverxe", "Crazy Blues - 78rpm Version", "Don't You Advertise Your Man", 
"Arkansas Blues", "La Chacarera - Remasterizado", "Broken Puppet - Original Mix"
), popularity = c(12, 7, 4, 17, 2, 9, 5, 0, 0, 0), release_date = c("1920", 
"1920-01-05", "1920", "1920-01-01", "1920-10-01", "1920", "1920", 
"1920", "1920-07-08", "1920-01-01"), speechiness = c(0.0936, 
0.0534, 0.174, 0.0425, 0.0768, 0.0397, 0.0592, 0.0289, 0.0741, 
0.0446), tempo = c(149.976, 86.889, 97.6, 127.997, 122.076, 103.87, 
85.652, 78.784, 130.06, 126.993), valence = c(0.634, 0.95, 0.689, 
0.0422, 0.299, 0.477, 0.487, 0.366, 0.621, 0.119), year = c(1920, 
1920, 1920, 1920, 1920, 1920, 1920, 1920, 1920, 1920)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

This is choice:

structure(list(acousticness = 0.511, danceability = 0.403, duration_ms = 117395, 
    instrumentalness = 0.896, liveness = 0.108, loudness = -8.126, 
    popularity = 65, speechiness = 0.0514, tempo = 135.047, valence = 0.192), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

And finally:

distUseful <- c("acousticness", "danceability", "duration_ms", "duration_ms", "instrumentalness", "liveness", "loudness", "popularity", "speechiness", "tempo", "valence")

EDIT 2: Just an afterthought: if you take the loop I cited earlier and run it for a single iteration (with a variable of your choice), it works. In fact, the problem lies in the first part, distUseful[i] =, as per the error messages and from playing with the code.
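For what it's worth, the parse error is consistent with how R reads a call: the left side of `=` inside `mutate()` has to be a plain name or a string, so an expression such as `distUseful[i]` is rejected before dplyr ever runs. A minimal sketch of one way around it, assuming a recent dplyr, is to build the column name with `!!` and `:=`, and to index `.data` and `choice` by name rather than by position. The toy tibbles below are made-up stand-ins, not the real songs data.

```r
library(dplyr)

# Toy stand-ins for `dist` and `choice` (invented values for illustration)
dist <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))
choice <- tibble(a = 0.5, b = 5)

result <- dist
for (col in names(choice)) {
  # `!!col :=` lets the new column name come from a character value;
  # `.data[[col]]` and `choice[[col]]` look the column up by name
  result <- result %>%
    mutate(!!col := .data[[col]] - choice[[col]])
}

result
```

Looping over `names(choice)` instead of a hand-written vector also avoids listing a column twice.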

EDIT 3: As an example, here's what happens if this is done only to the first column (so the first one is correct and the rest didn't change):

> dist %>%
+     mutate(acousticness = (dist[[1]] - choice[[1]]))
# A tibble: 174,388 x 10
   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness tempo valence
          <dbl>        <dbl>       <dbl>            <dbl>    <dbl>    <dbl>      <dbl>       <dbl> <dbl>   <dbl>
 1        0.48         0.598      168333       0.000522     0.379    -12.6          12      0.0936 150.   0.634 
 2        0.132        0.852      150200       0.0264       0.0809    -7.26          7      0.0534  86.9  0.95  
 3        0.482        0.647      163827       0.0000176    0.519    -12.1           4      0.174   97.6  0.689 
 4       -0.511        0.73       422087       0.801        0.128     -7.31         17      0.0425 128.   0.0422
 5       -0.216        0.704      165224       0.000246     0.402     -6.04          2      0.0768 122.   0.299 
 6        0.485        0.424      198627       0.799        0.235    -11.5           9      0.0397 104.   0.477 
 7        0.481        0.782      195200       0.00000161   0.176    -12.5           5      0.0592  85.7  0.487 
 8        0.485        0.474      186173       0.186        0.195     -9.71          0      0.0289  78.8  0.366 
 9        0.485        0.469      146840       0.96         0.149    -18.7           0      0.0741 130.   0.621 
10       -0.504        0.571      476304       0.873        0.092     -6.94          0      0.0446 127.   0.119 
  • You should share some data so we could reproduce your problem. Even though you believe we should abstract the whole thing and approach your issue conceptually, it would be much easier to help you with some data. It doesn't have to be your whole dataframe. Something like pasting dput(head(data_frame_name, 10)) would help. Commented Apr 2, 2021 at 12:19
  • Ok, I added songs2, choice and distUseful. Commented Apr 2, 2021 at 12:25
  • You would like to subtract choice from the songs2 rows here? Commented Apr 2, 2021 at 12:52
  • Yes, precisely. I think I'll add an "expected output" in there. Commented Apr 2, 2021 at 12:58
  • You most certainly would do better with dplyr than with this for loop approach here Commented Apr 2, 2021 at 13:15

3 Answers

2

Assuming that dist is a tibble and choice is a vector of values (whose length is equal to the number of columns in dist), I would try something like this:

amend_row <- function(amend_vals, ...) {
   ... - amend_vals
}

purrr::pmap(dist, ~ amend_row(amend_vals = choice, .)) %>%
   do.call(what = rbind, args = .) %>%
   as_tibble() %>% 
   purrr::set_names(nm = colnames(dist))
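
As the comments below report, this produces sensible values only in the first column. One plausible explanation is that inside a `~` lambda, `.` stands for the first argument only, so every row contributes just its first value. A hedged sketch of a variant that passes the whole row, using made-up toy data in place of the real `dist` and `choice`:

```r
library(purrr)
library(tibble)

# Toy stand-ins for `dist` and `choice` (invented values for illustration)
dist <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))
choice <- tibble(a = 0.5, b = 5)

# Flatten the one-row tibble into a named numeric vector
choice_vec <- unlist(choice)

# function(...) receives every column value of the current row;
# c(...) collects them into one numeric vector before subtracting
rows <- pmap(dist, function(...) c(...) - choice_vec)

# Stack the per-row vectors into a matrix, then convert back to a tibble
result <- as_tibble(do.call(rbind, rows))
result
```

This still builds one small vector per row, so it is likely to be slow on a large tibble.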

4 Comments

Actually, choice is a tibble, but you're right, I didn't think to use purrr...
Question: what does set_names do here?
And ironically, this seems to work perfectly for the first column, but the rest gives me crazy numbers... Maybe it's because choice is a tibble and not a vector?
Even if choice is a vector, it still doesn't work. I tested it by turning choice into a vector with as.numeric()...
1

I had some difficulties because I think names(choice) and distUseful do not match entirely.

I assigned names(choice) to distUseful before the loop:

distUseful <- names(choice)
dist <- df[distUseful]  # df: the source data (presumably the songs2 sample from the question)

Then, solution with a for loop

for (i in 1:nrow(dist)){
        for (j in seq_along(distUseful)){
                dist[i,j]<-dist[i,j]-choice[1,j]
        }
}

This subtracted the values as requested.

dist
# A tibble: 10 x 10
   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness  tempo valence
          <dbl>        <dbl>       <dbl>            <dbl>    <dbl>    <dbl>      <dbl>       <dbl>  <dbl>   <dbl>
 1        0.48        0.195        50938          -0.895    0.271    -4.50         -53     0.0422   14.9    0.442
 2        0.132       0.449        32805          -0.870   -0.0271    0.865        -58     0.002   -48.2    0.758
 3        0.482       0.244        46432          -0.896    0.411    -3.97         -61     0.123   -37.4    0.497
 4       -0.511       0.327       304692          -0.0950   0.02      0.815        -48    -0.00890  -7.05  -0.150
 5       -0.216       0.301        47829          -0.896    0.294     2.09         -63     0.0254  -13.0    0.107
 6        0.485       0.0210       81232          -0.0970   0.127    -3.34         -56    -0.0117  -31.2    0.285
 7        0.481       0.379        77805          -0.896    0.0680   -4.33         -60     0.0078  -49.4    0.295
 8        0.485       0.0710       68778          -0.71     0.087    -1.59         -65    -0.0225  -56.3    0.174
 9        0.485       0.0660       29445           0.0640   0.0410  -10.6          -65     0.0227   -4.99   0.429
10       -0.504       0.168       358909          -0.023   -0.016     1.18         -65    -0.0068   -8.05  -0.073
> 

For loops can be slow in R. Here we have nested for loops that update one cell at a time, which could be a problem with large data frames. A dplyr, *apply(), or data.table solution could be faster.

A faster one-line solution with mapply() (which only loops over columns, with vectorized subtraction of x and y):

data.frame(mapply(function(x,y)x-y, dist, choice))
   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness   tempo valence
1      0.480000        0.195       50938       -0.8954780   0.2710   -4.502        -53      0.0422  14.929  0.4420
2      0.132000        0.449       32805       -0.8696000  -0.0271    0.865        -58      0.0020 -48.158  0.7580
3      0.482000        0.244       46432       -0.8959824   0.4110   -3.972        -61      0.1226 -37.447  0.4970
4     -0.510827        0.327      304692       -0.0950000   0.0200    0.815        -48     -0.0089  -7.050 -0.1498
5     -0.216000        0.301       47829       -0.8957540   0.2940    2.090        -63      0.0254 -12.971  0.1070
6      0.485000        0.021       81232       -0.0970000   0.1270   -3.344        -56     -0.0117 -31.177  0.2850
7      0.481000        0.379       77805       -0.8959984   0.0680   -4.327        -60      0.0078 -49.395  0.2950
8      0.485000        0.071       68778       -0.7100000   0.0870   -1.586        -65     -0.0225 -56.263  0.1740
9      0.485000        0.066       29445        0.0640000   0.0410  -10.591        -65      0.0227  -4.987  0.4290
10    -0.504180        0.168      358909       -0.0230000  -0.0160    1.183        -65     -0.0068  -8.054 -0.0730
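
The `mapply()` call pairs columns by position, so it relies on `dist` and `choice` having their columns in the same order. As a sketch of the dplyr route mentioned above, assuming dplyr 1.0.0 or later, `across()` with `cur_column()` can match the columns by name instead. The tibbles here are toy values invented for illustration:

```r
library(dplyr)

# Toy stand-ins for `dist` and `choice` (invented values for illustration)
dist <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))
choice <- tibble(b = 5, a = 0.5)  # column order deliberately differs from dist

# For each column named in `choice`, subtract the matching single value;
# cur_column() returns the name of the column currently being processed
result <- dist %>%
  mutate(across(all_of(names(choice)), ~ .x - choice[[cur_column()]]))

result
```

Each column is processed as a whole vector, so the work stays vectorized, as in the `mapply()` version.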

5 Comments

Hey, you're right! I put two copies of "duration_ms" in distUseful... Oops
The problem is that seq_along() counts the number of elements in a vector. nrow() is just a number. Just change it to 1:nrow(dist) and choice[1,j].
Ok, that worked, but it's also incredibly slow... Maybe there's a faster way? (Keep in mind I'm doing this with a 174,388-row tibble, as is available here)
I found a quite simple one-liner with mapply, @ÉricoPatto. Edited my answer. This must be faster than the nested for loops.
Hooray! That's the fastest one yet! Just took 0.027s elapsed for the entire tibble! (I just changed the data.frame to as_tibble) Hooray!
1

Playing around with everyone's suggestions, I came up with many ideas. Only one of them worked.

I used a modified version of @Johny's function (and corrected my vector distUseful as @GuedesBF mentioned), used the suggestion not to go for a loop and came up with apply:

amend_row <- function(data) {
  data - as.numeric(choice)
}

dist %>%
  apply(X = ., FUN = amend_row, MARGIN = 1) %>%
  t() %>%
  as_tibble()

This gives me:

> dist %>%
+   apply(X = ., FUN = amend_row, MARGIN = 1) %>%
+   t() %>%
+   as_tibble()
# A tibble: 174,388 x 10
   acousticness danceability duration_ms instrumentalness liveness loudness popularity speechiness  tempo valence
          <dbl>        <dbl>       <dbl>            <dbl>    <dbl>    <dbl>      <dbl>       <dbl>  <dbl>   <dbl>
 1        0.48         0.195       50938          -0.895    0.271    -4.50         -53     0.0422   14.9    0.442
 2        0.132        0.449       32805          -0.870   -0.0271    0.865        -58     0.002   -48.2    0.758
 3        0.482        0.244       46432          -0.896    0.411    -3.97         -61     0.123   -37.4    0.497
 4       -0.511        0.327      304692          -0.0950   0.0200    0.815        -48    -0.00890  -7.05  -0.150
 5       -0.216        0.301       47829          -0.896    0.294     2.09         -63     0.0254  -13.0    0.107
 6        0.485        0.021       81232          -0.0970   0.127    -3.34         -56    -0.0117  -31.2    0.285
 7        0.481        0.379       77805          -0.896    0.068    -4.33         -60     0.0078  -49.4    0.295
 8        0.485        0.071       68778          -0.710    0.087    -1.59         -65    -0.0225  -56.3    0.174
 9        0.485        0.066       29445           0.064    0.0410  -10.6          -65     0.0227   -4.99   0.429
10       -0.504        0.168      358909          -0.0230  -0.016     1.18         -65    -0.0068   -8.05  -0.073
# … with 174,378 more rows

In a ridiculously short amount of time.
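
Because every row gets the same vector subtracted, the row-wise `apply()` plus `t()` can probably be replaced by a single column-wise operation. A minimal sketch with base R's `sweep()`, on made-up toy data rather than the real tibble:

```r
library(tibble)

# Toy stand-ins for `dist` and `choice` (invented values for illustration)
dist <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))
choice <- tibble(a = 0.5, b = 5)

# Reorder the choice values so they line up with the columns of dist
stats <- unlist(choice)[names(dist)]

# MARGIN = 2 applies one value of `stats` to each column of the matrix
swept <- sweep(as.matrix(dist), MARGIN = 2, STATS = stats, FUN = "-")

result <- as_tibble(swept)
result
```

This only works when every column is numeric, which is the case for `dist` here.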

EDIT: Here is the time difference using only the first 1000 rows:

# MY SOLUTION
> dist <- songs2 %>%
+   select(all_of(distUseful)) %>%
+   head(1000)
> system.time(dist %>%
+               apply(X = ., FUN = amend_row, MARGIN = 1) %>%
+               t() %>%
+               as_tibble())
   user  system elapsed 
  0.006   0.000   0.006 
# THE FUNCTION SOLUTION – DIDN'T WORK PROPERLY (last I checked)
> amend_row <- function(amend_vals, ...) {
+   ... - amend_vals
+ }
> system.time(purrr::pmap(dist, ~ amend_row(amend_vals = choice, .)) %>%
+               do.call(what = rbind, args = .) %>%
+               as_tibble() %>% 
+               purrr::set_names(nm = colnames(dist)))
   user  system elapsed 
  1.222   0.016   1.261 
# NOT A LOT OF TIDYVERSE SOLUTION – SLOOOOOWWWWWW
> system.time(for (i in 1:nrow(dist)){
+   for (j in seq_along(distUseful)){
+     dist[i,j]<-dist[i,j]-choice[1,j]
+   }
+ })
   user  system elapsed 
  7.359   0.046   7.482 

2 Comments

Excellent. It would be nice if you could show the differences in time you got with system.time()
The two other solutions didn't even finish. I know it's faster because it's faster when I run it with head() only and because... well, it actually finished. But I'll put the difference with the head(), good idea.
