Create new variable in dataframe based on condition in one column, pulling from other column? (dplyr)

Question

I have the following dataframe:

    df <- structure(list(country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", 
"Congo - Kinshasa", "Ethiopia", "Ethiopia", "Ghana", "Botswana", 
"Nigeria"), CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 
1L), topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE, 
TRUE, TRUE, TRUE, TRUE, TRUE), Main_Commod = c("Gold", "Copper", 
"Nickel", "Gold", "Gold", "Gold", "Gold", "Gold", "Diamonds", 
"Iron Ore")), row.names = c(NA, -10L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "country", drop = TRUE, indices = list(
    8L, 4L, 1L, c(2L, 3L, 5L, 6L), c(0L, 7L), 9L), group_sizes = c(1L, 
1L, 1L, 4L, 2L, 1L), biggest_group_size = 4L, labels = structure(list(
    country = c("Botswana", "Congo - Kinshasa", "Eritrea", "Ethiopia", 
    "Ghana", "Nigeria")), row.names = c(NA, -6L), class = "data.frame", vars = "country", drop = TRUE, .Names = "country"), .Names = c("country", 
"CommodRank", "topCommodInCountry", "Main_Commod"))

df

            country CommodRank topCommodInCountry Main_Commod
1             Ghana          1               TRUE        Gold
2           Eritrea          2              FALSE      Copper
3          Ethiopia          3              FALSE      Nickel
4          Ethiopia          1               TRUE        Gold
5  Congo - Kinshasa          3              FALSE        Gold
6          Ethiopia          1               TRUE        Gold
7          Ethiopia          1               TRUE        Gold
8             Ghana          1               TRUE        Gold
9          Botswana          1               TRUE    Diamonds
10          Nigeria          1               TRUE    Iron Ore

I am trying to add another column showing the top commodity (top CommodRank) for every country in this dataset, but I'm not sure how. I'm able to label 'topcommod' with the 'Main_Commod' where CommodRank == 1, but I want to copy this same value to cases where CommodRank != 1. Looking below, both Ethiopia values at rows 3 & 4 should read 'Gold'.

df %>% mutate(topcommod = ifelse(CommodRank == 1, Main_Commod, 'unknown'))


            country CommodRank topCommodInCountry Main_Commod topcommod
1             Ghana          1               TRUE        Gold      Gold
2           Eritrea          2              FALSE      Copper   unknown
3          Ethiopia          3              FALSE      Nickel   unknown
4          Ethiopia          1               TRUE        Gold      Gold
5  Congo - Kinshasa          3              FALSE        Gold   unknown
6          Ethiopia          1               TRUE        Gold      Gold
7          Ethiopia          1               TRUE        Gold      Gold
8             Ghana          1               TRUE        Gold      Gold
9          Botswana          1               TRUE    Diamonds  Diamonds
10          Nigeria          1               TRUE    Iron Ore  Iron Ore

I'm ideally looking for a dplyr solution I can add to an existing long series of pipe %>% function calls, but any solution would help.

talat · Accepted Answer · 2018-05-17 07:10:52Z

5

IIUC, there are multiple ways to do this, for example:

df %>% mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
                       else Main_Commod[which.max(topCommodInCountry)])

# A tibble: 10 x 5
# Groups:   country [6]
   country          CommodRank topCommodInCountry Main_Commod topCom  
   <chr>                 <int> <lgl>              <chr>       <chr>   
 1 Ghana                     1 TRUE               Gold        Gold    
 2 Eritrea                   2 FALSE              Copper      unknown 
 3 Ethiopia                  3 FALSE              Nickel      Gold    
 4 Ethiopia                  1 TRUE               Gold        Gold    
 5 Congo - Kinshasa          3 FALSE              Gold        unknown 
 6 Ethiopia                  1 TRUE               Gold        Gold    
 7 Ethiopia                  1 TRUE               Gold        Gold    
 8 Ghana                     1 TRUE               Gold        Gold    
 9 Botswana                  1 TRUE               Diamonds    Diamonds
10 Nigeria                   1 TRUE               Iron Ore    Iron Ore
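Note that the code above only works per country because `df` is already grouped (the `dput` output has class `grouped_df`), so the `if`/`else` is evaluated once per group. A minimal sketch, assuming a recent dplyr and an ungrouped data frame (`df_small` is a made-up, cut-down version of the question's data), makes the grouping explicit:

```r
library(dplyr)

# Hypothetical cut-down version of the question's data, not grouped
df_small <- tibble(
  country            = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia"),
  CommodRank         = c(1L, 2L, 3L, 1L),
  topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE),
  Main_Commod        = c("Gold", "Copper", "Nickel", "Gold")
)

result <- df_small %>%
  group_by(country) %>%   # evaluate the if/else once per country
  mutate(topCom = if (any(topCommodInCountry)) {
    # which.max() on a logical vector returns the index of the first TRUE
    Main_Commod[which.max(topCommodInCountry)]
  } else {
    "unknown"
  }) %>%
  ungroup()

result$topCom
```

The single value returned for each group is recycled to every row of that country, which is why the rank-3 Ethiopia row also ends up with the rank-1 commodity.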

Regarding OP's question in the comments about how to handle ties between multiple top commodities, you could do the following:

df %>% 
  mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
              else paste(unique(Main_Commod[topCommodInCountry]), collapse = "/"))

If there are multiple unique top commodities in a country, they will be pasted together into a single string, separated by /.
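As a small illustration of the tie handling, here is a sketch on made-up data (the `ties` data frame and its country names "A" and "B" are hypothetical, and the explicit `group_by()` assumes an ungrouped input):

```r
library(dplyr)

# Hypothetical data: country "A" has two different top commodities,
# country "B" has none
ties <- tibble(
  country            = c("A", "A", "A", "B"),
  topCommodInCountry = c(TRUE, TRUE, FALSE, FALSE),
  Main_Commod        = c("Gold", "Diamonds", "Copper", "Nickel")
)

tied <- ties %>%
  group_by(country) %>%
  mutate(topCom = if (!any(topCommodInCountry)) "unknown"
                  else paste(unique(Main_Commod[topCommodInCountry]),
                             collapse = "/")) %>%
  ungroup()

tied$topCom
```

Every row of "A" should get the combined label and the single row of "B" falls back to "unknown".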

edited May 17, 2018 at 7:10

answered May 16, 2018 at 15:13


2 Comments

dad Over a year ago

Thanks so much! Off the top of your head is there an obvious way to split and label ties here, such that topCom gets assigned to something like "Gold / Diamonds / ..."? (Say there are 2 or more Main_Commods with CommodRank == 1)

dad Over a year ago

Never mind, you just use 'which' instead of 'which.max' to get all row indices, which you can then use to extract the unique names and paste them together:

df %>% mutate(topCom = Main_Commod[which(topCommodInCountry == max(topCommodInCountry))] %>% unique %>% paste(collapse = '/'))

Nate · Accepted Answer · 2018-05-16 15:31:49Z

0

another pattern with dplyr...

df %>% arrange(CommodRank) %>%
    mutate(topCommod = Main_Commod[1])

edited May 16, 2018 at 15:31

answered May 16, 2018 at 15:21


2 Comments

IceCreamToucan Over a year ago

Sorting the entire data frame (group) will be much slower than getting the max of a single column (group)

DJV Over a year ago

In addition to @Ryan's comment, if you don't arrange your dataset correctly, using 'Main_Commod[1]' can be very dangerous/wrong
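To make the two concerns above concrete, here is a rough sketch on made-up data (`ranked` is hypothetical; a recent dplyr is assumed, where `arrange()` sorts the whole data frame and keeps the grouping). After sorting, `Main_Commod[1]` is simply the lowest-ranked commodity in each group, even when no row has rank 1, and `which.min()` gives the same value without sorting:

```r
library(dplyr)

# Hypothetical data: Eritrea has no rank-1 commodity
ranked <- tibble(
  country     = c("Eritrea", "Ethiopia", "Ethiopia"),
  CommodRank  = c(2L, 3L, 1L),
  Main_Commod = c("Copper", "Nickel", "Gold")
)

# Sort-then-take-first pattern: rows are reordered by CommodRank
sorted_first <- ranked %>%
  group_by(country) %>%
  arrange(CommodRank) %>%
  mutate(topCommod = Main_Commod[1]) %>%
  ungroup()

# Same value per country without sorting, original row order kept
no_sort <- ranked %>%
  group_by(country) %>%
  mutate(topCommod = Main_Commod[which.min(CommodRank)]) %>%
  ungroup()

sorted_first
no_sort
```

In both versions Eritrea is labelled with its rank-2 commodity rather than "unknown", so this pattern only matches the accepted answer when every country has a rank-1 row.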

DJV · Accepted Answer · 2018-05-16 16:29:20Z

0

This is not an answer, but while learning a lot from @docendo discimus's answer, it took me a second to understand the negated condition (!any(topCommodInCountry)), and I wondered whether it was only me or whether it would take my computer a second longer too :)

Using the same dataset, I examined the idea of making the if/else positive. First I checked that the two solutions return identical results:

identical(
  #Negative
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)]), 
  #Positive
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else "unknown"))

[1] TRUE

Next, I benchmarked the two:

require(rbenchmark)

benchmark("Negative" = {
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)])
},
"Positive" = {
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else  "unknown")
},
replications = 10000,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self"))

The difference is not that big but I'm assuming that with a bigger dataset it will increase.

      test replications elapsed relative user.self sys.self
1 Negative        10000   12.59    1.015     12.44        0
2 Positive        10000   12.41    1.000     12.30        0

answered May 16, 2018 at 16:29


2 Comments

IceCreamToucan Over a year ago

On my four year old laptop, running for(i in 1:1e6) !TRUE takes about 1/10th of a second. Not worth worrying about.

IceCreamToucan Over a year ago

It may be worth removing unnecessary ! just for readability, but for what it's worth I think it's pretty intuitive if ! is read as "not" i.e. "If not any topCommodInCountry"
