4

I have the following dataframe:

    df <- structure(list(country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", 
"Congo - Kinshasa", "Ethiopia", "Ethiopia", "Ghana", "Botswana", 
"Nigeria"), CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 
1L), topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE, 
TRUE, TRUE, TRUE, TRUE, TRUE), Main_Commod = c("Gold", "Copper", 
"Nickel", "Gold", "Gold", "Gold", "Gold", "Gold", "Diamonds", 
"Iron Ore")), row.names = c(NA, -10L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "country", drop = TRUE, indices = list(
    8L, 4L, 1L, c(2L, 3L, 5L, 6L), c(0L, 7L), 9L), group_sizes = c(1L, 
1L, 1L, 4L, 2L, 1L), biggest_group_size = 4L, labels = structure(list(
    country = c("Botswana", "Congo - Kinshasa", "Eritrea", "Ethiopia", 
    "Ghana", "Nigeria")), row.names = c(NA, -6L), class = "data.frame", vars = "country", drop = TRUE, .Names = "country"), .Names = c("country", 
"CommodRank", "topCommodInCountry", "Main_Commod"))

df

            country CommodRank topCommodInCountry Main_Commod
1             Ghana          1               TRUE        Gold
2           Eritrea          2              FALSE      Copper
3          Ethiopia          3              FALSE      Nickel
4          Ethiopia          1               TRUE        Gold
5  Congo - Kinshasa          3              FALSE        Gold
6          Ethiopia          1               TRUE        Gold
7          Ethiopia          1               TRUE        Gold
8             Ghana          1               TRUE        Gold
9          Botswana          1               TRUE    Diamonds
10          Nigeria          1               TRUE    Iron Ore  

I am trying to add another column showing the top commodity (top CommodRank) for every country in this dataset, but I'm not sure how. I'm able to label 'topcommod' with the 'Main_Commod' where CommodRank == 1, but I want to copy this same value to cases where CommodRank != 1. Looking below, both Ethiopia values at rows 3 & 4 should read 'Gold'.

df %>% mutate(topcommod = ifelse(CommodRank == 1, Main_Commod, 'unknown'))


            country CommodRank topCommodInCountry Main_Commod topcommod
1             Ghana          1               TRUE        Gold      Gold
2           Eritrea          2              FALSE      Copper   unknown
3          Ethiopia          3              FALSE      Nickel   unknown
4          Ethiopia          1               TRUE        Gold      Gold
5  Congo - Kinshasa          3              FALSE        Gold   unknown
6          Ethiopia          1               TRUE        Gold      Gold
7          Ethiopia          1               TRUE        Gold      Gold
8             Ghana          1               TRUE        Gold      Gold
9          Botswana          1               TRUE    Diamonds  Diamonds
10          Nigeria          1               TRUE    Iron Ore  Iron Ore

I'm ideally looking for a dplyr solution I can add to an existing long series of pipe %>% function calls, but any solution would help.

3 Answers 3

5

IIUC, there are multiple ways to do this, for example:

df %>% mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
                       else Main_Commod[which.max(topCommodInCountry)])

# A tibble: 10 x 5
# Groups:   country [6]
   country          CommodRank topCommodInCountry Main_Commod topCom  
   <chr>                 <int> <lgl>              <chr>       <chr>   
 1 Ghana                     1 TRUE               Gold        Gold    
 2 Eritrea                   2 FALSE              Copper      unknown 
 3 Ethiopia                  3 FALSE              Nickel      Gold    
 4 Ethiopia                  1 TRUE               Gold        Gold    
 5 Congo - Kinshasa          3 FALSE              Gold        unknown 
 6 Ethiopia                  1 TRUE               Gold        Gold    
 7 Ethiopia                  1 TRUE               Gold        Gold    
 8 Ghana                     1 TRUE               Gold        Gold    
 9 Botswana                  1 TRUE               Diamonds    Diamonds
10 Nigeria                   1 TRUE               Iron Ore    Iron Ore

Regarding OP's question in comment how to handle ties of multiple top Commodities, you could do the following:

df %>% 
  mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
              else paste(unique(Main_Commod[topCommodInCountry]), collapse = "/"))

If there are multiple unique top Commodities in a country, they will be paste together into a single string, separated by /.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks so much! Off the top of your head is there an obvious way to split and label ties here, such that topCom gets assigned to something like "Gold / Diamonds / ..."? (Say there are 2 or more Main_Commods with CommodRank == 1)
Nevermind you just use 'which' instead of 'which.max' to get all row indices, which you can then access and paste unique names together : df %>% mutate(topCom = Main_Commod[which(topCommodInCountry == max(topCommodInCountry))]) %>% unique %>% paste (sep = '', collapse = '/'))
0

another pattern with dplyr...

df %>% arrange(CommodRank) %>%
    mutate(topCommod = Main_Commod[1])

2 Comments

Sorting the entire data frame (group) will be much slower than getting the max of a single column (group)
In addition to @Ryan comment, if you don't arrange your dataset correctly, doing 'Main_Commod[1]' can be very dangerous/wrong
0

It's not an answer but learning greatly from @docendo discimus answer, it took me a second to understand the "if negative" (!any(topCommodInCountry)), and I was wondering if it's only me or it would take my computer a second more to do that too :)

Using the same dataset I examined the idea of making the if else positive. First I tested for identical between the two solutions:

identical(
  #Negative
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)]), 
  #Positive
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else "unknown"))

[1] TRUE

Next, I tested the benchmark of the two:

require(rbenchmark)

benchmark("Negative" = {
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)])
},
"Positive" = {
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else  "unknown")
},
replications = 10000,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self"))

The difference is not that big but I'm assuming that with a bigger dataset it will increase.

      test replications elapsed relative user.self sys.self
1 Negative        10000   12.59    1.015     12.44        0
2 Positive        10000   12.41    1.000     12.30        0 

2 Comments

On my four year old laptop, running for(i in 1:1e6) !TRUE takes about 1/10th of a second. Not worth worrying about.
It may be worth removing unnecessary ! just for readability, but for what it's worth I think it's pretty intuitive if ! is read as "not" i.e. "If not any topCommodInCountry"

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.