-2

I want to add a new column to a data.frame based on the last entry in a string so that I can do regression analysis. In particular, I am trying to extract information from the Fertiliser variable which contains information on the amount of three different types of fertiliser, separated by hyphens. For instance: 0-0-0 or 30-10-2700 are viable. The last portion of the string is what I need to capture e.g. x-y-z I need z.

I tried but was not able to adapt the answer here How to create new column in dataframe based on partial string matching other column in R to this case.

More complete example of the data:

    Fertiliser millet_biomass millet_yield
 1:      0-0-0       2659.608     710.6942
 2:    0-0-100       2701.044     718.1154
 3:   0-0-2700       3415.879     804.0360
 4:    0-0-300       2781.639     730.5943
 5:    0-0-900       2997.173     760.0136
 6:     12-4-0       3703.255     772.1719
 7:   12-4-100       3720.247     773.1759
 8:  12-4-2700       3950.189     788.6133
 9:   12-4-300       3751.400     775.1368
10:   12-4-900       3826.693     780.2623
11:    30-10-0       4180.323     798.2134
12:  30-10-100       4184.229     798.4918
13: 30-10-2700       4217.044     800.9312
14:  30-10-300       4187.014     798.6570
15:  30-10-900       4194.873     799.2085
16:      6-2-0       3296.274     765.8496
17:    6-2-100       3326.844     767.6693
18:   6-2-2700       3772.058     785.4535
19:    6-2-300       3381.152     760.7330
20:    6-2-900       3517.515     768.3018
21:    90-30-0       4542.924     831.2832
22:  90-30-100       4543.036     831.3983
23: 90-30-2700       4545.037     831.3227
24:  90-30-300       4543.240     831.3921
25:  90-30-900       4543.733     831.3727

Thus, there are five patterns -0$, -100$, -300$, -900$, 2700$, which need to be replaced by 0, 100, 300, 900, 2700

3 Answers 3

3

Is this what you want to do? Let's take a snippet of data:

Fertiliser <- c("0-0-0", "0-0-100", "0-0-2700", "0-0-300")
millet_yield <- c(710, 718, 804, 730)
df <- data.frame(Fertiliser, millet_yield)

df looks like this:

   Fertiliser millet_yield
#1      0-0-0          710
#2    0-0-100          718
#3   0-0-2700          804
#4    0-0-300          730

Using separate() from the tidyr package:

library(tidyr)
df %>% separate(Fertiliser, into=(c("F1", "F2", "Manure")), sep="-", convert=T)

   F1 F2 Manure millet_yield
#1  0  0      0          710
#2  0  0    100          718
#3  0  0   2700          804
#4  0  0    300          730

convert=T makes sure the character strings become numeric. Now you can run a regression on your data.

Sign up to request clarification or add additional context in comments.

4 Comments

Yes, this is exactly what I need. But the columns are not added to the data.frame. How can I do it? I am not used to the operator %>%
Oh, you mean you wanted to keep the original column? Then it's df %>% separate(Fertiliser, into=(c("F1", "F2", "Manure")), sep="-", convert=T, remove=F). All other columns will be kept. The operator %>% is just for chaining functions together in tidyr and dplyr.
No no, the original column is not necessary. The result shows perfect in the console, but not in the data.frame (Environment->Data). It still displays the same variables as before running the line :S
df2 <- df %>% separate(Fertiliser, into=(c("F1", "F2", "Manure")), sep="-", convert=T)
0

We can extract the last numbers with stri_extract_last from stringi. As the dataset is a already a data.table, can use the data.table methods to assign (:=) a new column.

library(data.table)
library(stringi)
setDT(df1)[, Manure := as.numeric(stri_extract_last_regex(Fertiliser, "\\d+"))]
head(df1)
#   Fertiliser millet_biomass millet_yield Manure
#1:      0-0-0       2659.608     710.6942      0
#2:    0-0-100       2701.044     718.1154    100
#3:   0-0-2700       3415.879     804.0360   2700
#4:    0-0-300       2781.639     730.5943    300
#5:    0-0-900       2997.173     760.0136    900
#6:     12-4-0       3703.255     772.1719      0

Or a base R option is

df1$Manure <- as.numeric(sub(".*-", "", df1$Fertiliser))

4 Comments

It works. I really like this style. Can we also use stringi to create a new column with only certain words within a character? i.e, from a column with values "Millet high SOC"; "Millet medium SOC" and "Millet low SOC" I want to create a new column which either contain "high", medium" or "low"
@M.Jimenez You could use setDT(df1)[, NewCol := stri_extract(Col, regex="(?<=\\s)\\w+")]
Concise and effective. Thank you
Oh thank you. I didn't know I could choose only one. Indeed all have being good and I've tried to vote for all :-). I've used your solution in combination with another one.
0

You can easily do this with sub, removing everything up to and including the last hyphen character:

transform(x, Fertiliser = sub('.*-', '', Fertiliser))
##     Fertiliser millet_biomass millet_yield
## 1:           0       2659.608     710.6942
## 2:         100       2701.044     718.1154
## 3:        2700       3415.879     804.0360
## 4:         300       2781.639     730.5943

...

Here, the .* is greedy, so it matches as much as possible before matching the final - character.

You can also rename the resulting column, rather than replacing Fertiliser:

 transform(x, Quantity = sub('.*-', '', Fertiliser))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.