I’m trying to do some preprocessing and want to convert the classe factor values {A,B,C,D,E} to {1,2,3,4,5}.

The classe column is of type factor. All my steps are shown below:

    #get the data
    training <- read.table("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
    training_df <- data.frame(training,stringsAsFactors=FALSE)

    #split to training & test sets (createDataPartition comes from the caret package)
    library(caret)
    inTrain <- createDataPartition(y=training$classe, p=0.75, list=FALSE)
    training_data <- training[inTrain,]
    testing_data <- training[-inTrain,]

    #subset based on columns of interest, based on previous studies
    training_data_subset <- subset(training_data, select=c("avg_roll_belt","var_roll_belt","var_total_accel_belt","amplitude_roll_belt","max_roll_belt","var_roll_belt",
    "var_accel_arm","magnet_arm_x","magnet_arm_y","magnet_arm_z","accel_dumbbell_y","accel_dumbbell_z","magnet_dumbbell_x","gyros_dumbbell_x",
    "gyros_dumbbell_y","gyros_dumbbell_z","pitch_forearm","gyros_forearm_x","gyros_forearm_y","classe"))

    #see which columns are factors; training_data_subset$classe is a factor
    sapply(training_data_subset, class)

#sapply output

       avg_roll_belt        var_roll_belt var_total_accel_belt  amplitude_roll_belt        max_roll_belt
           "numeric"            "numeric"            "numeric"            "numeric"            "numeric" 
     var_roll_belt.1        var_accel_arm         magnet_arm_x         magnet_arm_y         magnet_arm_z 
           "numeric"            "numeric"            "integer"            "integer"            "integer" 
    accel_dumbbell_y     accel_dumbbell_z    magnet_dumbbell_x     gyros_dumbbell_x     gyros_dumbbell_y 
           "integer"            "integer"            "integer"            "numeric"            "numeric" 
    gyros_dumbbell_z        pitch_forearm      gyros_forearm_x      gyros_forearm_y               classe 
           "numeric"            "numeric"            "numeric"            "numeric"             "factor" 

I created a function that attempts to replace A=1,B=2,C=3,D=4,E=5, see below:

factorsToNumeric <- function(data)
{
    data_numeric <- data
    data_numeric$classe <- as.numeric(factor(toupper(as.character(data_numeric$classe))))
    #loop through the data frame based on replace values
    for(i in 1:nrow(data_numeric))
    {
        if ((data_numeric[i,]$classe == "A") || (data_numeric[i,]$classe == "a"))
        {data_numeric[i,]$classe <- "1"}
        else if ((data_numeric[i,]$classe == "B") || (data_numeric[i,]$classe == "b"))
        {data_numeric[i,]$classe <- "2"}
        else if ((data_numeric[i,]$classe == "C") || (data_numeric[i,]$classe == "c"))
        {data_numeric[i,]$classe <- "3"}
        else if ((data_numeric[i,]$classe == "D") || (data_numeric[i,]$classe == "d"))
        {data_numeric[i,]$classe <- "4"}
        else if ((data_numeric[i,]$classe == "E") || (data_numeric[i,]$classe == "e"))
        {data_numeric[i,]$classe <- "5"}
        else
        {
            #do nothing
        }
    }

    return (data_numeric)
}

However, I get this error:

training_data_subset_numeric <- factorsToNumeric(training_data_subset)

Error:

Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
4: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
5: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
6: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
7: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
8: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated
9: In `[<-.factor`(`*tmp*`, iseq, value = "1") :
  invalid factor level, NA generated

Further inspection shows the column "classe"'s class is converted to "numeric":

 sapply(training_data_subset_numeric, class)

       avg_roll_belt        var_roll_belt var_total_accel_belt  amplitude_roll_belt        max_roll_belt
           "numeric"            "numeric"            "numeric"            "numeric"            "numeric"
     var_roll_belt.1        var_accel_arm         magnet_arm_x         magnet_arm_y         magnet_arm_z
           "numeric"            "numeric"            "integer"            "integer"            "integer"
    accel_dumbbell_y     accel_dumbbell_z    magnet_dumbbell_x     gyros_dumbbell_x     gyros_dumbbell_y
           "integer"            "integer"            "integer"            "numeric"            "numeric"
    gyros_dumbbell_z        pitch_forearm      gyros_forearm_x      gyros_forearm_y               classe
           "numeric"            "numeric"            "numeric"            "numeric"            "numeric"

However, head() confirms the problem above: all the values A, B, C, D, E have been incorrectly replaced with NA.

  • You could try the following approach: training_data_subset$classe <- as.numeric(factor(toupper(as.character(training_data_subset$classe)))) Commented Jan 3, 2015 at 0:55
  • @docendo discimus, I added the correct variation of your suggestion at the beginning of the factorsToNumeric() function and it worked, see edited post. Commented Jan 3, 2015 at 7:40
  • You misunderstood. That code should replace the function. Commented Jan 3, 2015 at 7:58

2 Answers

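Factors don't work like that. Unlike other data types, you can't assign arbitrary values with simple <- assignment: a factor only accepts values that are already among its levels, so assigning "1" to a factor whose levels are A to E produces NA and the "invalid factor level" warning. There are a few different ways you can change a factor. Here's one way using the levels<- replacement function.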
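As an aside on why the loop in the question fails, here is a minimal sketch that reproduces the warning on a toy factor. The vectors f, g and h are made up for illustration and are not part of the original data.

```r
# Hypothetical toy factor standing in for training$classe
f <- factor(c("A", "B", "C"))

# "1" is not one of the levels (A, B, C), so R stores NA and warns
# "invalid factor level, NA generated"
g <- f
g[1] <- "1"
is.na(g[1])

# Assigning a value that is already a level works as expected
h <- f
h[1] <- "B"
as.character(h)
```

The first assignment is the same situation as data_numeric[i,]$classe <- "1" in the question, which is why every replaced row ends up as NA.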

Here's a sample from your enormous data that took way too long to read :) For this data it's easy because the levels are in the right sequential order already.

set.seed(2)
x <- sample(training$classe, 20)
x
# [1] A D C A E E A E B C C A D A B E E A B A
# Levels: A B C D E
levels(x) <- 1:5
x
# [1] 1 4 3 1 5 5 1 5 2 3 3 1 4 1 2 5 5 1 2 1
# Levels: 1 2 3 4 5

So there's no need for your function. You can simply do

levels(training$classe) <- 1:5

and we can see the str of the new column shows the changed values

str(training$classe)
# Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

Note that for this simple case, as.integer(training$classe) also works, because it returns the factor's internal integer codes and the levels are already in the right order. It won't be that easy most of the time.
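For the harder case where the levels are not already in the wanted order, one possible approach is an explicit lookup vector. This is only a sketch: the factor f and the mapping codes are hypothetical and chosen just to show the idea.

```r
# Hypothetical factor whose levels are NOT in the desired numeric order
f <- factor(c("low", "high", "mid", "low"))
levels(f)                  # sorted alphabetically: "high" "low" "mid"

# as.integer() returns the internal codes, which follow the level order
internal <- as.integer(f)  # 2 1 3 2

# Explicit lookup: name each label with the number it should map to
codes <- c(low = 1L, mid = 2L, high = 3L)
mapped <- unname(codes[as.character(f)])  # 1 3 2 1
```

Indexing by as.character(f) rather than by f itself matters here, since indexing with a factor would use its internal integer codes instead of its labels.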


If you want to convert the classe column of training_data_subset you don't need to define your own function. You can use the LETTERS vector:

sapply(training_data_subset[,'classe'], function(x) which(LETTERS==x))
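A vectorized variant of the same idea, as a sketch, could use match() against LETTERS instead of calling which() once per element. The classe vector below is a made-up stand-in for training_data_subset$classe.

```r
# Hypothetical stand-in for training_data_subset$classe
classe <- factor(c("A", "D", "C", "A", "E"))

# match() returns the position of each value in LETTERS (A = 1, B = 2, ...)
classe_num <- match(as.character(classe), LETTERS)
classe_num   # 1 4 3 1 5
```

Values that are not uppercase letters come back as NA from match(), so wrapping the input in toupper() would be needed if lowercase labels are possible.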
