3

I am relatively new to R. My question isn't entirely as straightforward as the title. This is a sample of what df looks like:

id    amenities
1     wireless internet, air conditioning, pool, kitchen
2     pool, kitchen, washer, dryer
3     wireless internet, kitchen, dryer
4     
5     wireless internet

this is what i want df to look like:

id    wireless internet   air conditioning   pool   kitchen   washer   dryer
1     1                   1                  1      1         0        0
2     0                   0                  1      1         1        1
3     1                   0                  0      1         0        1
4     0                   0                  0      0         0        0
5     1                   0                  0      0         0        0

sample code to reproduce data

df <- data.frame(id = c(1, 2, 3, 4, 5),
      amenities = c("wireless internet, air conditioning, pool, kitchen",  
                    "pool, kitchen, washer, dryer", 
                    "wireless internet, kitchen, dryer", 
                    "", 
                    "wireless internet"), 
      stringsAsFactors = FALSE)

4 Answers 4

6

FWIW, here's a base R approach (assuming that df contains your data as shown in the question)

dat <- with(df, strsplit(amenities, ', '))
df2 <- data.frame(id = factor(rep(df$id, times = lengths(dat)),
                              levels = df$id),
                  amenities = unlist(dat))
df3 <- as.data.frame(cbind(id = df$id,
                     table(df2$id, df2$amenities)))

This results in

> df3
  id air conditioning dryer kitchen pool washer wireless internet
1  1                1     0       1    1      0                 1
2  2                0     1       1    1      1                 0
3  3                0     1       1    0      0                 1
4  4                0     0       0    0      0                 0
5  5                0     0       0    0      0                 1

Breaking down what is going on:

  1. dat <- with(df, strsplit(amenities, ', ')) splits the amenities variable on ', ', resulting in

    > dat
    [[1]]
    [1] "wireless internet" "air conditioning"  "pool"             
    [4] "kitchen"          
    
    [[2]]
    [1] "pool"    "kitchen" "washer"  "dryer"  
    
    [[3]]
    [1] "wireless internet" "kitchen"           "dryer"            
    
    [[4]]
    character(0)
    
    [[5]]
    [1] "wireless internet"
    
  2. The second line takes dat and turns it into a vector, and we add on and id column by repeating the origina id values as many times as the number of amenities for that id. This results in

    > df2
       id         amenities
    1   1 wireless internet
    2   1  air conditioning
    3   1              pool
    4   1           kitchen
    5   2              pool
    6   2           kitchen
    7   2            washer
    8   2             dryer
    9   3 wireless internet
    10  3           kitchen
    11  3             dryer
    12  5 wireless internet
    
  3. Use the table() function to create the contingency table and then we add on an id column.

Sign up to request clarification or add additional context in comments.

1 Comment

I had a slight variation - df2 <- stack(setNames(dat,seq_along(dat)))[2:1] and then cbind(df["id"], unclass(table(df2)))
3

A solution using dplyr and tidyr. Notice that I replace "" with None because it is easier to process the column names later.

library(dplyr)
library(tidyr)

df2 <- df %>%
  separate_rows(amenities, sep = ",") %>%
  mutate(amenities = ifelse(amenities %in% "", "None", amenities)) %>%
  mutate(value = 1) %>%
  spread(amenities, value , fill = 0) %>%
  select(-None)
df2
#   id  air conditioning  dryer  kitchen  pool  washer pool wireless internet
# 1  1                 1      0        1     1       0    0                 1
# 2  2                 0      1        1     0       1    1                 0
# 3  3                 0      1        1     0       0    0                 1
# 4  4                 0      0        0     0       0    0                 0
# 5  5                 0      0        0     0       0    0                 1

Comments

0

The dummies package can be useful here. Try

library(dplyr); library(tidyr); library(dummies)
df2 <- df %>% separate_rows(amenities, sep = ",")
df2$amenities <- trimws(df2$amenities, "both") # remove spaces (left and right) - so that you will not have 2 "pool" columns in your final data frame
df2 <- dummy.data.frame(df2)[, -2]
colnames(df2) <- trimws(gsub("amenities", "", colnames(df2)), "both") # arrange colnames
df3 <- df2 %>% 
  group_by(id) %>%
  summarise_all(funs(sum)) ## aggregate by column and id
df3

# A tibble: 5 x 7
#id `air conditioning` dryer kitchen  pool washer `wireless internet`
#<dbl>              <int> <int>   <int> <int>  <int>               <int>
#     1                  1     0       1     1      0                   1
#     2                  0     1       1     1      1                   0
#     3                  0     1       1     0      0                   1
#     4                  0     0       0     0      0                   0
#     5                  0     0       0     0      0                   1

Comments

0

For the sake of completeness, here is also a concise data.table solution:

library(data.table)
setDT(df)[, strsplit(amenities, ", "), by = id][
  , dcast(.SD, id ~ V1, length)]
   id air conditioning dryer kitchen pool washer wireless internet
1:  1                1     0       1    1      0                 1
2:  2                0     1       1    1      1                 0
3:  3                0     1       1    0      0                 1
4:  5                0     0       0    0      0                 1

After coercion to data.table, amenities is split by ", " into separate rows for each item (long format). This is then reshaped into wide format, using the length() function to aggregate.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.