I have an R data frame with movies from IMDB.

(Here is the CSV file: http://had.co.nz/data/movies/movies.tab.gz)

Genres are encoded as binary (0/1) indicator columns:

$ Action      (int) 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,...
$ Animation   (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Comedy      (int) 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,...
$ Drama       (int) 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,...
$ Documentary (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Romance     (int) 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
$ Short       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...

Is there an elegant, R-native way to convert these binary columns into a string such as "Comedy, Romance", stored in the same data frame?

Thank you in advance for your help!

Comment: Please show a small reproducible example and the expected output instead of a .gz file. (Commented Jan 8, 2016 at 12:25)

3 Answers

I think this is what you want.

# Create some toy data like yours
set.seed(1)
n <- 5
ds <- as.data.frame(replicate(7, sample(0:1, n, replace = TRUE)))
names(ds) <- c("Action", "Animation", "Comedy", "Drama",
                "Documentary", "Romance", "Short")
print(ds)
#  Action Animation Comedy Drama Documentary Romance Short
#1      0         1      0     0           1       0     0
#2      0         1      0     1           0       0     1
#3      1         1      1     1           1       0     0
#4      1         1      0     0           0       1     0
#5      0         0      1     1           0       0     1

# Use each row as indicator vector
apply(ds, 1, function(r) paste(names(ds)[as.logical(r)], collapse = ", "))
#[1] "Animation, Documentary"                       
#[2] "Animation, Drama, Short"                      
#[3] "Action, Animation, Comedy, Drama, Documentary"
#[4] "Action, Animation, Romance"                   
#[5] "Comedy, Drama, Short" 
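
Since the question asks for the string in the same data frame, here is a minimal sketch of assigning the result to a new column while ignoring non-genre columns. The `movies` data frame, the `genre_cols` vector, and the `genres` column name are made up for illustration and are not from the original data set.

```r
# Toy data: one non-genre column plus three 0/1 genre indicators (made up)
movies <- data.frame(
  title  = c("A", "B", "C"),
  Action = c(0L, 1L, 0L),
  Comedy = c(1L, 1L, 0L),
  Drama  = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)
genre_cols <- c("Action", "Comedy", "Drama")

# Turn only the indicator columns into a logical matrix, then paste
# together the names of the columns that are TRUE in each row
flags <- as.matrix(movies[genre_cols]) == 1
movies$genres <- apply(flags, 1, function(r) paste(genre_cols[r], collapse = ", "))

movies$genres
```

A row with no genre set ends up as an empty string here, which you may prefer to replace with `NA`.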

Here is another option using data.table

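library(data.table)
library(reshape2)
# Melting the matrix gives one row per (row, genre) cell in columns Var1, Var2, value;
# keep the cells that are set, then collapse the genre names within each original row
setDT(reshape2::melt(as.matrix(ds)))[value != 0][, toString(Var2), by = Var1]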
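
Filtering on `value != 0` drops any row that has no genre at all, so the result can have fewer rows than the input. One possible way around that, sketched here with `data.table` only, is to carry an explicit row id and join the collapsed strings back. The toy table and the names `id`, `genre`, `per_row`, and `genres` are assumptions for illustration.

```r
library(data.table)

# Toy indicator table (made up); the third row has no genre set
dt <- data.table(
  Action = c(0L, 1L, 0L),
  Comedy = c(1L, 1L, 0L),
  Drama  = c(0L, 1L, 0L)
)
genre_cols <- c("Action", "Comedy", "Drama")
dt[, id := .I]  # explicit row id to join on later

# Long form: one row per (id, genre) pair
long <- melt(dt, id.vars = "id", measure.vars = genre_cols,
             variable.name = "genre")

# Collapse the genres that are set into one string per id
per_row <- long[value == 1L, .(genres = toString(genre)), keyby = id]

# Update join: rows without any genre keep NA in the new column
dt[per_row, genres := i.genres, on = "id"]
dt
```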

I'd also opt for data.table:

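library(readr)
library(data.table)
# Read the tab-separated file
dt <- read_tsv("http://had.co.nz/data/movies/movies.tab.gz")
# Reshape to long form (columns 1:17 are the non-genre columns),
# keep only the genre flags equal to 1, and key the result by title
dt <- setkey(melt(setDT(dt), id.vars=1:17)[value==1], "title")
# Collect each title's genres into a list column, join it back by title,
# drop the helper columns, and remove the resulting duplicate rows
(dt <- unique(dt[dt[, .(categories=list(variable)), by=title]][, c("variable", "value"):=NULL]))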
#                          title year length   budget rating votes   r1   r2  r3   r4   r5   r6   r7   r8   r9  r10  mpaa      categories
#     1:                       $ 1971    121       NA    6.4   348  4.5  4.5 4.5  4.5 14.5 24.5 24.5 14.5  4.5  4.5    NA    Comedy,Drama
#     2:       $1000 a Touchdown 1939     71       NA    6.0    20  0.0 14.5 4.5 24.5 14.5 14.5 14.5  4.5  4.5 14.5    NA          Comedy
#     3:  $21 a Day Once a Month 1941      7       NA    8.2     5  0.0  0.0 0.0  0.0  0.0 24.5  0.0 44.5 24.5 24.5    NA Animation,Short
#     4:                 $40,000 1996     70       NA    8.2     6 14.5  0.0 0.0  0.0  0.0  0.0  0.0  0.0 34.5 45.5    NA          Comedy
#     5:                   $pent 2000     91       NA    4.3    45  4.5  4.5 4.5 14.5 14.5 14.5  4.5  4.5 14.5 14.5    NA           Drama
#    ---                                                                                                                                 
# 44177:                  sIDney 2002     15       NA    7.0     8 14.5  0.0 0.0 14.5  0.0  0.0 24.5 14.5 14.5 24.5    NA    Action,Short
# 44178:               tom thumb 1958     98       NA    6.5   274  4.5  4.5 4.5  4.5 14.5 14.5 24.5 14.5  4.5  4.5    NA       Animation
# 44179:             www.XXX.com 2003    105       NA    1.1    12 45.5  0.0 0.0  0.0  0.0  0.0 24.5  0.0  0.0 24.5    NA   Drama,Romance
# 44180:                     xXx 2002    132 85000000    5.5 18514  4.5  4.5 4.5  4.5 14.5 14.5 14.5 14.5  4.5  4.5 PG-13          Action
# 44181: xXx: State of the Union 2005    101 87000000    3.9  1584 24.5  4.5 4.5  4.5  4.5 14.5  4.5  4.5  4.5 14.5 PG-13          Action

You may want to keep the categories column as a list of vectors rather than a single string, so that it stays easy to process:

head(dt$categories, 2)
# [[1]]
# [1] Comedy Drama 
# Levels: Action Animation Comedy Drama Documentary Romance Short
# 
# [[2]]
# [1] Comedy
# Levels: Action Animation Comedy Drama Documentary Romance Short
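
If you later need the plain string after all, a list column like this can be collapsed on demand. The sketch below uses a small hand-built list as a stand-in for `dt$categories`; the names `lv`, `categories`, `labels`, and `has_drama` are hypothetical.

```r
# Stand-in for a list column of factor vectors (made up for illustration)
lv <- c("Action", "Animation", "Comedy", "Drama",
        "Documentary", "Romance", "Short")
categories <- list(
  factor(c("Comedy", "Drama"), levels = lv),
  factor("Comedy", levels = lv)
)

# Collapse each element into a single comma-separated string
labels <- vapply(categories, function(x) toString(as.character(x)), character(1))

# The list form also makes membership tests straightforward
has_drama <- vapply(categories, function(x) "Drama" %in% x, logical(1))

labels
has_drama
```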
