R - Handling multiple values as one string in a single variable

Question

In a data.frame, I have a categorical variable for the language of a text. While most texts are in only one language, some are in several. In my data, these appear in the same column, separated by commas:

text = c("Text1", "Text2", "Text3")
lang = c("fr", "en", "fr,en")
d = data.frame(text, lang)

Visually:

   text  lang
1 Text1    fr
2 Text2    en
3 Text3 fr,en

I'd like to plot the number of texts in each language, with Text3 being counted both in fr and in en.

I found how to split, with:

d$lang <- strsplit(d$lang, ",")

But then I can't find a way to plot it correctly, e.g. with a qplot barplot like this one:

qplot(lang, data=d)

Am I doing it right? Is there a better approach?

You can't pass a list to qplot like that, and its default plot is a scatter plot. Try qplot(x=unlist(strsplit(as.character(d$lang), ",")), geom="bar") or, for a non-ggplot answer, barplot(table(unlist(strsplit(as.character(d$lang), ","))))
– user20650, Commented May 2, 2015 at 0:49
Thanks a lot. Is there a way to use unlist while keeping the other columns of the data? In the above example, say I also have a third column that I want to keep aligned with lang: is there a way? Maybe by duplicating observations?
– iNyar, Commented May 2, 2015 at 1:40
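
One possible way to do what the last comment asks, in base R only: repeat each row once per language code, so that every other column stays aligned with lang. This is a minimal sketch on the question's example data, not the method of any of the answers; the names parts, d_long and counts are illustrative.

```r
# Same example data as in the question
d <- data.frame(
  text = c("Text1", "Text2", "Text3"),
  lang = c("fr", "en", "fr,en"),
  stringsAsFactors = FALSE
)

# Split each lang string into its codes (a list, one character vector per row)
parts <- strsplit(d$lang, ",", fixed = TRUE)

# Repeat each row once per code so the other columns stay aligned
d_long <- d[rep(seq_len(nrow(d)), lengths(parts)), , drop = FALSE]
d_long$lang <- unlist(parts)
rownames(d_long) <- NULL

# Count texts per language; Text3 contributes to both fr and en
counts <- table(d_long$lang)
```

From there, barplot(counts) or qplot(lang, data = d_long) should draw one bar per language.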

Steven Beaupré · Accepted Answer · 2015-05-02 03:05:05Z

6

You could try:

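library(splitstackshape)
library(ggplot2)  # provides qplot()

# Split lang on commas and stack the pieces into one row per language
dl <- cSplit(d, "lang", ",", "long")
qplot(lang, data = dl)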


1 Comment

iNyar Over a year ago

Now I get it... ! (after reading the splitstackshape documentation :-)) That package is perfect: thanks a lot! Indeed, what I needed was: cSplit(d, "lang"), which is the same as cSplit(d, "lang", ",", "wide")

Richard Ambler · Accepted Answer · 2015-05-02 03:26:53Z

3

Unless you follow the suggestion in user20650's comment, you probably won't get away without restructuring your data, and how you do that depends on how the data happens to be stored. For example, if you know that the languages are represented by distinct two-character strings (so that no code other than "fr" contains the sequence "fr"), you could create new boolean columns by searching for each code in the comma-separated string. For example:

# Data
text = c("Text1", "Text2", "Text3", "Text4", "Text5")
lang = c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d = data.frame(text, lang, stringsAsFactors = FALSE)

# Get a vector of the languages that exist
languages <- unique(unlist(strsplit(d$lang, ",")))

# Create a new column for each language
for (language in languages) d[[language]] <- grepl(language, d$lang)

# An example bar-plot
barplot(colSums(d[, -c(1, 2)]))
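
If the assumption about distinct codes does not hold, a substring search can over-count. The sketch below uses hypothetical data in which one code ("fra") contains another ("fr"), and compares grepl() with an exact match against the split elements; the names parts and substring_hits are illustrative only.

```r
# Hypothetical data: "fra" contains "fr", so a substring search over-counts
d <- data.frame(
  text = c("Text1", "Text2", "Text3"),
  lang = c("fr", "fra,en", "en"),
  stringsAsFactors = FALSE
)

parts <- strsplit(d$lang, ",", fixed = TRUE)
languages <- unique(unlist(parts))

# Substring match: "fr" is also found inside "fra"
substring_hits <- grepl("fr", d$lang, fixed = TRUE)

# Exact match: TRUE only when the code is one of the row's split elements
for (language in languages) {
  d[[language]] <- vapply(parts, function(p) language %in% p, logical(1))
}
```

Here substring_hits flags Text2 as French, while the exact-match column d$fr does not.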


2 Comments

iNyar Over a year ago

Thanks a lot. It took me some more time to fully understand your answer (because of my very basic R understanding), but now that I do, that's exactly what I needed. :)

rocknRrr Over a year ago

@Richard Ambler This is very, VERY useful code, and amazing!!! I have a question about the line for (language in languages) d[[language]] <- grepl(language, d$lang): how does for (language in languages) work? Does it create a number of new columns based on the length of the pattern vector, which is languages here? I have always used for (i in 1:n), and I wonder how for (list in list) works. Also, how can we use language as the pattern for grepl(), when the language columns have not been created yet but language is already used in grepl(language, d$lang)?
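
A short sketch that may help with the question above: for () iterates over the elements of the vector, so language holds a string such as "fr" on each pass. grepl() only needs that string as its pattern, and the assignment to d[[language]] is what creates the column. The index-based second loop and the name d2 are added here only for comparison.

```r
d <- data.frame(lang = c("fr", "en", "fr,en"), stringsAsFactors = FALSE)
languages <- c("fr", "en")

# for () walks over the elements themselves, not over indices:
# language is "fr" on the first pass and "en" on the second
for (language in languages) {
  # The pattern is just the string held in `language`;
  # assigning to d[[language]] creates the column of that name
  d[[language]] <- grepl(language, d$lang)
}

# The index-based loop below does the same thing
d2 <- data.frame(lang = c("fr", "en", "fr,en"), stringsAsFactors = FALSE)
for (i in seq_along(languages)) {
  d2[[languages[i]]] <- grepl(languages[i], d2$lang)
}
```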

wibeasley · Accepted Answer · 2015-05-02 02:41:30Z

Consider tidyr::separate() to split and tidyr::gather() to make it long.

library(magrittr)
max_languages <- 2L  # The max language count of any single text (avoids masking base::ceiling)
language_positions <- paste0("language_", seq_len(max_languages))

d %>%
  tidyr::separate("lang", language_positions, sep=",", extra="merge") %>%
  tidyr::gather_("ordinal", "language_name", language_positions) %>%
  dplyr::filter(!is.na(language_name))

The resulting long dataset is:

   text    ordinal language_name
1 Text1 language_1            fr
2 Text2 language_1            en
3 Text3 language_1            fr
4 Text3 language_2            en

If you want to break it into two smaller steps, separate() first creates a wide dataset,

> d_wide <- d %>%
+   tidyr::separate("lang", language_positions, sep=",", extra="merge")
> d_wide
   text language_1 language_2
1 Text1         fr       <NA>
2 Text2         en       <NA>
3 Text3         fr         en

...and then gather() converts it to tall.

d_long <- d_wide %>%
  tidyr::gather_("ordinal", "language_name", language_positions) %>%
  dplyr::filter(!is.na(language_name))

For other reasons, I suggest adding stringsAsFactors=FALSE when you define d, although tidyr's separate functions don't seem to mind either way. The qplot call stays almost the same: qplot(language_name, data=d_long).
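
The hard-coded maximum language count used above could instead be derived from the data. A minimal base R sketch, assuming lang is a character column; the names codes_per_text and max_languages are illustrative.

```r
d <- data.frame(
  text = c("Text1", "Text2", "Text3"),
  lang = c("fr", "en", "fr,en"),
  stringsAsFactors = FALSE
)

# Number of comma-separated codes in each row
codes_per_text <- lengths(strsplit(d$lang, ",", fixed = TRUE))

# Largest count in the data; plays the role of the hard-coded 2L
max_languages <- max(codes_per_text)
language_positions <- paste0("language_", seq_len(max_languages))
```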
