7

In a data.frame, I have a categorical variable for the language of a text. While most texts are in only one language, some have multiple languages. In my data, these appear in the same column, separated by commas:

text = c("Text1", "Text2", "Text3")
lang = c("fr", "en", "fr,en")
d = data.frame(text, lang)

Visually:

   text  lang
1 Text1    fr
2 Text2    en
3 Text3 fr,en

I'd like to plot the number of texts in each language, with Text3 being counted both in fr and in en.

I found how to split, with:

d$lang <- strsplit(d$lang, ",")

But then I can't find a way to plot it correctly, e.g. with a qplot barplot like this one:

qplot(lang, data=d)

Am I doing it right? Is there a better approach?

2
  • You can't pass a list to qplot like that, and its default plot is a scatter plot. Try qplot(x=unlist(strsplit(as.character(d$lang), ",")), geom="bar") or, for a non-ggplot answer, barplot(table(unlist(strsplit(as.character(d$lang), ",")))). Commented May 2, 2015 at 0:49
  • Thanks a lot. Is there a way to use unlist while maintaining other columns of data? In the above example, let's say I also have a third column which I want to keep aligned with lang, is there a way? Maybe by duplicating observations? Commented May 2, 2015 at 1:40
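
One way to keep other columns aligned, as the second comment asks, is to repeat each row once per language it lists. The following is only a minimal base R sketch; the third column `year` and its values are hypothetical and not part of the original data.

```r
# Example data; `year` is a hypothetical extra column added for illustration
d <- data.frame(
  text = c("Text1", "Text2", "Text3"),
  lang = c("fr", "en", "fr,en"),
  year = c(2013, 2014, 2015),
  stringsAsFactors = FALSE
)

# Split each lang entry into its individual codes (a list, one element per row)
parts <- strsplit(d$lang, ",", fixed = TRUE)

# Repeat each row once per code, then overwrite lang with the single codes
d_long <- d[rep(seq_len(nrow(d)), lengths(parts)), ]
d_long$lang <- unlist(parts)
rownames(d_long) <- NULL

# Count texts per language; Text3 contributes to both "en" and "fr"
counts <- table(d_long$lang)
```

With the data in this shape, barplot(counts) or qplot(lang, data = d_long) should count Text3 under both languages while `year` stays attached to each duplicated row.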

3 Answers

6

You could try:

library(splitstackshape)
dl <- cSplit(d, "lang", ",", "long")
qplot(lang, data = dl)

1 Comment

Now I get it... ! (after reading the splitstackshape documentation :-)) That package is perfect: thanks a lot! Indeed, what I needed was: cSplit(d, "lang"), which is the same as cSplit(d, "lang", ",", "wide")
3

Without following the suggestion in user20650's comment, you probably won't be able to get away without restructuring your data, and how you do that cannot be blind to the way the data is arbitrarily stored. For example, if you know that the languages are represented by distinct, two-character strings (so that, for example, any language representation that isn't "fr" does not contain the sequence "fr"), you could create new boolean columns based on searches for the codes in the comma-separated representation. For example:

# Data
text = c("Text1", "Text2", "Text3", "Text4", "Text5")
lang = c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d = data.frame(text, lang, stringsAsFactors = FALSE)

# Get a vector of the languages that exist
languages <- unique(unlist(strsplit(d$lang, ",")))

# Create a new column for each language
for (language in languages) d[[language]] <- grepl(language, d$lang)

# An example bar-plot
barplot(colSums(d[, -c(1, 2)]))
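
If the assumption that no code contains another cannot be guaranteed, a possible variant is to test membership in the split codes instead of searching the raw string. This is a sketch of that alternative, not the method used above; it swaps grepl() for %in% on the result of strsplit().

```r
# Same example data as above
text <- c("Text1", "Text2", "Text3", "Text4", "Text5")
lang <- c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d <- data.frame(text, lang, stringsAsFactors = FALSE)

# Split once and reuse: a list with one character vector of codes per row
parts <- strsplit(d$lang, ",", fixed = TRUE)
languages <- unique(unlist(parts))

# One logical column per language: TRUE only on an exact match
# with one of the row's codes, never on a partial match
for (language in languages) {
  d[[language]] <- vapply(parts, function(p) language %in% p, logical(1))
}

# Number of texts per language
counts <- colSums(d[, languages])
```

Here `counts` can be passed to barplot() exactly as before. Selecting the columns by name with d[, languages] also avoids relying on the column positions.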

2 Comments

Thanks a lot. It took me some more time to fully understand your answer (because of my very basic R understanding), but now that I do, that's exactly what I needed. :)
@Richard Ambler This is very, VERY useful code, and amazing! I have a question about the line for (language in languages) d[[language]] <- grepl(language, d$lang): how does for (language in languages) work? Does it create one new column per element of languages? I have always used for (i in 1:n) and wonder how looping over a vector directly works. Also, how can language be used as the pattern for grepl() when the language columns have not been created yet?
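
To illustrate the loop asked about in the last comment: in R, for iterates over the elements of a vector, so the loop variable holds each language code in turn. The small sketch below uses cut-down data, and the `seen` vector is a hypothetical helper added only to record what the loop variable contained.

```r
languages <- c("fr", "en")
d <- data.frame(lang = c("fr", "en", "fr,en"), stringsAsFactors = FALSE)

seen <- character(0)
for (language in languages) {
  # On each pass, `language` is one element of `languages`: first "fr", then "en"
  seen <- c(seen, language)
  # grepl() searches the existing d$lang column for that string, so the
  # pattern does not depend on any new column. The assignment to
  # d[[language]] then creates a column named after the string.
  d[[language]] <- grepl(language, d$lang)
}

names(d)  # "lang" "fr" "en"
```

The same loop could be written as for (i in seq_along(languages)) with languages[i] in place of language; iterating over the values directly is just shorter.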
1

Consider tidyr::separate() to split and tidyr::gather() to make it long.

library(magrittr)
ceiling <- 2L #The max language count of any single text
language_positions <- paste0("language_", seq_len(ceiling))

d %>%
  tidyr::separate("lang", language_positions, sep=",", extra="merge") %>%
  tidyr::gather_("ordinal", "language_name", language_positions) %>%
  dplyr::filter(!is.na(language_name))

The resulting long dataset is:

   text    ordinal language_name
1 Text1 language_1            fr
2 Text2 language_1            en
3 Text3 language_1            fr
4 Text3 language_2            en

If you want to break it into two smaller steps, separate() first creates a wide dataset,

> d_wide <- d %>%
+   tidyr::separate("lang", language_positions, sep=",", extra="merge")
> d_wide
   text language_1 language_2
1 Text1         fr       <NA>
2 Text2         en       <NA>
3 Text3         fr         en

...and then gather() converts it to tall.

d_long <- d_wide %>%
  tidyr::gather_("ordinal", "language_name", language_positions) %>%
  dplyr::filter(!is.na(language_name))

For other reasons, I suggest adding , stringsAsFactors=F when you define d, but tidyr's separate functions don't seem to mind. The qplot call can remain the same: qplot(language_name, data=d_long).
