R: split string & assign variable based on split

Question

I have a single field of semantic tags & semantic tag types. Each tag type/tag is comma-separated, while each tag type & tag are colon separated (see below).

ID | Semantic Tags

1  |   Person:mitch mcconnell, Person:ashley judd, Position:senator

2  |   Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 

3  |   Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 

4  |   Person:ashley judd, topicname:politics

5  |   URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc

I want to split each tag type (term before colon) & tag (term after colon) into two separate fields: "Tag Type" & "Tag". The resulting file should look something like this:

ID | Tag Type  |  Tag

1  |  Person   |  mitch McConnell

1  |  Person   |  ashley judd  

1  |  Position |  senator

2  |  Person   |  mitch McConnell

2  |  Position |  senator

2  |  State    |  kentucky

Here is the code I have so far...

tag<-strsplit(as.character(emtable$Semantic.Tags),",")
tagtype<-strsplit(as.character(tag),":")

But after that, I'm lost! I believe I need to use lapply or sapply for this, but am not sure where that plays in...

My apologies if this has been answered in other forms on the site -- I am new to R & this is still a bit complex for me.

Thanks in advance for anyone's help.

Could you please provide a reproducible example using dput(emtable) (or dput(head(emtable)) if that is too much data)?
– David Robinson, Commented Apr 9, 2013 at 15:03
I've reformatted the data to look like their tabular layout.
– NiuBiBang, Commented Apr 9, 2013 at 15:18
Why didn't you just use dput? It makes it easier on answerers.
– David Robinson, Commented Apr 9, 2013 at 15:21

Tyler Rinker · Accepted Answer · 2013-04-17 20:17:32Z

4

This is another (slightly different) approach:

## dat <- readLines(n=5)
## Person:mitch mcconnell, Person:ashley judd, Position:senator
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## Person:ashley judd, topicname:politics
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info

# split each row on commas and trim surrounding whitespace
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
# optional: keep only some tag types
# dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)])
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) # break on ":" not followed by "/"
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)),
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))
)

colnames(dat3)[-1] <- c("Tag Type", "Tag")

##    ID        Tag Type                                 Tag
## 1   1          Person                     mitch mcconnell
## 2   1          Person                         ashley judd
## 3   1        Position                             senator
## 4   2          Person                     mitch mcconnell
## 5   2        Position                             senator
## 6   2 ProvinceOrState                            kentucky
## 7   2       topicname                            politics
## 8   3          Person                     mitch mcconnell
## 9   3          Person                         ashley judd
## 10  3    Organization                              senate
## 11  3    Organization                          republican
## 12  4          Person                         ashley judd
## 13  4       topicname                            politics
## 14  5             URL              www.huffingtonpost.com
## 15  5             URL http://www.regular-expressions.info
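
To illustrate why the split pattern above is `:(?!/)` rather than a bare `:`, here is a minimal sketch comparing the two. The vector `x` and the names `plain` and `guarded` are made up for illustration. The lookahead `(?!/)` is Perl-style regex syntax, which is why `perl = TRUE` is needed.

```r
# Two illustrative tags, one of which contains "http://"
x <- c("URL:http://www.regular-expressions.info", "Person:ashley judd")

# A bare ":" also splits inside "http://", giving three pieces
plain <- strsplit(x, ":")

# ":(?!/)" only matches a colon that is NOT followed by "/"
guarded <- strsplit(x, ":(?!/)", perl = TRUE)

sapply(plain, length)    # the URL entry has one piece too many
sapply(guarded, length)  # every entry has exactly two pieces
```

Note that this guard only covers the `://` case; a colon elsewhere in a value (for example a port number in a URL) would still be split.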

Thorough explanation:

## dat <- readLines(n=5)
## Person:mitch mcconnell, Person:ashley judd, Position:senator
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## Person:ashley judd, topicname:politics
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info

dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
#dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)]) 
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) #break on : not followed by /

# Let the explanation begin...

# Here I have a short list of the variables from the rows
# of the original data frame; in this case, the row numbers:

seq_along(dat3)      #row variables

# Then I use sapply and length to figure out how long the
# split variables in each row (now a list) are:

sapply(dat3, length) #n times

# This tells me how many times to repeat each row variable,
# because each row of dat3 gets stretched into several output rows.
# Here I rep the row variables n times:

rep(seq_along(dat3), sapply(dat3, length))

# better assign that for later:

ID <- rep(seq_along(dat3), sapply(dat3, length))

#============================================
# Now to explain the next chunk...
# I take dat3

dat3

# Each of the 5 elements of the list is itself a list of
# character vectors of length 2 (a Tag Type and a Tag).
# For instance, here's element 5: a list of two
# character vectors of length 2

## [[5]]
## [[5]][[1]]
## [1] "URL"  "www.huffingtonpost.com"
## 
## [[5]][[2]]
## [1] "URL"  "http://www.regular-expressions.info"

# Use str to look at this structure:

dat3[[5]]
str(dat3[[5]])

## List of 2
##  $ : chr [1:2] "URL" "www.huffingtonpost.com"
##  $ : chr [1:2] "URL" "http://www.regular-expressions.info"

# I use lapply (list apply) to apply an anonymous function:
# function(x) do.call(rbind, x) 
#
# to each of the 5 elements.  This basically glues the list 
# of vectors together to make a matrix.  Observe just on element 5:

do.call(rbind, dat3[[5]])

##      [,1]  [,2]                                 
## [1,] "URL" "www.huffingtonpost.com"             
## [2,] "URL" "http://www.regular-expressions.info"

# We use lapply to do that to all elements:

lapply(dat3, function(x) do.call(rbind, x))

# We then call do.call(rbind, ...) on this list of matrices to
# stack them into a single matrix:

do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))

# Let's assign that for later:

the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))

#============================================    
# Now we put it all together with data.frame:

data.frame(ID, the_mat)
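
The steps walked through above can be condensed into one helper. This is only a sketch: the function name `split_tags` and the column names `Tag.Type` and `Tag` are assumptions chosen for illustration, and the body simply reuses the calls from the explanation.

```r
# Hypothetical helper wrapping the steps from the explanation above
split_tags <- function(x) {
  # split on commas and trim surrounding whitespace
  pieces <- lapply(strsplit(x, ","), function(p) gsub("^\\s+|\\s+$", "", p))
  # split each tag on ":" unless the colon is followed by "/"
  pieces <- lapply(pieces, strsplit, ":(?!/)", perl = TRUE)
  # repeat each row number once per tag found in that row
  ID <- rep(seq_along(pieces), sapply(pieces, length))
  # glue the length-2 vectors into one two-column matrix
  the_mat <- do.call(rbind, lapply(pieces, function(p) do.call(rbind, p)))
  data.frame(ID, Tag.Type = the_mat[, 1], Tag = the_mat[, 2],
             stringsAsFactors = FALSE)
}

out <- split_tags(c("Person:ashley judd, topicname:politics",
                    "URL:http://www.regular-expressions.info"))
out
```

Like the original code, this assumes every tag splits into exactly two pieces; an entry without a colon, or with an extra one, would make the `rbind` step fail or misalign.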

edited Apr 17, 2013 at 20:17

answered Apr 9, 2013 at 15:29

Tyler Rinker


7 Comments

NiuBiBang Over a year ago

This seems to be doing the trick. However, when I run the third command:

dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)),     do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) )

I get the following message: Error in function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2) In addition: There were 50 or more warnings (use warnings() to see the first 50)

Tyler Rinker Over a year ago

This issue is specific to your data and it doesn't look like the data you've shown here. You can use debugging tools like debug to figure out the first issue and for the second I'd do as it says and use warnings() to see more specifically why you get the warnings you do.

NiuBiBang Over a year ago

Yes, I saw that one of my tag types was URL, which frequently contained "http:". That ended up breaking the matrix into a non-uniform number of columns when splitting on ":". So I just added a line of code to remove the "http:", between the 1st & 2nd strsplit calls.

Tyler Rinker Over a year ago

@Niu you figured it out, but there's a regex that could have helped. See my edit and Josh's answer, which this change is based on.
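
The "number of columns of matrices must match" error discussed in these comments can be tracked down by counting the pieces each tag splits into. The sketch below uses made-up data (`dat`, with an `example.com` URL) and illustrative names (`n_parts`, `bad_rows`); it is one possible diagnostic, not something from the thread.

```r
# Illustrative data: the first row contains a tag with an extra ":"
dat <- c("Person:ashley judd, URL:http://example.com", "Position:senator")

pieces <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
pieces <- lapply(pieces, strsplit, ":")

# number of pieces per tag, row by row
n_parts <- lapply(pieces, function(x) sapply(x, length))

# rows where at least one tag did not split into exactly 2 pieces
bad_rows <- which(sapply(n_parts, function(n) any(n != 2)))

dat[bad_rows]  # inspect the offending input rows
```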

NiuBiBang Over a year ago

Sorry to re-open a closed case, but could you tell me how to edit the following line of code,

dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) )

, to include other variables that should be repeated down the sequence, such as date, source of post, etc.? For example, if ID 1 was published on 1/2/2012, I would want to see a Date field with 1/2/2012 for all of ID 1's records. I understand the mechanics of the line of code itself, but not the principle well enough to apply it elsewhere.
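
One way to approach the question above is to repeat a row index rather than just the ID, and use it to pull any other columns from the original data frame. This is a sketch under assumptions: the data frame `emtable`, its `Date` column, and the names `row_idx` and `long` are hypothetical and only loosely modelled on the question.

```r
# Hypothetical input with an extra column to carry along
emtable <- data.frame(
  ID = 1:2,
  Date = c("1/2/2012", "1/5/2012"),
  Semantic.Tags = c("Person:mitch mcconnell, Position:senator",
                    "Person:ashley judd"),
  stringsAsFactors = FALSE
)

pieces <- lapply(strsplit(emtable$Semantic.Tags, ","),
                 function(x) gsub("^\\s+|\\s+$", "", x))
pieces <- lapply(pieces, strsplit, ":(?!/)", perl = TRUE)

# one index per output row, pointing back at the source row
row_idx <- rep(seq_len(nrow(emtable)), sapply(pieces, length))
tags <- do.call(rbind, lapply(pieces, function(x) do.call(rbind, x)))

# indexing by row_idx repeats ID, Date (and any other column) as needed
long <- data.frame(emtable[row_idx, c("ID", "Date")],
                   Tag.Type = tags[, 1], Tag = tags[, 2],
                   stringsAsFactors = FALSE)
rownames(long) <- NULL
long
```

The principle is the same as `rep(seq_along(dat3), sapply(dat3, length))` in the answer: the repeated index says which source row each tag came from, so it can select any per-row variable.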


CHP · Accepted Answer · 2013-04-09 15:18:51Z

3

DF
##   ID                                                                                  Semantic.Tags
## 1  1                                   Person:mitch mcconnell, Person:ashley judd, Position:senator
## 2  2        Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## 3  3      Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## 4  4                                                         Person:ashley judd, topicname:politics
## 5  5                URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc


ll <- lapply(strsplit(DF$Semantic.Tags, ","), strsplit, split = ":")

f <- function(x) do.call(rbind, x)

f(lapply(ll, f))
##       [,1]               [,2]                    
##  [1,] "     Person"      "mitch mcconnell"       
##  [2,] " Person"          "ashley judd"           
##  [3,] " Position"        "senator"               
##  [4,] "     Person"      "mitch mcconnell"       
##  [5,] " Position"        "senator"               
##  [6,] " ProvinceOrState" "kentucky"              
##  [7,] " topicname"       "politics "             
##  [8,] "     Person"      "mitch mcconnell"       
##  [9,] " Person"          "ashley judd"           
## [10,] " Organization"    "senate"                
## [11,] " Organization"    "republican "           
## [12,] "     Person"      "ashley judd"           
## [13,] " topicname"       "politics"              
## [14,] "     URL"         "www.huffingtonpost.com"
## [15,] " Company"         "usa today"             
## [16,] " Person"          "chuck todd"            
## [17,] " Company"         "msnbc"
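
The output above still carries the leading and trailing spaces left over from splitting on ",". A minimal variation that trims each tag before the second split is sketched below; it assumes `trimws()` is available (base R 3.2.0 or later) and uses a made-up vector `tags` in place of `DF$Semantic.Tags`.

```r
# Illustrative stand-in for DF$Semantic.Tags
tags <- c("Person:mitch mcconnell, Position:senator ", "Person:ashley judd")

# trim whitespace around each comma-separated piece, then split on ":"
ll <- lapply(strsplit(tags, ","), function(x) strsplit(trimws(x), ":"))

f <- function(x) do.call(rbind, x)
m <- f(lapply(ll, f))
m
```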

answered Apr 9, 2013 at 15:18

CHP


3 Comments

Henrik Over a year ago

(+1) alternatively matrix(rapply(ll, rbind), ncol = 2, byrow = TRUE) for the last two steps.

Henrik Over a year ago

or more transparently: matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)
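
As a small check of the `rapply` suggestion, the sketch below builds a nested list by hand (the list `ll` here is made up, shaped like the one in the answer) and flattens it. With its default `how = "unlist"`, `rapply` returns all leaf values as one vector in order, which `matrix(..., byrow = TRUE)` then folds into two columns.

```r
# Hand-built nested list with the same shape as `ll` in the answer
ll <- list(
  list(c("Person", "a"), c("Position", "b")),
  list(c("Person", "c"))
)

# flatten all leaves in order, then fill a two-column matrix row by row
m <- matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)
m
```

This relies on every leaf having exactly two elements; otherwise type and tag values would drift out of alignment without an error.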

NiuBiBang Over a year ago

Thanks guys, I actually used a combination of code from the above three methods. Ended up working.
