
I have a single field that holds semantic tag types and tags. The type/tag pairs are separated by commas, and within each pair the tag type and the tag are separated by a colon (see below).

ID | Semantic Tags
1  | Person:mitch mcconnell, Person:ashley judd, Position:senator
2  | Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
3  | Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
4  | Person:ashley judd, topicname:politics
5  | URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc

I want to split each tag type (term before colon) & tag (term after colon) into two separate fields: "Tag Type" & "Tag". The resulting file should look something like this:

ID | Tag Type        | Tag
1  | Person          | mitch mcconnell
1  | Person          | ashley judd
1  | Position        | senator
2  | Person          | mitch mcconnell
2  | Position        | senator
2  | ProvinceOrState | kentucky

Here is the code I have so far...

tag<-strsplit(as.character(emtable$Symantic.Tags),",")
tagtype<-strsplit(as.character(tag),":")

But after that, I'm lost! I believe I need to use lapply or sapply for this, but am not sure where that plays in...

My apologies if this has been answered in other forms on the site -- I am new to R & this is still a bit complex for me.

Thanks in advance for anyone's help.

  • Could you please provide a reproducible example using dput(emtable) (or dput(head(emtable)) if that is too much data)? Commented Apr 9, 2013 at 15:03
  • I've reformatted the data to look like their tabular layout. Commented Apr 9, 2013 at 15:18
  • Why didn't you just use dput? It makes things easier for answerers. Commented Apr 9, 2013 at 15:21
  • i am a bad person. and a bad r user. Commented Apr 14, 2013 at 2:08

2 Answers


This is another (slightly different) approach:

dat <- c(
  "Person:mitch mcconnell, Person:ashley judd, Position:senator",
  "Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics",
  "Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican",
  "Person:ashley judd, topicname:politics",
  "URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info"
)

# split on commas, then trim leading/trailing whitespace
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
# optional: keep only the Person and Position tags
#dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)])
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) #break on : not followed by /
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)),
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))
)

colnames(dat3)[-1] <- c("Tag Type", "Tag")

##    ID        Tag Type                                 Tag
## 1   1          Person                     mitch mcconnell
## 2   1          Person                         ashley judd
## 3   1        Position                             senator
## 4   2          Person                     mitch mcconnell
## 5   2        Position                             senator
## 6   2 ProvinceOrState                            kentucky
## 7   2       topicname                            politics
## 8   3          Person                     mitch mcconnell
## 9   3          Person                         ashley judd
## 10  3    Organization                              senate
## 11  3    Organization                          republican
## 12  4          Person                         ashley judd
## 13  4       topicname                            politics
## 14  5             URL              www.huffingtonpost.com
## 15  5             URL http://www.regular-expressions.info
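
The pattern :(?!/) protects the colon in http://, but a tag value could still contain some other colon (a port number or a time, for example), which would again give rows with more than two pieces. A minimal sketch of an alternative, assuming every item has the form type:value and that only the first colon is a separator. The helper name split_first_colon and the example values are made up for illustration; they are not part of the answer above.

```r
# Hypothetical helper: split each "type:value" string at its FIRST colon only.
split_first_colon <- function(x) {
  # position of the first literal colon in each string (-1 if there is none)
  pos <- as.integer(regexpr(":", x, fixed = TRUE))
  type  <- ifelse(pos > 0, substr(x, 1, pos - 1), NA_character_)
  value <- ifelse(pos > 0, substr(x, pos + 1, nchar(x)), x)
  cbind(type, value)   # always a two-column character matrix
}

items <- c("URL:http://www.regular-expressions.info",
           "URL:www.example.com:8080",   # made-up value with a second colon
           "Person:ashley judd")
res <- split_first_colon(items)
res
```

Because the result always has exactly two columns, the later do.call(rbind, ...) step cannot fail with "number of columns of matrices must match".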

Thorough explanation:

dat <- c(
  "Person:mitch mcconnell, Person:ashley judd, Position:senator",
  "Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics",
  "Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican",
  "Person:ashley judd, topicname:politics",
  "URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info"
)

# split on commas, then trim leading/trailing whitespace
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
# optional: keep only the Person and Position tags
#dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)])
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) #break on : not followed by /

# Let the explanation begin...

# seq_along(dat3) gives one index per row of the original
# dataframe; in this case, the row numbers:

seq_along(dat3)      #row variables

# Then I use sapply and length to figure out how many
# tag pairs each row (now a list) was split into

sapply(dat3, length) #n times

# This tells me how many times to repeat each row number,
# because the nested dat3 list gets flattened to one row per tag.
# Here I rep the row numbers n times

rep(seq_along(dat3), sapply(dat3, length))

# better assign that for later:

ID <- rep(seq_along(dat3), sapply(dat3, length))

#============================================
# Now to explain the next chunk...
# I take dat3

dat3

# Each of the 5 elements in the list is itself a list of
# character vectors of length 2: a Tag Type and a Tag.
# For instance, here's element 5, a list of two
# character vectors of length 2

## [[5]]
## [[5]][[1]]
## [1] "URL"  "www.huffingtonpost.com"
## 
## [[5]][[2]]
## [1] "URL"  "http://www.regular-expressions.info"

# Use str to look at this structure:

dat3[[5]]
str(dat3[[5]])

## List of 2
##  $ : chr [1:2] "URL" "www.huffingtonpost.com"
##  $ : chr [1:2] "URL" "http://www.regular-expressions.info"

# I use lapply (list apply) to apply an anonymous function:
# function(x) do.call(rbind, x) 
#
# to each of the 5 elements.  This basically glues the list 
# of vectors together to make a matrix.  Observe just on element 5:

do.call(rbind, dat3[[5]])

##      [,1]  [,2]                                 
## [1,] "URL" "www.huffingtonpost.com"             
## [2,] "URL" "http://www.regular-expressions.info"

# We use lapply to do that to all elements:

lapply(dat3, function(x) do.call(rbind, x))

# We then call do.call(rbind, ...) on this list of matrices
# to stack them into a single matrix

do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))

# Let's assign that for later:

the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))

#============================================    
# Now we put it all together with data.frame:

data.frame(ID, the_mat)

7 Comments

This seems to be doing the trick. However, when I run the third command: dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) ) I get the following message: Error in function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2) In addition: There were 50 or more warnings (use warnings() to see the first 50)
This issue is specific to your data and it doesn't look like the data you've shown here. You can use debugging tools like debug to figure out the first issue and for the second I'd do as it says and use warnings() to see more specifically why you get the warnings you do.
Yes, I saw that one of my tag types was URL, which frequently contained "http:". That ended up breaking the matrix into a non-uniform number of columns when splitting on ":". So I just added a line of code to remove the "http:", between the first and second strsplit calls.
@Niu you figured it out, but there's a regex that could have helped. See my edit, and Josh's answer, which this change is based on.
Sorry to re-open a closed case, but could you tell me how to edit the following line of code, dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) ), to include other variables that should be repeated down the sequence, such as date, source of post, etc.? E.g. if ID 1 was published on 1/2/2012, I would want to see a Date field with 1/2/2012 for all of ID 1's records. I understand the mechanics of the line of code itself, but not the principle well enough to apply it elsewhere.
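
On that last question: the ID vector built with rep() holds the original row number for every tag, so it can also be used to index the rows of the original data and repeat any other column the same number of times. A minimal sketch, assuming a data frame with extra columns named Date and Source next to the tags. The column names and values here are made up for illustration and are not the asker's actual data.

```r
# Made-up example data; Date and Source are hypothetical extra columns.
emtable <- data.frame(
  Date   = c("1/2/2012", "1/3/2012"),
  Source = c("blog", "news"),
  Tags   = c("Person:mitch mcconnell, Position:senator", "Person:ashley judd"),
  stringsAsFactors = FALSE
)

# Same splitting steps as in the answer above
parts <- lapply(strsplit(emtable$Tags, ","), function(x) gsub("^\\s+|\\s+$", "", x))
parts <- lapply(parts, strsplit, ":(?!/)", perl = TRUE)

ID <- rep(seq_along(parts), sapply(parts, length))   # original row number per tag
tag_mat <- do.call(rbind, lapply(parts, function(x) do.call(rbind, x)))

# Indexing the rows by ID repeats each row's Date and Source once per tag.
long <- data.frame(ID,
                   emtable[ID, c("Date", "Source")],
                   TagType = tag_mat[, 1],
                   Tag     = tag_mat[, 2],
                   row.names = NULL,
                   stringsAsFactors = FALSE)
long
```

The principle is the same as for ID itself: anything that is constant within an original row is looked up by that row's number, once per tag.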
DF
##   ID                                                                                  Semantic.Tags
## 1  1                                   Person:mitch mcconnell, Person:ashley judd, Position:senator
## 2  2        Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## 3  3      Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## 4  4                                                         Person:ashley judd, topicname:politics
## 5  5                URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc


# as.character() guards against the column having been read in as a factor
ll <- lapply(strsplit(as.character(DF$Semantic.Tags), ","), strsplit, split = ":")

f <- function(x) do.call(rbind, x)

f(lapply(ll, f))
##       [,1]               [,2]                    
##  [1,] "     Person"      "mitch mcconnell"       
##  [2,] " Person"          "ashley judd"           
##  [3,] " Position"        "senator"               
##  [4,] "     Person"      "mitch mcconnell"       
##  [5,] " Position"        "senator"               
##  [6,] " ProvinceOrState" "kentucky"              
##  [7,] " topicname"       "politics "             
##  [8,] "     Person"      "mitch mcconnell"       
##  [9,] " Person"          "ashley judd"           
## [10,] " Organization"    "senate"                
## [11,] " Organization"    "republican "           
## [12,] "     Person"      "ashley judd"           
## [13,] " topicname"       "politics"              
## [14,] "     URL"         "www.huffingtonpost.com"
## [15,] " Company"         "usa today"             
## [16,] " Person"          "chuck todd"            
## [17,] " Company"         "msnbc"                 
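
The matrix above still carries the padding from the original strings and has no ID column. A minimal sketch of one way to finish it off, assuming DF has the ID and Semantic.Tags columns shown above (rebuilt here with just two rows so the block runs on its own). The output column names TagType and Tag are my own choice, and trimws() needs R 3.2.0 or later.

```r
# Two-row stand-in for the DF printed above
DF <- data.frame(
  ID = 1:2,
  Semantic.Tags = c("Person:mitch mcconnell, Person:ashley judd, Position:senator",
                    "Person:ashley judd, topicname:politics "),
  stringsAsFactors = FALSE
)

ll <- lapply(strsplit(DF$Semantic.Tags, ","), strsplit, split = ":")
f <- function(x) do.call(rbind, x)
m <- f(lapply(ll, f))                 # two-column character matrix, still padded

out <- data.frame(ID      = rep(DF$ID, sapply(ll, length)),  # one ID per tag
                  TagType = trimws(m[, 1]),                  # strip padding
                  Tag     = trimws(m[, 2]),
                  stringsAsFactors = FALSE)
out
```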

3 Comments

(+1) alternatively matrix(rapply(ll, rbind), ncol = 2, byrow = TRUE) for the last two steps.
or more transparently: matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)
Thanks guys, I actually used a combination of code from the above three methods. Ended up working.
