Splitting a dataframe string column into multiple columns without a pattern

Question

I have a data.frame with a column named "Extra" that contains many pieces of information separated by ";". I only want to keep the part of each string starting from (and including) the first occurrence of "MES".

> [1] 
IMPACT=MODIFIER;DISTANCE=3802;STRAND=1;MES-SWA_acceptor_alt=-1.269;MES-SWA_acceptor_diff=-4.016;MES-SWA_acceptor_ref=-5.005;MES-SWA_acceptor_ref_comp=-5.285;MES-SWA_donor_alt=-6.610;MES-SWA_donor_diff=0.781;MES-SWA_donor_ref=-1.165;MES-SWA_donor_ref_comp=-5.829

> [2] 
IMPACT=MODIFIER;STRAND=1;MES-SWA_acceptor_alt=0.965;MES-SWA_acceptor_diff=0.290;MES-SWA_acceptor_ref=1.255;MES-SWA_acceptor_ref_comp=1.255;MES-SWA_donor_alt=-9.796;MES-SWA_donor_diff=-1.219;MES-SWA_donor_ref=-10.341;MES-SWA_donor_ref_comp=-11.015

Splitting the information into several columns by ";" is easy with the function "separate()". However, because not all rows contain exactly the same fields (e.g. DISTANCE is present in the first example but not in the second), the values get shifted and no longer match their columns (see image). I think that's why I get a warning message:

> df <- separate(tabla2, col = "Extra", c("IMPACT=MODIFIER", "DISTANCE", "STRAND", "MES-SWA_acceptor_alt", "MES-SWA_acceptor_diff", "MES-SWA_acceptor_ref", "MES-SWA_acceptor_ref_comp", "MES-SWA_donor_alt", "MES-SWA_donor_diff", "MES-SWA_donor_ref", "MES-SWA_donor_ref_comp"), sep = ";")

>Warning messages:
1: Expected 11 pieces. Additional pieces discarded in 23177 rows [2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, ...]. 
2: Expected 11 pieces. Missing pieces filled with `NA` in 74 rows [1055, 1061, 1062, 1072, 1100, 1101, 1102, 1103, 1104, 1105, 1308, 1319, 1320, 1321, 2684, 2713, 2714, 10494, 10495, 10496, ...].

So if I could just get rid of all the unneeded data that precedes the information I want to keep, I'd be happy. However, the functions I found (substring, substr, separate, nchar...) are not useful in my case because they need a start position, which is not always the same in my data.

I think the closest I got to solving this problem was by combining unlist() and strsplit() like this:

> tabla3 <- tabla2 %>% select(Extra, var_id)
> tabla4 <- unlist(strsplit(tabla2$Extra, "MES-SWA_acceptor_alt="))
> tabla5 <- bind_cols(tabla3, tabla4)
Error: Argument 2 must have names

Could anyone help me out with this issue? I'd be so grateful!

This is my first time posting so I hope everything is clear :)

Could you add expected output for those 2 items? Are you trying to remove everything before the first "MES"?
– zx8754, Commented Oct 26, 2020 at 21:07
Related, possible duplicate: stackoverflow.com/q/59103539/680068
– zx8754, Commented Oct 26, 2020 at 22:19

zx8754 · Accepted Answer · 2020-10-30 15:24:09Z (last edited by Jaap)

Using data.table: split on ";" into new columns, reshape from wide to long, split on "=" into new columns, and finally reshape from long back to wide. This gives aligned column names even when a value is missing; for example, DISTANCE is NA for the second row:

library(data.table)

d <- data.table(Extra =  c("IMPACT=MODIFIER;DISTANCE=3802;STRAND=1;MES-SWA_acceptor_alt=-1.269;MES-SWA_acceptor_diff=-4.016;MES-SWA_acceptor_ref=-5.005;MES-SWA_acceptor_ref_comp=-5.285;MES-SWA_donor_alt=-6.610;MES-SWA_donor_diff=0.781;MES-SWA_donor_ref=-1.165;MES-SWA_donor_ref_comp=-5.829",
                           "IMPACT=MODIFIER;STRAND=1;MES-SWA_acceptor_alt=0.965;MES-SWA_acceptor_diff=0.290;MES-SWA_acceptor_ref=1.255;MES-SWA_acceptor_ref_comp=1.255;MES-SWA_donor_alt=-9.796;MES-SWA_donor_diff=-1.219;MES-SWA_donor_ref=-10.341;MES-SWA_donor_ref_comp=-11.015"))

d[, tstrsplit(Extra, ";")
  ][, id := .I
    ][, melt(.SD, id.vars = "id")
      ][, c("c1", "c2") := tstrsplit(value, "=", type.convert = TRUE)
        ][ , dcast(.SD, id ~ c1, value.var = "c2")]

#    id   NA DISTANCE   IMPACT MES-SWA_acceptor_alt MES-SWA_acceptor_diff
# 1:  1 <NA>     3802 MODIFIER               -1.269                -4.016
# 2:  2 <NA>     <NA> MODIFIER                0.965                 0.290
#    MES-SWA_acceptor_ref MES-SWA_acceptor_ref_comp MES-SWA_donor_alt
# 1:               -5.005                    -5.285            -6.610
# 2:                1.255                     1.255            -9.796
#    MES-SWA_donor_diff MES-SWA_donor_ref MES-SWA_donor_ref_comp STRAND
# 1:              0.781            -1.165                 -5.829      1
# 2:             -1.219           -10.341                -11.015      1
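
The extra `NA` column in the output above appears to come from rows that have fewer ";"-separated pieces: they are padded with `NA` before reshaping, and those padded cells end up as a key named `NA`. A minimal sketch of a variant that avoids this by going straight to long format, and then keeps only the "MES" columns the question asks for, is shown below. The example strings are shortened versions of the two rows from the question, and the names `long`, `wide`, `mes_cols`, and `mes_wide` are made up for illustration.

```r
library(data.table)

# Shortened versions of the two example rows from the question
d <- data.table(Extra = c(
  "IMPACT=MODIFIER;DISTANCE=3802;STRAND=1;MES-SWA_acceptor_alt=-1.269;MES-SWA_donor_alt=-6.610",
  "IMPACT=MODIFIER;STRAND=1;MES-SWA_acceptor_alt=0.965;MES-SWA_donor_alt=-9.796"
))
d[, id := .I]  # row identifier, needed to reshape back to wide

# One row per "key=value" pair; rows with fewer pairs simply produce fewer rows,
# so no NA padding is introduced
long <- d[, .(pair = unlist(strsplit(Extra, ";", fixed = TRUE))), by = id]

# Split each pair into its key and its value
long[, c("key", "value") := tstrsplit(pair, "=", fixed = TRUE)]

# Back to wide: one column per key, NA where a row lacks that key
wide <- dcast(long, id ~ key, value.var = "value")

# Keep only the MES columns and convert them to numeric
mes_cols <- grep("^MES", names(wide), value = TRUE)
wide[, (mes_cols) := lapply(.SD, as.numeric), .SDcols = mes_cols]
mes_wide <- wide[, c("id", mes_cols), with = FALSE]
```

Here `wide` still contains every key (with DISTANCE missing for the second row), while `mes_wide` holds only `id` and the numeric "MES" columns. If a key can occur more than once within a single row of the real data, `dcast()` would need an explicit aggregation rule; that case is not covered by this sketch.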

answered Oct 26, 2020 at 22:05 by zx8754; edited Oct 30, 2020 at 15:24 by Jaap

3 Comments

Mireia Boluda Over a year ago

Thanks so much! I tried this code and, if I create the same data table as you did, the outcome is the same as yours. But when I use my complete data frame, the outcome is wrong: the values in each column are not the original ones. I'm going to show everything here out of curiosity:
#load files with MaxEnt
SMD1070_S2 <- read_delim("C:\\Users\\MaxEnt_results\\SMD1070_S2.txt", "\t", na = "")
d <- as.data.table(SMD1070_S2)
x <- data.table(d$Extra)
names(x)[names(x) == "V1"] <- "Extra"
(FOLLOWED BY YOUR CODE)

zx8754 Over a year ago

@MireiaBoluda then please provide example data that is representative of your real data.

Mireia Boluda Over a year ago

My data frame contains 93 variables and 26873 observations. I am not interested in any of the other variables, only in the one named "Extra", which contains the data I provided (I can't post a picture or paste a whole row due to the character limit).

J. Ring · Accepted Answer · 2020-10-26 21:57:03Z

If I understood your desired output correctly, then the following code should work:

# Given data example
tabla2 <- data.frame(Extra = c(
  "IMPACT=MODIFIER;DISTANCE=3802;STRAND=1;MES-SWA_acceptor_alt=-1.269;MES-SWA_acceptor_diff=-4.016;MES-SWA_acceptor_ref=-5.005;MES-SWA_acceptor_ref_comp=-5.285;MES-SWA_donor_alt=-6.610;MES-SWA_donor_diff=0.781;MES-SWA_donor_ref=-1.165;MES-SWA_donor_ref_comp=-5.829",
  "IMPACT=MODIFIER;STRAND=1;MES-SWA_acceptor_alt=0.965;MES-SWA_acceptor_diff=0.290;MES-SWA_acceptor_ref=1.255;MES-SWA_acceptor_ref_comp=1.255;MES-SWA_donor_alt=-9.796;MES-SWA_donor_diff=-1.219;MES-SWA_donor_ref=-10.341;MES-SWA_donor_ref_comp=-11.015"
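  ),
  stringsAsFactors = FALSE  # keep Extra as character (the default before R 4.0 was factor)
)
# Empty data frame
temp_df <- data.frame()
# Split everything by ";"
temp_list <- strsplit(tabla2$Extra, split = ";")
# Cycle through elements to fill the data frame
for (i in seq_along(temp_list)) {
  # Each element is "key=value"; split it into c(key, value)
  temp_list_2 <- strsplit(temp_list[[i]], split = "=")
  for (j in seq_along(temp_list_2)) {
    # Column is named after the key; keys missing from a row stay NA
    temp_df[i, temp_list_2[[j]][1]] <- temp_list_2[[j]][2]
  }
}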
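
The loop above leaves every value as a character string and keeps all keys. As a hedged sketch of the remaining step, the base R code below parses the same "key=value" pairs without growing a data frame cell by cell, keeps only the columns whose names start with "MES", and converts them to numeric. It also shows a string-level alternative that simply drops everything before the first "MES". The helper `parse_extra()` and the variable names are hypothetical, and the example strings are shortened versions of the rows in the question.

```r
# Shortened versions of the two example rows from the question
extra <- c(
  "IMPACT=MODIFIER;DISTANCE=3802;STRAND=1;MES-SWA_acceptor_alt=-1.269;MES-SWA_donor_alt=-6.610",
  "IMPACT=MODIFIER;STRAND=1;MES-SWA_acceptor_alt=0.965;MES-SWA_donor_alt=-9.796"
)

# Hypothetical helper: turn one "k1=v1;k2=v2" string into a named character vector
parse_extra <- function(x) {
  pairs <- strsplit(x, ";", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", pairs)     # text before the first "="
  vals <- sub("^[^=]*=", "", pairs)  # text after the first "="
  setNames(vals, keys)
}

parsed <- lapply(extra, parse_extra)

# Union of all keys, in order of first appearance
all_keys <- unique(unlist(lapply(parsed, names)))

# Index each row by the full key set; keys a row lacks become NA
wide <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[all_keys]))),
  stringsAsFactors = FALSE
)
names(wide) <- all_keys

# Keep only the MES columns and convert them to numeric
mes <- wide[, startsWith(names(wide), "MES"), drop = FALSE]
mes[] <- lapply(mes, as.numeric)

# String-level alternative: drop everything before the first "MES"
# (strings without "MES" are returned unchanged)
mes_tail <- sub("^.*?(MES)", "\\1", extra, perl = TRUE)
```

`mes` is the aligned numeric table, while `mes_tail` keeps the original string format and only trims the leading fields, which is closer to the literal wording of the question.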
