How to do multiple sequence alignment of text strings (utf8) in R

Question

Given five strings:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

I would like to do multiple sequence alignment so that I get the following result:

abcd
 bcde
  cdef
a    f
  cd  ghi

Using the msa() function from the msa package I tried:

msa(seq, type = "protein", order = "input", method = "Muscle")

and got the following result:

    aln     names
 [1] ABCD--- Seq1
 [2] -BCDE-- Seq2
 [3] --CD-EF Seq3
 [4] -----AF Seq4
 [5] --CDGHI Seq5
 Con --CD-?? Consensus

I would like to use this function for sequences that can contain any unicode characters, but already in this example the function gives a warning: invalid letters found. Any ideas?
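One possible workaround for the restricted alphabet, sketched here under assumptions and not taken from the answers below, is to recode each distinct character to one of the 20 standard amino-acid letters before alignment and to decode the aligned strings afterwards. The helper names `encode_symbols` and `decode_symbols` are hypothetical, the approach only works for at most 20 distinct characters, and a protein aligner would still score the stand-in letters with a biological substitution matrix, so the resulting alignment may not reflect plain character identity.

```r
# Hypothetical helpers (not part of msa): recode arbitrary characters to the
# 20 standard amino-acid letters and map aligned output back again.
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

encode_symbols <- function(x) {
  symbols <- unique(unlist(strsplit(x, "")))
  if (length(symbols) > length(aa)) stop("more than 20 distinct characters")
  key <- setNames(aa[seq_along(symbols)], symbols)   # original -> stand-in
  encoded <- vapply(strsplit(x, ""),
                    function(ch) paste(key[ch], collapse = ""), character(1))
  list(encoded = encoded, key = key)
}

decode_symbols <- function(aligned, key, gap = "-") {
  # stand-in -> original, with the gap symbol mapped to itself
  back <- c(setNames(names(key), unname(key)), setNames(gap, gap))
  vapply(strsplit(aligned, ""),
         function(ch) paste(back[ch], collapse = ""), character(1))
}

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
enc <- encode_symbols(seq)
enc$encoded   # strings that use only standard amino-acid letters
```

The encoded strings could then be passed to the aligner in place of `seq`, and `decode_symbols(aligned, enc$key)` applied to the aligned rows it returns.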

It's a good question, but your expected output isn't fully specified. What happens if a string contains no letters from within the previous string? What happens if it contains letters that were present in an earlier string but not the one immediately before? Should the order be fixed according to the input vector, or should it be changed to maximize alignment? What should the format of the output be? Should it be printed to screen, returned as a character vector, or a character scalar with newline characters in it? Details matter here.
– Allan Cameron, Commented May 25, 2022 at 12:39
You may also consider mafft --anysymbol. More info: mafft.cbrc.jp/alignment/software/anysymbol.html
– Ghoti, Commented May 26, 2022 at 20:39

Allan Cameron · Accepted Answer · 2022-05-25 19:39:36Z

3

Here's a solution in base R that outputs a table:

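seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

# Every distinct character, in order of first appearance; one column each
all_chars <- unique(unlist(strsplit(seq, "")))

# Count each character per string, then show it in its column if present.
# x > 0 (rather than x == 1) keeps characters that occur more than once.
tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""), 
       function(x) table(factor(x, all_chars)))), 1,
       function(x) ifelse(x > 0, all_chars, " ")))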

We can print the output without quotes to see it more clearly:

print(tab, quote = FALSE)
#>      a b c d e f g h i
#> [1,] a b c d          
#> [2,]   b c d e        
#> [3,]     c d e f      
#> [4,] a         f      
#> [5,]     c d     g h i

Created on 2022-05-25 by the reprex package (v2.0.1)
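If a character vector is preferred over a table, the same idea can be wrapped in a small function. This is a sketch rather than part of the original answer: `align_by_charset` and its `gap` argument are hypothetical names. Each string is placed into one column per distinct character, so a character repeated within a string appears only once and the original order inside a string is not preserved.

```r
# Hypothetical helper: one column per distinct character, returned as
# padded strings instead of a matrix.
align_by_charset <- function(x, gap = " ") {
  chars <- strsplit(x, "")              # split each string into characters
  all_chars <- unique(unlist(chars))    # columns, in order of first appearance
  vapply(chars, function(ch) {
    # keep the column's character if the string contains it, else a gap
    paste(ifelse(all_chars %in% ch, all_chars, gap), collapse = "")
  }, character(1))
}

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
aligned <- align_by_charset(seq)
writeLines(aligned)
#> abcd
#>  bcde
#>   cdef
#> a    f
#>   cd  ghi
```

Every returned string has the same width (here nine characters, with trailing spaces), so the result can be printed with `writeLines()` or joined with `paste(aligned, collapse = "\n")`.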

answered May 25, 2022 at 19:39

Allan Cameron


7 Comments

WJH Over a year ago

How can I refer to this algorithm when mentioning this in a paper?

Allan Cameron Over a year ago

I think I would just say that "the sequences were matched by fitting them into a tabular array of all the unique characters in the entire set."

WJH Over a year ago

OK, thanks. Do you maybe also have a version that keeps the order of the letters within a string? Assume we add 'fa'. Then either the a or the f can match the a or f in the other strings that contain them, but the a and the f cannot both match. Or with 'afa', the last a will not match an a in any of the other strings.

Allan Cameron Over a year ago

@WJH how would you know which a to match in the 'afa' case? And suppose you had 'fab' - would you want the f to match on its own, with the ab sticking out at the end, or would you want the 'ab' to match, with the f sticking out on the left?

WJH Over a year ago

I would always align so that the number of matching letters is maximized. So in the case of 'afa' I would match the first 'a' and the 'f', with the last 'a' sticking out on the right, and in the case of 'fab' I would match 'a' and 'b', with the 'f' sticking out on the left.
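For two strings, the criterion described in these comments (keep each string's character order and maximize the number of matched letters) corresponds to aligning along a longest common subsequence. The following base R sketch is not code from either answer: `lcs_align` and its `gap` argument are hypothetical names, it handles only a pair of strings, and extending it to many strings would need something like progressive alignment on top.

```r
# Hypothetical helper: align two strings so that the number of matched
# characters is maximal while character order is preserved (LCS-based).
lcs_align <- function(a, b, gap = "-") {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # len[i + 1, j + 1] = LCS length of the first i chars of x and first j of y
  len <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (x[i] == y[j]) {
        len[i + 1, j + 1] <- len[i, j] + 1L
      } else {
        len[i + 1, j + 1] <- max(len[i, j + 1], len[i + 1, j])
      }
    }
  }
  # Trace back from the bottom-right corner, building both rows right to left
  i <- n
  j <- m
  top <- character(0)
  bottom <- character(0)
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && x[i] == y[j]) {
      top <- c(x[i], top); bottom <- c(y[j], bottom)   # matched pair
      i <- i - 1; j <- j - 1
    } else if (j > 0 && (i == 0 || len[i + 1, j] >= len[i, j + 1])) {
      top <- c(gap, top); bottom <- c(y[j], bottom)    # y[j] is unmatched
      j <- j - 1
    } else {
      top <- c(x[i], top); bottom <- c(gap, bottom)    # x[i] is unmatched
      i <- i - 1
    }
  }
  c(paste(top, collapse = ""), paste(bottom, collapse = ""))
}

lcs_align("abcd", "fab")
#> [1] "-abcd" "fab--"
lcs_align("af", "afa")
#> [1] "af-" "afa"
```

In both examples the unmatched letter sticks out on the side described above: the 'f' of 'fab' on the left, and the last 'a' of 'afa' on the right.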


WJH · Accepted Answer · 2022-05-27 21:16:45Z

1

A solution is to use LingPy. First install LingPy according to the instructions at: http://lingpy.org/tutorial/installation.html. Then run:

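library(reticulate)

# Python built-ins (for print) and the LingPy module, loaded via reticulate
builtins <- import_builtins()
lingpy   <- import("lingpy")

seqs <- c("mɪlk","mɔˑlkə","mɛˑlək","mɪlɪx","mɑˑlʲk")

# Build a multiple-alignment object and run progressive alignment on it
multi <- lingpy$Multiple(seqs)
multi$prog_align()
builtins$print(multi)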

Output:

m   ɪ   l   -   k   -
m   ɔˑ  l   -   k   ə
m   ɛˑ  l   ə   k   -
m   ɪ   l   ɪ   x   -
m   ɑˑ  lʲ  -   k   -

edited May 27, 2022 at 21:16

answered May 27, 2022 at 19:03

WJH
