2

Given three strings:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

I would like to do multiple sequence alignment so that I get the following result:

abcd
 bcde
  cdef
a    f
  cd  ghi

Using the msa() function from the msa package I tried:

msa(seq, type = "protein", order = "input", method = "Muscle")

and got the following result:

    aln     names
 [1] ABCD--- Seq1
 [2] -BCDE-- Seq2
 [3] --CD-EF Seq3
 [4] -----AF Seq4
 [5] --CDGHI Seq5
 Con --CD-?? Consensus   

I would like to use this function for sequences that can contain any unicode characters, but already in this example the function gives a warning: invalid letters found. Any ideas?

3
  • It's a good question, but your expected output isn't fully specified. What happens if a string contains no letters from within the previous string? What happens if it contains letters that were present in an earlier string but not the one immediately before? Should the order be fixed according to the input vector, or should it be changed to maximize alignment? What should the format of the output be? Should it be printed to screen, returned as a character vector, or a character scalar with new line characters in it? Details matter here. Commented May 25, 2022 at 12:39
  • Sure, I added the details as far as I could. Commented May 25, 2022 at 19:25
  • You may also consider mafft --anysymbol. More info: mafft.cbrc.jp/alignment/software/anysymbol.html Commented May 26, 2022 at 20:39

2 Answers 2

3

Here's a solution in base R that outputs a table:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

all_chars <- unique(unlist(strsplit(seq, "")))

tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""), 
       function(x) table(factor(x, all_chars)))), 1,
       function(x) ifelse(x == 1, all_chars, " ")))

We can print the output without quotes to see it more clearly:

print(tab, quote = FALSE)
#>      a b c d e f g h i
#> [1,] a b c d          
#> [2,]   b c d e        
#> [3,]     c d e f      
#> [4,] a         f      
#> [5,]     c d     g h i

Created on 2022-05-25 by the reprex package (v2.0.1)

Sign up to request clarification or add additional context in comments.

7 Comments

How can I refer to this algorithm when mentioning this in a paper?
I think I would just say that "the sequences were matched by fitting them into a tabular array of all the unique characters in the entire set."
OK, thanks. Do you maybe also have a version that keeps the order of the letters within a string. Assume we add 'fa'. Then either the a or the f matches with a or f in other strings having a or f, but not both the a and the f can match. Or when having 'afa', the last a will not match with an a in any of the other strinigs.
@WJH how would you know which a to match in the 'afa' case? And suppose you had 'fab' - would you want the f to match on its own, with the ab sticking out at the end, or would you want the 'ab' to match, with the f sticking out on the left?
I would always add so that the number of matching letters is maximized. So in case of 'afa' I would match the first 'a' and the 'f', and the last 'a' sticking out on the right, and in case of 'fab' I would match 'a' and 'b' and sticking out 'f' on the 'left.'
|
1

A solution is to use LingPy. First install LingPy according to the instructions at: http://lingpy.org/tutorial/installation.html. Then run:

library(reticulate)

builtins <- import_builtins()
lingpy   <- import("lingpy")

seqs <- c("mɪlk","mɔˑlkə","mɛˑlək","mɪlɪx","mɑˑlʲk")

multi <- lingpy$Multiple(seqs)
multi$prog_align()
builtins$print(multi)

Output:

m   ɪ   l   -   k   -
m   ɔˑ  l   -   k   ə
m   ɛˑ  l   ə   k   -
m   ɪ   l   ɪ   x   -
m   ɑˑ  lʲ  -   k   -

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.