
I'm a biologist trying to solve the problem below; please guide me on where to start. I do not know which forum to post this in, so please let me know if this is not the appropriate place.

I have blocks of values, each of which comes strictly from one of two sources or from a mixture of the two sources.

'source1', 'source2' and 'mixture' are keywords in the real data.

The source values are limited to the set { AA, TT, GG, CC }.

The mixture values are limited to the set { AA, TT, GG, CC, AT, AG, AC, TG, TC, GC }, but they depend on the sources within the same block. So if, within blockN,

source1 = XX, where X is one of {A, T, G, C}

source2 = YY, where Y is one of {A, T, G, C}

then the mixture values have to be among { XX, YY, XY }.
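The constraint above can be sketched in Python. This is only an illustration: the function name `allowed_mixtures` and the constant `BASE_ORDER` are hypothetical, and the A, T, G, C ordering is an assumption read off the mixture set listed above (AT, AG, AC, TG, TC, GC).

```python
# Assumed canonical base order, inferred from the mixture set in the question
# (AT, AG, AC, TG, TC, GC): A before T before G before C.
BASE_ORDER = "ATGC"


def allowed_mixtures(source1, source2):
    """Return the mixture values permitted by two sources such as 'GG' and 'TT'."""
    x, y = source1[0], source2[0]
    # Order the two bases by BASE_ORDER so the pair is always written the
    # same way, e.g. "TG" rather than "GT".
    first, second = sorted((x, y), key=BASE_ORDER.index)
    return {source1, source2, first + second}


print(sorted(allowed_mixtures("GG", "TT")))  # ['GG', 'TG', 'TT']
```

A mixture value outside this set would mean the block is inconsistent with its sources.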

Sometimes either or both sources are missing from my data; in that case I want to insert the missing source values.

If, within a block, Source1 is missing, Source2 is XX, and one of the mixture values is YY, then we know Source1 is YY. Similarly, if Source1 is missing, Source2 is XX, and one of the mixture values is XY, then Source1 is YY. So there are two ways of working out the missing source, depending on what is present in the mixture set.

There can be cases where both sources are absent, but there are mixture values XY in the block. This tells us Source1 and Source2 are XX and YY (or YY and XX; the order does not matter).
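The inference rules above could be sketched in Python roughly as follows. The function name `infer_sources` is hypothetical, an empty string stands for a missing source, and leaving a source empty when the mixtures do not determine it is an assumption, since that case is not covered above.

```python
def infer_sources(source1, source2, mixtures):
    """Fill in missing sources ('' means missing) from the mixture values."""
    # Every base seen in the mixtures must come from one of the two sources.
    bases = sorted(set("".join(mixtures)))
    # Bases already accounted for by a known source.
    known = {s[0] for s in (source1, source2) if s}
    # Bases not explained by a known source, in sorted (deterministic) order.
    unexplained = [b for b in bases if b not in known]
    if not source1 and unexplained:
        source1 = unexplained.pop(0) * 2
    if not source2 and unexplained:
        source2 = unexplained.pop(0) * 2
    # Assumption: a source that cannot be determined stays ''.
    return source1, source2


print(infer_sources("", "TT", ["TG", "TG"]))      # ('GG', 'TT')
print(infer_sources("", "", ["AT", "AA", "TT"]))  # ('AA', 'TT')
```

When both sources are missing, this sketch assigns them in alphabetical order, which is acceptable here because the order of the two sources does not matter.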

If my example data is

block1 source1 AA 
block1 source2 TT
block1 mixture AT
block1 mixture AA
block1 mixture TT

block2 source1 GG
block2 source2 TT
block2 mixture TG
block2 mixture TG
block2 mixture TT

block3 source1 AA
block3 source2 TT
block3 mixture AT
block3 mixture AA
block3 mixture TT

block4 mixture AT
block4 mixture AA
block4 mixture TT

block5 source2 TT
block5 mixture TG
block5 mixture TG

The output which I want is

block1 source1 AA 
block1 source2 TT
block1 mixture AT
block1 mixture AA
block1 mixture TT

block2 source1 GG
block2 source2 TT
block2 mixture TG
block2 mixture TG
block2 mixture TT

block3 source1 AA
block3 source2 TT
block3 mixture AT
block3 mixture AA
block3 mixture TT

block4 source1 AA
block4 source2 TT
block4 mixture AT
block4 mixture AA
block4 mixture TT


block5 source1 GG
block5 source2 TT
block5 mixture TG
block5 mixture TG

Please note the insertions in blocks 4 and 5. I have separated the blocks for ease of understanding; in the real data they are not separated by blank lines.

  • I don't understand why AA and TT belong to the mixture, but it is probably not important. So, from a programming perspective: determine whether two different characters occur in the mixture data. If so, add source1 and source2, each being one of these two characters doubled. Commented Dec 9, 2014 at 19:39
  • AA, TT and AT are legitimate values in the mixture if the sources are AA and TT. From a programming perspective there are two ways to determine the missing source. 1) If the mixture has both AA and TT, and one of the sources is AA, then the other source is TT. 2) If one or more of the mixture values consist of two different characters, as you mentioned, like AT, and one of the sources is AA, then the other is TT. So the missing source can be determined from either of the two points above, provided such data is available. Please let me know if I can clarify further. Commented Dec 9, 2014 at 21:07

1 Answer


This feels like it could be done in a simpler way, but the best I can come up with after an hour of head-scratching is this Python script:

#! /usr/bin/env python3

import sys, os

class Block:
    block_id = ''
    source1 = ''
    source2 = ''
    mixtures = []
    def __init__(self, block_id='', source1='', source2='', mixtures=()):
        self.block_id = block_id
        # Copy into a fresh list so no default list is shared between instances
        self.mixtures = list(mixtures)
        self.source1 = source1
        self.source2 = source2

        # Convert mixtures to a set of characters. For example, 
        #     ''.join(["AT", "TT"]) 
        # creates the string "ATTT". set() then converts that string to 
        # a set of characters {'A', 'T'}
        sources = set(''.join(mixtures))

        # If a source is empty, we take from the set the first element (pop()) 
        # after removing the other source (difference()). Since the set 
        # contains single characters, we double it to get "AA", "TT", etc.
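        if self.source1 == '':
            candidates = sources.difference(set(self.source2))
            # Leave source1 empty if the mixtures do not determine it
            if candidates:
                self.source1 = candidates.pop()*2
        # discard() does not raise if source1's base never occurs in the mixtures
        sources.discard(self.source1[:1])
        if self.source2 == '' and sources:
            self.source2 = sources.pop()*2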

    def print (self):
        print (self.block_id, "source1", self.source1)
        print (self.block_id, "source2", self.source2)
        for mix in self.mixtures:
            print (self.block_id, "mixture", mix)

if len(sys.argv) == 1:
    files = [sys.stdin]
else:
    files = (open(f) for f in sys.argv[1:])

for f in files:
    # Read in all the lines, skipping blank ones; split() also tolerates
    # repeated or trailing whitespace
    data = [rawline.split() for rawline in f if rawline.strip()]
    # Get the unique block IDs
    blocks = set(line[0] for line in data)
    # For each block ID
    for b in blocks:
        # The corresponding mixtures
        mix = [line[2] for line in data if line[0] == b and "mixture" == line[1]]

        # If "source1 XX" is present, we will get the list ['XX'], and [] if 
        # source1 is not present. ''.join() allows us to flatten ['XX']  to 
        # just 'XX' (and doesn't affect []). Similarly for source2.
        source1 = ''.join(line[2] for line in data if line[0] == b and "source1" == line[1])
        source2 = ''.join(line[2] for line in data if line[0] == b and "source2" == line[1])

        # Create an object of the class defined above, and print it.
        # Initialization fills up the blank values.
        Block(b, source1, source2, mix).print()

Even then, the blocks will come out in arbitrary order (block3 data may come before block1, for example), because the block IDs are collected in a set, which is unordered.

Save this in a script (say, insert.py) and run:

python3 insert.py inputfile
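The arbitrary block order comes from gathering the block IDs with set(). One possible way around it, as a minimal sketch assuming Python 3.7 or later (where plain dicts keep insertion order), is to group the lines in a dict instead. The function name `group_by_block` and the layout of the returned dict are hypothetical, not part of the script above.

```python
def group_by_block(lines):
    """Group 'block kind value' lines by block ID, keeping first-seen order."""
    blocks = {}  # plain dicts preserve insertion order in Python 3.7+
    for raw in lines:
        fields = raw.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        block_id, kind, value = fields
        entry = blocks.setdefault(
            block_id, {"source1": "", "source2": "", "mixture": []}
        )
        if kind == "mixture":
            entry["mixture"].append(value)
        elif kind in ("source1", "source2"):
            entry[kind] = value
    return blocks


sample = [
    "block2 mixture TG",
    "block1 source1 AA",
    "block2 source2 TT",
    "block1 mixture AT",
]
print(list(group_by_block(sample)))  # ['block2', 'block1']
```

Because each line is filed under its block ID as it is read, this grouping also copes with input where the lines of one block are scattered through the file.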

I also rewrote the script in awk:

#! /usr/bin/awk -f

function build (block, source1, source2, sources, mixtures)
{       
    if (! source1)
    {
        for (char in sources)
        {
            if (source2 != char char)
            {
                source1 = char char
                delete sources[char]
                break
            }
        }
    }
    if (! source2)
    {   
        for (char in sources)
        {
            if (source1 != char char)
            {
                source2 = char char
                delete sources[char]
                break
            }
        }
    }
    printf "%s %s %s\n", block, "source1", source1
    printf "%s %s %s\n", block, "source2", source2
    for (m in mixtures)
    {
        for (i = 0; i < mixtures[m]; i++)
        {
            printf "%s %s %s\n", block, "mixture", m
        }
    }

}

{
    if (prev != $1)
    {
        if (prev in data)
        {
            build(prev, source1, source2, sources, mixtures)
        }

        prev = $1
        source1 = ""
        source2 = ""
        delete sources
        delete mixtures
    }

    data[$1]++
    if ($2 == "source1") {source1 = $3; next}
    if ($2 == "source2") {source2 = $3; next}
    if ($2 == "mixture")
    {
        mixtures[$3]++ 
        split ($3, chars, "")
        for (i=1; i <= length($3); i++)
        {
            sources[chars[i]]++
        }
    }
}

END { build(prev, source1, source2, sources, mixtures) }

Save this in a script (say insert.awk), chmod +x it, and run it:

./insert.awk inputfile

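Now it should retain the order of the blocks as well. Within a block, though, identical mixture values are printed together, and distinct values come out in whatever order awk's for (m in mixtures) yields, so the mixture lines may not match the input order. Note that I have used delete on whole arrays, which may not be present in some awks (but should be in GNU awk and mawk).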

  • Thanks Muru, I'm having trouble running this on Python 2.6.6; it errors out on the def print line (line 31). Is there an alternative syntax for Python 2.6.6? Sorry, I'm doing this for the first time. Commented Dec 10, 2014 at 3:13
  • @PanJian I don't know anything about 2.6 syntax, sorry. :( I rewrote it in awk, so you could try that one. Commented Dec 10, 2014 at 10:10
  • Thanks a lot Muru, it's working well within chunks of data. In the real data my groups are scattered all over the file; is there some sorting that needs to be done before the awk script? Commented Dec 10, 2014 at 16:06
  • I think it works fine if I sort on the 1st column beforehand. Thanks, you are fabulous!!! Commented Dec 10, 2014 at 16:14
  • @PanJian Ah, yes. I assumed: 1. for an N, all the blockN lines appear together, and 2. All the mixture lines appear after the source lines for a block. Commented Dec 10, 2014 at 16:29
