I'm a biologist trying to solve this problem; please guide me on where to start. I do not know which forum to post this in, so please let me know if this place is not appropriate.
I have blocks of values, each of which comes strictly from one of two sources or from a mixture of the two sources.
'source1', 'source2' and 'mixture' are keywords in the real data.
The source values are limited to the set { AA, TT, GG, CC }.
The mixture values are limited to the set { AA, TT, GG, CC, AT, AG, AC, TG, TC, GC },
but the mixture values depend on the sources within the same block.
So if, within blockN,
source1 = XX, where X is one of {A, T, G, C}
source2 = YY, where Y is one of {A, T, G, C}
then the mixture values have to be among { XX, YY, XY }.
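To make the rule above concrete, here is a small Python sketch. The function name `allowed_mixtures` is made up for illustration, and it assumes a mixed value may be written in either letter order (XY or YX), which may be stricter or looser than the real data requires.

```python
def allowed_mixtures(source1, source2):
    """Return the mixture values consistent with two sources such as 'AA' and 'TT'."""
    x, y = source1[0], source2[0]
    # Assumption: a mixed value may appear as XY or YX, so both orders are accepted.
    return {x + x, y + y, x + y, y + x}


def block_is_consistent(source1, source2, mixtures):
    """True if every mixture value in the block is allowed by its two sources."""
    allowed = allowed_mixtures(source1, source2)
    return all(value in allowed for value in mixtures)
```

For block2 of the example data, `block_is_consistent("GG", "TT", ["TG", "TG", "TT"])` would return True, while a stray `AA` in that block would make it False.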
Sometimes one or both sources are missing from my data. In that case I want to insert the missing source values.
If, within a block, Source1 is missing, Source2 is XX, and one of the mixture values is YY, then we know Source1 is YY.
Another example: if, within a block, Source1 is missing, Source2 is XX, and one of the mixture values is XY, then Source1 is YY.
As you can see, these are two ways of determining the missing source, depending on what is present in the mixture set.
There can also be cases where both sources are absent but there are mixture values XY in the block. This tells us that Source1 and Source2
are XX and YY (or YY and XX; the order does not matter).
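The inference rules above could be sketched in Python roughly as follows. The function name `infer_sources` and the use of `None` for a missing source are choices made for this illustration. Where the letters seen in the mixture values do not pin down the answer, the sketch leaves the source as `None` rather than guessing.

```python
def infer_sources(source1, source2, mixtures):
    """Fill in missing sources (None) of one block from its mixture values.

    Returns (source1, source2); a source stays None if it cannot be determined.
    """
    # Distinct letters seen across the mixture values, in first-seen order.
    letters = []
    for value in mixtures:
        for ch in value:
            if ch not in letters:
                letters.append(ch)

    if source1 is None and source2 is None:
        # Both missing: two distinct letters X and Y give XX and YY (order arbitrary).
        if len(letters) == 2:
            return letters[0] * 2, letters[1] * 2
        return None, None

    if source1 is not None and source2 is not None:
        return source1, source2  # nothing to infer

    # Exactly one source is known: the missing one is the other letter, doubled.
    known = source1 if source1 is not None else source2
    others = [ch for ch in letters if ch != known[0]]
    missing = others[0] * 2 if len(others) == 1 else None
    return (missing, source2) if source1 is None else (source1, missing)
```

With the example data, block4 corresponds to `infer_sources(None, None, ["AT", "AA", "TT"])` and block5 to `infer_sources(None, "TT", ["TG", "TG"])`.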
My example data is:
block1 source1 AA
block1 source2 TT
block1 mixture AT
block1 mixture AA
block1 mixture TT
block2 source1 GG
block2 source2 TT
block2 mixture TG
block2 mixture TG
block2 mixture TT
block3 source1 AA
block3 source2 TT
block3 mixture AT
block3 mixture AA
block3 mixture TT
block4 mixture AT
block4 mixture AA
block4 mixture TT
block5 source2 TT
block5 mixture TG
block5 mixture TG
The output I want is:
block1 source1 AA
block1 source2 TT
block1 mixture AT
block1 mixture AA
block1 mixture TT
block2 source1 GG
block2 source2 TT
block2 mixture TG
block2 mixture TG
block2 mixture TT
block3 source1 AA
block3 source2 TT
block3 mixture AT
block3 mixture AA
block3 mixture TT
block4 source1 AA
block4 source2 TT
block4 mixture AT
block4 mixture AA
block4 mixture TT
block5 source1 GG
block5 source2 TT
block5 mixture TG
block5 mixture TG
Please note the insertions in blocks 4 and 5. I have separated the blocks for ease of understanding; in the real data they are not separated by blank lines.
AA and TT belong to mixture, but that is probably not important. So from a programming perspective: it has to be determined whether two different chars occur in the mixture data of a block. If so, source1 and source2 are to be added, each consisting of one of these two chars twice.
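One possible way to turn this into a whole-file pass, as a hedged Python sketch: it assumes every line has exactly three whitespace-separated fields (block, keyword, value), and it writes each block with its sources first and its mixture lines after them. The function name `fill_missing_sources` is hypothetical.

```python
def fill_missing_sources(lines):
    """Return the records with missing source1/source2 lines inserted per block."""
    # Group records by block; dicts keep insertion order in Python 3.7+.
    blocks = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # assumption: blank or malformed lines are skipped
        block, kind, value = parts
        entry = blocks.setdefault(block, {"source1": None, "source2": None, "mixture": []})
        if kind == "mixture":
            entry["mixture"].append(value)
        elif kind in ("source1", "source2"):
            entry[kind] = value

    out = []
    for block, entry in blocks.items():
        # Distinct letters occurring in this block's mixture values.
        letters = []
        for value in entry["mixture"]:
            for ch in value:
                if ch not in letters:
                    letters.append(ch)

        s1, s2 = entry["source1"], entry["source2"]
        if s1 is None and s2 is None:
            if len(letters) == 2:  # two different chars: each one, doubled
                s1, s2 = letters[0] * 2, letters[1] * 2
        elif s1 is None or s2 is None:
            known = s1 if s1 is not None else s2
            others = [ch for ch in letters if ch != known[0]]
            if len(others) == 1:
                if s1 is None:
                    s1 = others[0] * 2
                else:
                    s2 = others[0] * 2

        # Sources that could not be determined are simply left out.
        if s1 is not None:
            out.append(f"{block} source1 {s1}")
        if s2 is not None:
            out.append(f"{block} source2 {s2}")
        out.extend(f"{block} mixture {m}" for m in entry["mixture"])
    return out
```

Called on a file with something like `fill_missing_sources(open("data.txt"))` (the file name is a placeholder), it returns the output lines as a list, which can then be printed or written back out.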