I'm a Python beginner trying to write a script that converts specified rows into columns, using a tab-delimited text file as input. Here is an example of lines in the file:
1 chr1 1008376 1258657 250281 4628 666 2832 565 16.6323226376 83.3676773624
1 chr1 1258657 1516806 258149 2544 601 1481 231 13.4929906542 86.5070093458
1 chr1 1516806 1766886 250080 1652 590 936 63 6.30630630631 93.6936936937
1 chr1 1766886 2017159 250273 5030 1608 2698 362 11.8300653595 88.1699346405
Essentially, the file goes through a list of regions (columns 2-3, start and end) on a chromosome (column 1) for an individual (column 0) and gives a statistic calculated for each region (column 9); columns are counted from 0. The file first lists all the regions for individual 1, then individual 2, and so on until the final individual. There are 20 individuals in the file.
I'd like a new file that drops columns 0 and 4-8 and instead has one score column per individual for the region in that row. Each row would have the chromosome (old column 1, e.g. chr1) as column 0 and the region as columns 1-2, followed by 20 columns holding the scores for the 20 individuals: column 3 would be what was previously column 9 for individual 1, column 4 would be the score for that region in individual 2, and so on. Currently the scores are in rows, so the file has a lot of rows. The values in columns 1-3 are identical for every individual, so there is no issue of regions not overlapping, and all individuals have the same number of rows. In other words, columns 2 and 3 are duplicated 20 times in the file.
If that explanation is too dense, below is a stripped-down example to illustrate the problem.
Here is a simple dummy example of what I would like:
Original file:
1 chr1 10 20 30423
1 chr1 20 30 40556
2 chr1 10 20 73476
2 chr1 20 30 43657
3 chr1 10 20 34656.5
3 chr1 20 30 90848
changed to:
chr1 10 20 30423 73476 34656.5
chr1 20 30 40556 43657 90848
So if any Python users have some tips on converting rows to columns, that would be really helpful, even if you don't have the time to solve this specific problem. I'm finding row-to-column conversion to be particularly tricky, especially when it's conditional on the value in a column (here column 0).
Please let me know if I can clarify the problem. Any help or comments appreciated.
So update: thanks for all your comments, here is what I have come up with so far:
ListofData = [] # make list
individual=1 # only interested in first individual to get list of windows for the chromosome
for line in file('/mnt/genotyping/Alex/wholegenome/LROH/LROHSplitbyChrom/Filtered_by_MappingQuality20/SimpleHomozygosityScore/HomozygosityStatisticsTameratsalllanesMinMQ20chr20'):
    line = line.rstrip()
    fields = line.split("\t")
    if "chr" in line: #avoids header
        if int(fields[0]) == individual:
            ListofData.extend(fields[2:5]) # add start, end and size of window to list
        else: # once iterated through windows, split the list into sets of three, making it one list per line
            lol = [ListofData[i:i+3] for i in range(0, len(ListofData), 3)] #list of lists divided into 3's
            smallcounter = 0
            for i in lol: #for set of 3 in list
                for line in file('/mnt/genotyping/Alex/wholegenome/LROH/LROHSplitbyChrom/Filtered_by_MappingQuality20/SimpleHomozygosityScore/HomozygosityStatisticsTameratsalllanesMinMQ20chr20'):
                    if "chr" in line: # avoids header
                        line = line.rstrip()
                        fields = line.split("\t")
                        if str(fields[2]) == lol.pop(0): #if start position in line matches start position in i
                            i.extend(fields[9]) #add homozygosity score to list
                            counter = counter + 1
                        if smallcounter == 20: #if gone through all individuals in file
                            smallcounter = 0 #reset counter for next try
                            print i
I went through the file to get the information I wanted in columns 2-4 and put it in a list. Then I broke this list up into groups of 3, each of which corresponds to one line. In the second loop I am trying to say: for each set of 3 in the list (so for each list in the list), go through the file, and if the first position in the list is the same as the start position in the file (fields[2]), add the score in fields[9] to that list. Then all I would need to do is print the lists one after the other to get what I am after. However, I am having difficulty with this line:
if str(fields[2]) == lol.pop(0):
I want Python to look at the first position in the list, which was originally fields[2], and ask whether it is the same as the fields[2] position in the line it is looping through. If it is, it should append fields[9] to the list.
Let me know if I need to explain that better.
Thank you very much in advance, your help is really appreciated!
