Python: Converting rows to column in a file

Question

I'm a Python beginner trying to write a script that converts specified rows into columns, using a tab-delimited text file as input. Here is an example of lines in the file:

1   chr1    1008376 1258657 250281  4628    666 2832    565 16.6323226376   83.3676773624
1   chr1    1258657 1516806 258149  2544    601 1481    231 13.4929906542   86.5070093458
1   chr1    1516806 1766886 250080  1652    590 936 63  6.30630630631   93.6936936937
1   chr1    1766886 2017159 250273  5030    1608    2698    362 11.8300653595   88.1699346405

Essentially, the file goes through a list of regions (columns 2-3) in the chromosome (column 1) of an individual (column 0) and gives a statistic calculated for that region (column 9). The file first lists all the regions for individual 1, then individual 2, and so on until the final individual. There are 20 individuals in the file.

I'd like a new file that does not include columns 0 or 4-8 and has new columns holding the scores for the region in that row (now columns 1-2) for each individual. So column 3 would be what was previously column 9 for individual 1, column 4 would be the score for that region in individual 2, and so on. Each row would then have the chromosome (chr1, previously column 1) as column 0, the region in columns 1-2, and after that 20 columns with the scores for each of the 20 individuals.

Currently the scores are in rows, so the file has a lot of rows. The values in columns 1-3 are identical for every individual, so there is no issue of regions not overlapping. All individuals also have the same number of rows. In other words, columns 2 and 3 are duplicated 20 times in the file.

If that explanation is too complicated or dense, below is a stripped-down example to illustrate the problem.

Here is a simple dummy example of what I would like:

Original file:

1 chr1 10 20 30423
1 chr1 20 30 40556
2 chr1 10 20 73476
2 chr1 20 30 43657
3 chr1 10 20 34656.5
3 chr1 20 30 90848

changed to:

chr1 10 20 30423 73476 34656.5
chr1 20 30 40556 43657 90848

So if any Python users have tips on converting rows to columns, that would be really helpful, even if you don't have the time to solve this specific problem. I'm finding row-to-column conversion to be a particularly tricky problem, especially when it's conditional on the value in a column (here column 0).

Please let me know if I can clarify the problem. Any help or comments appreciated.

So update: thanks for all your comments, here is what I have come up with so far:

ListofData = [] # make list
individual=1 # only interested in first individual to get list of windows for the chromosome
for line in file('/mnt/genotyping/Alex/wholegenome/LROH/LROHSplitbyChrom/Filtered_by_MappingQuality20/SimpleHomozygosityScore/HomozygosityStatisticsTameratsalllanesMinMQ20chr20'): 
    line = line.rstrip() 
    fields = line.split("\t")
    if "chr" in line: #avoids header 
        if int(fields[0]) == individual:
            ListofData.extend(fields[2:5]) # add start, end and size of window to list

        else: # once iterated through windows, split the list into sets of three, making it one list per line
            lol = [ListofData[i:i+3] for i in range(0, len(ListofData), 3)] #list of lists divided into 3's

smallcounter = 0
for i in lol: #for set of 3 in list
    for line in file('/mnt/genotyping/Alex/wholegenome/LROH/LROHSplitbyChrom/Filtered_by_MappingQuality20/SimpleHomozygosityScore/HomozygosityStatisticsTameratsalllanesMinMQ20chr20'):
        if "chr" in line: # avoids header 
            line = line.rstrip() 
            fields = line.split("\t")
            if str(fields[2]) == lol.pop(0): #if start position in line matches start position in i
                i.extend(fields[9]) #add homozygosity score to list
                counter = counter + 1
            if smallcounter == 20: #if gone through all individuals in file
                smallcounter = 0 #reset counter for next try
                print i

I went through the file to get the information I wanted in columns 2-4 and put it in a list. Then I broke this list up into groups of 3, which correspond to each line. In the second loop I am trying to say: for each set of 3 in the list (so for each list in the list), go through the file, and if the first position in the list is the same as the start position in the file (fields[2]), add the score in fields[9] to that list. Then all I would need to do is print the lists one after the other to get what I am after. However, I am having difficulty with the line:

if str(fields[2]) == lol.pop(0):

I want python to look at the first position in the list, which is originally fields[2] and ask if that is the same as the fields[2] position in the line it is looping through. If it is then it should append fields[9] to the list.

Let me know if I need to explain that better.

Thank you very much in advance, your help is really appreciated!

As a suggestion, you might want to edit this post and show some work at the Python command line. On Stack Overflow, showing some work you've done always helps.
– octopusgrabbus, Commented Jun 10, 2012 at 14:08
Thanks, I do try to do that when I have some script I've been working on, but unfortunately here I don't know where to start. I've been using 'join' in bash for similar tasks before, but that was merging data from separate files. Here, reading from one file, I'm not sure where to start. But as I work on it I'll post what I can come up with. Thanks
– user964689, Commented Jun 10, 2012 at 14:21
@user964689 I've edited my answer to take your comment into account.
– octopusgrabbus, Commented Jun 10, 2012 at 14:27
There is no column 10, assuming that the first column is column 0.
– user964689, Commented Jun 10, 2012 at 16:01
So I had a go myself, not using the csv module yet, but I see that might be useful. Any help appreciated, it's still not totally complete.
– user964689, Commented Jun 15, 2012 at 15:29

octopusgrabbus · Accepted Answer · 2012-06-10 14:27:02Z

4

It is difficult to start working with a new language, and you have to start somewhere. Fortunately, you've chosen Python, and you have a Python command line. Using that, you can test out how you would create columns, and so on.

First, you need to read in your input file, and process the information in each row. The Python CSV module is excellent. I've used it all over the place in a water utility project, and subsequently in a lot of other projects that required .csv processing.

But you have a tab-delimited file. I have never tried setting the delimiter to tab and verifying that it works with a tab-delimited file. If that does not work (and you can test it at the Python command line), a workaround is to pipe the tab-delimited file through sed and convert the tabs to commas.
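A quick check of the tab delimiter might look like the sketch below. It is written for Python 3 and reads from an in-memory string rather than a real file; the `sample` data is made up for illustration, and `csv.reader` with `delimiter="\t"` is the standard-library call being tried out.

```python
import csv
import io

# Two tab-delimited rows standing in for the real file (made-up sample data).
sample = "1\tchr1\t10\t20\t30423\n2\tchr1\t10\t20\t73476\n"

# csv.reader takes a one-character delimiter; "\t" splits on tabs.
reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = [row for row in reader]

for row in rows:
    print(row)
```

With a real file, the `io.StringIO(sample)` object would be replaced by the file object returned from `open(...)`.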

As for representing rows and columns, in Python you will need a list of lists. That is, you will need something like [[1,2],[3,4],...].

Lists in Python are mutable, so you can append to them. You would initialize your list of lists to an empty list:

lol = []

Then you would need to add a list to lol depending on the number of columns you want across. Say you were putting together two-column rows with just numbers, as an exercise, you would do this:

lol.append([1,2])
lol.append([3,4])
lol.append([5,6])

>>> lol
[[1, 2], [3, 4], [5, 6]]
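Applied to the dummy example in the question, the same append pattern could be used roughly as follows. This is only a sketch in Python 3: the `records` list is the question's dummy data typed in by hand, and the `index` dictionary is a helper introduced here (not part of the original answer) to remember which inner list belongs to which region.

```python
# Dummy rows from the question: individual, chromosome, start, end, score.
records = [
    ["1", "chr1", "10", "20", "30423"],
    ["1", "chr1", "20", "30", "40556"],
    ["2", "chr1", "10", "20", "73476"],
    ["2", "chr1", "20", "30", "43657"],
    ["3", "chr1", "10", "20", "34656.5"],
    ["3", "chr1", "20", "30", "90848"],
]

lol = []    # output rows, one inner list per region
index = {}  # hypothetical helper: (chrom, start, end) -> position of that row in lol

for individual, chrom, start, end, score in records:
    region = (chrom, start, end)
    if region not in index:
        # First time this region is seen: start a new row with its coordinates.
        index[region] = len(lol)
        lol.append([chrom, start, end])
    # Every individual adds one more score column to the region's row.
    lol[index[region]].append(score)

for row in lol:
    print("\t".join(row))
```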

edited Jun 10, 2012 at 14:27

answered Jun 10, 2012 at 14:07

octopusgrabbus


4 Comments

Martijn Pieters Over a year ago

The python csv module can work with tab-delimited files just fine. Take a look at the dialect section of the documentation for more info. It could be that the excel-tab pre-defined dialect works for the OP out of the box.

octopusgrabbus Over a year ago

@MartijnPieters Thanks for following up with this. I don't like recommending things I have not tried. All I deal with is .csv format, from property assessments to water reads.

user964689 Over a year ago

thanks guys, I will play around with this and let you know how it goes

user964689 Over a year ago

i had a go myself, added it to my question, any comments/edits most appreciated! thanks

malenkiy_scot · Accepted Answer · 2012-06-10 14:49:06Z

Here is some code to give you ideas on what can be done. I'll omit bells and whistles (for example, the first three if's can be done more gracefully in a loop) and present just bare-bones code. I'm reading from the file 'chr.txt' and writing to stdout:

def readTabbedFile(filename):
    # out[chromosome][start][end] is the list of scores, one per individual
    out = {}
    with open(filename, 'r') as infile:
        for line in infile:
            line = line.rstrip('\n\r')
            parsedLine = line.split('\t')
            if not parsedLine[1] in out:
                out[parsedLine[1]] = {}
            if not parsedLine[2] in out[parsedLine[1]]:
                out[parsedLine[1]][parsedLine[2]] = {}
            if not parsedLine[3] in out[parsedLine[1]][parsedLine[2]]:
                out[parsedLine[1]][parsedLine[2]][parsedLine[3]] = []

            out[parsedLine[1]][parsedLine[2]][parsedLine[3]].append(parsedLine[9])

    for key0 in out.keys():
        for key1 in out[key0].keys():
            for key2 in out[key0][key1].keys():
                # one output line per region: chromosome, start, end, then all scores
                outStr = key0 + "\t" + key1 + "\t" + key2
                for val in out[key0][key1][key2]:
                    outStr += "\t" + val
                print(outStr)

if __name__ == '__main__':
    readTabbedFile("chr.txt")
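One possible way to avoid the three nested membership checks is to key a single dictionary on the (chromosome, start, end) tuple and use `setdefault`. The sketch below is written for Python 3 and is not part of the original answer; `pivot_scores`, `score_col`, and the sample `lines` are names and data made up for illustration, and the "chr" test for skipping a header is borrowed from the question's own code as an assumption about the file.

```python
from collections import OrderedDict

def pivot_scores(lines, score_col=9):
    """Group scores by (chromosome, start, end), keeping first-seen region order."""
    out = OrderedDict()
    for line in lines:
        fields = line.rstrip("\n\r").split("\t")
        # Skip blank, short, or header lines (assumes data rows have "chr" in column 1).
        if len(fields) <= score_col or "chr" not in fields[1]:
            continue
        region = (fields[1], fields[2], fields[3])
        # setdefault creates the empty list the first time a region is seen.
        out.setdefault(region, []).append(fields[score_col])
    return ["\t".join(region + tuple(scores)) for region, scores in out.items()]

# Made-up rows with the same 11-column layout as the question's file.
lines = [
    "1\tchr1\t10\t20\t10\t0\t0\t0\t0\t16.6\t83.4",
    "1\tchr1\t20\t30\t10\t0\t0\t0\t0\t13.5\t86.5",
    "2\tchr1\t10\t20\t10\t0\t0\t0\t0\t6.3\t93.7",
    "2\tchr1\t20\t30\t10\t0\t0\t0\t0\t11.8\t88.2",
]

result = pivot_scores(lines)
for row in result:
    print(row)
```

Because an open file object is also an iterable of lines, `pivot_scores(open("chr.txt"))` should work the same way on a real file.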

lavee_singh · Accepted Answer · 2015-05-09 12:23:38Z

1

You can relate your problem to using a list comprehension to convert rows into columns in a matrix.
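The original illustration for this answer is not available, so the following is only a guess at the kind of list comprehension meant, written for Python 3. The `matrix` holds the scores from the question's dummy example, one row per individual; the `zip(*matrix)` line is an equivalent standard idiom added for comparison.

```python
# One row per individual, one column per region (scores from the dummy example).
matrix = [
    [30423, 40556],
    [73476, 43657],
    [34656.5, 90848],
]

# Transpose with a nested list comprehension: column i of every row becomes row i.
transposed = [[row[i] for row in matrix] for i in range(len(matrix[0]))]

# The same result using zip, which pairs up the i-th elements of all rows.
transposed_zip = [list(column) for column in zip(*matrix)]

print(transposed)
```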


answered May 9, 2015 at 12:23

lavee_singh

