-1

I have 2 big text files like the following small examples. There are two files (major and minor). In both major and minor files there are 4 columns. In the major file the difference between 2nd and 3rd columns is 10000 and the difference between 2nd and 3rd columns is 32 or 31 or a number close to 31 but not so high.

small example of major file:

chr4    530000  540000  0.0
chr4    540000  550000  1719.0
chr4    550000  560000  0.0

small example of minor file:

chr4    295577  295608  12
chr4    323326  323357  10
chr4    548873  548904  32
chr4    548873  548904  20
chr4    549047  549078  32
chr4    549047  549078  20
chr4    549137  549168  32
chr4    549137  549168  20
chr4    549181  549212  32
chr4    549181  549212  20
chr4    549269  549300  22
chr4    549269  549300  381
chr4    549269  549300  67
chr4    549269  549300  89
chr4    549269  549300  95
chr4    549269  549300  124
chr4    549269  549300  149
chr4    549269  549300  87
chr4    549269  549300  33
chr4    549269  549300  65
chr4    549269  549300  68
chr4    549269  549300  190
chr4    549269  549300  20
chr4    549355  549386  32
chr4    549355  549386  20
chr4    549443  549474  16
chr4    705810  705841  10
chr4    846893  846924  28

I want to make a new text file in which there would be 4 columns. Like the expected output:

chr4    548873  548904  32  chr4    540000  550000
chr4    548873  548904  20  chr4    540000  550000
chr4    549047  549078  32  chr4    540000  550000
chr4    549047  549078  20  chr4    540000  550000
chr4    549137  549168  32  chr4    540000  550000
chr4    549137  549168  20  chr4    540000  550000
chr4    549181  549212  32  chr4    540000  550000
chr4    549181  549212  20  chr4    540000  550000
chr4    549269  549300  22  chr4    540000  550000
chr4    549269  549300  381 chr4    540000  550000
chr4    549269  549300  67  chr4    540000  550000
chr4    549269  549300  89  chr4    540000  550000
chr4    549269  549300  95  chr4    540000  550000
chr4    549269  549300  124 chr4    540000  550000
chr4    549269  549300  149 chr4    540000  550000
chr4    549269  549300  87  chr4    540000  550000
chr4    549269  549300  33  chr4    540000  550000
chr4    549269  549300  65  chr4    540000  550000
chr4    549269  549300  68  chr4    540000  550000
chr4    549269  549300  190 chr4    540000  550000
chr4    549269  549300  20  chr4    540000  550000
chr4    549355  549386  32  chr4    540000  550000
chr4    549355  549386  20  chr4    540000  550000
chr4    549443  549474  16  chr4    540000  550000

The first 4 columns are from the minor file and the last 3 columns are from the major file.

The number in the 2nd and 3rd columns (from minor file) are in the range of the same row but columns 6 and 7 (from major file) and 1st column is equal to the 5th column (in fact the 1st columns of both major and minor files).

I want to look for the rows in minor file in which the first column is equal to the 1st column of major file, also 2nd and 3rd columns of the same row (in minor file) must be in a range of 2nd and the 3rd columns in the major file.

So in fact there are 3 conditions for every row in the minor file to be eligible to be included in the output file.

The last 3 columns are from the major file which fit the rows from minor file.

I am trying to do that in Python and have made the following code, but it does not return what I expected:

major = open("major.txt", 'rb')
minor = open("minor.txt", 'rb')
major_list = []
minor_list = []
for m in major:
    major_list.append(m)

for n in minor:
    minor_list.append(n)

final = []
for i in minor_list:
    for j in major_list
        if minor_list[i] == major_list[j] and minor_list[i+1] <= major_list[j+1] and minor_list[i+2] >= major_list[j+2]:
        final.append(i)


with open('output.txt', 'w') as f:
    for item in final:
        f.write("%s\n" % item)
7
  • 1
    Can you split up your 3 conditions into bullet points so its easier to understand? Having a bit of a tough time trying to see the requirements for the output Commented Oct 25, 2018 at 21:16
  • 1
    Can you show a different example that better demonstrates the requirements? According to the example you've shown you could just append chr4 540000 550000 to every line from the "minor" file... Commented Oct 25, 2018 at 21:19
  • What is the condition that selects 540000 550000 for output? From the example, as pointed out, one could just append every line. Commented Oct 25, 2018 at 21:21
  • 1
    These are .bed files. Looks like you're annotating probe sequences with genome ROIs. Have you tried intersecting the files via bedtools? Commented Oct 25, 2018 at 21:31
  • 3
    I'm voting to close this question as off-topic because EXACT DUP of stackoverflow.com/questions/52990973/… Commented Nov 1, 2018 at 15:18

3 Answers 3

2

These are .bed files and what you’re trying to do is called an intersection.

If you're on a linux or mac, or if you have access to one, you could install bedtools, which is worth it because you can do all of this in one line of code:

bedtools intersect -wa -wb -a minor_file -b major_file > new_text_file

In fact, interval intersections is exactly why bedtools was developed.

There is a python distribution of bedtools called 'pybedtools', but it's also only available for mac and linux, so I don’t think there’s much advantage to doing this in Python.

You could of course do everything in Python, but if you’re going to be doing any amount of bioinformatics, bedtools and GATK (genome analysis tool kit, from the Broad Institute, command-line interface only) are good reasons to do some things in shell. Also, you’ll need to sort your intervals at some point, so downstream manipulations work and don’t take forever. For that, it’s just all too quick to use the shell command ‘sort’ (sort -k1,1V -k2,2n -k3,3n [your_file] > [new_sorted_file]).

But GATK and the genome-specific tools (those requiring an indexed genome sequence file) in bedtools are the big reasons to do some things in shell. Similar to the bedtools intersect command, implementing that functionality in Python would require writing and debugging many lines of code, and the code would run a lot slower than just calling the appropriate command in bedtools and GATK.

Sign up to request clarification or add additional context in comments.

Comments

2

I am not sure I completely got what you were aiming for with the code, so I might have changed a few things you will have to change back, but this should help - the following piece of code outputs the desired output.

major = open("major.txt", "r")
minor = open("minor.txt", "r")
major_list = []
minor_list = []
for m in major:
    major_list.append(m)

for n in minor:
    minor_list.append(n)

final = []
for i in range(0, len(minor_list)):  # to iterate using the index
    for j in range(0, len(major_list)):

        minor_row = minor_list[i]
        major_row = major_list[j]

        minor_columns = minor_row.split()
        major_columns = major_row.split()

        minor_symbol = minor_columns[0]
        major_symbol = major_columns[0]

        if minor_symbol == major_symbol:
            minor_second_col = int(minor_columns[1])
            minor_third_col = int(minor_columns[2])
            min_range = int(major_columns[1])
            max_range = int(major_columns[2])

            if (minor_second_col <= max_range and minor_second_col >= min_range
                and minor_third_col <= max_range and minor_third_col >= min_range):
                new_line = minor_row.rstrip("\n") + " " + str(major_symbol) + " " + str(min_range) + " " + str(max_range)
                final.append(new_line)


with open("output.txt", "w") as f:
    for item in final:
        f.write("%s\n" % item)

You were iterating on the lines and comparing lines to each other-

m in major:

The split() function break the string of the line where there are white spaces, and returns the result in a list of strings. Check out the documentation for str.split, I think that would help quit a bit.

2 Comments

Hey Elinor, a slight bug in your code is that the comparisons aren't numeric, would need to wrap the numbers with a type constructor.
Hey Damel, thanks - I corrected it now.
0

Hi elly From what I understand of your logic I think the code below should get you off to on the right direction hope this helps.

import re

# read in files
with open("major.txt") as f:
    major = [x.strip("\n") for x in f.readlines()]

with open("minor.txt") as f:
    minor = [x.strip("\n") for x in f.readlines()]

# split into list of lists
p = re.compile(" +")
major = list(map(lambda x: p.split(x), major))
minor = list(map(lambda x: p.split(x), minor))

with open("output.txt", "w") as out:
    # uses the fact that the lists of lists contain 4 ite
    for major_col1, major_col2, major_col3, major_col4 in major:
        for minor_col1, minor_col2, minor_col3, minor_col4 in minor:
            if major_col1 == minor_col1:
                if int(major_col2) < int(minor_col2) and \
                   int(major_col2) < int(minor_col3) and \
                   int(major_col3) > int(minor_col2) and \
                   int(major_col3) > int(minor_col3):
                   out.write("{0:<10} {1:^8} {2:^8} {3:<8} {4:<10} {5:^8} {6:^}\n"
                             .format(minor_col1, minor_col2,
                                     minor_col3, minor_col4,
                                     major_col1, major_col2, 
                                     major_col3))

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.