Summarizing 2 text files into one file in Python

Question

I have 2 files called big and small like these examples:

big:

chr1    transcript      2481359 2483515 -       RP3-395M20.8
chr1    transcript      2487078 2492123 +       TNFRSF14
chr1    transcript      2497849 2501297 +       RP3-395M20.7
chr1    transcript      2512999 2515942 +       RP3-395M20.9
chr1    transcript      2517930 2521041 +       FAM213B
chr1    transcript      2522078 2524087 -       MMEL1

small:

chr1    2487088 2492113 17
chr1    100757323       100757324       19
chr1    2487099 2492023 21
chr1    100758316       100758317       41
chr1    2514000 2515742 14

I am trying to make a new file with 5 columns from the big file, subject to the following conditions:

conditions:

1- the 1st column of the small file == the 1st column of the big file
2- the 4th column of the big file >= the 2nd column of the small file >= the 3rd column of the big file
3- the 4th column of the big file >= the 3rd column of the small file >= the 3rd column of the big file

columns in output file:

1) 1st column of big file
2) 3rd column of big file
3) 4th column of big file
4) the number of lines in the small file that meet the conditions above
5) 6th column of big file

Here is the expected output for the above example:

chr1    2487078 2492123 2       TNFRSF14
chr1    2512999 2515942 1       RP3-395M20.9

I wrote the following code in Python. It does not return the file that I want, although every line seems logical to me. Would you help me fix it?

def correspond(big, small, outfile):
    count = 0
    big = open(big, "r")
    small = open(small, "r")
    big_list = []
    small_list = []
    for m in big:
        big_list.append(m)
    for n in small:
        small_list.append(n)
    final = []
    for i in range(0, len(small_list)):
        for j in range(0, len(big_list)):
            small_row = small_list[i]
            big_row = big_list[j]
            small_columns = small_row.split()
            big_columns = big_row.split()
            small_symbol = small_columns[0]
            big_symbol = big_columns[0]
            name = big_columns[5]
            if small_symbol == big_symbol:
                small_second_col = small_columns[1]
                small_third_col = small_columns[2]
                min_range = big_columns[2]
                max_range = big_columns[3]
                if (small_second_col <= max_range and small_second_col >= min_range and small_third_col <= max_range and small_third_col >= min_range):
                        count+=1
                        new_line = small_row.rstrip("\n") + " " + big_symbol + " " + min_range + " " + max_range + str(count) + name
                        final.append(new_line)
    with open(outfile, "w") as f:
        for item in final:
            f.write("%s\n" % item)

Surprisingly like earlier questions this week (!), such as stackoverflow.com/q/53174083/2564301. Are y'all in the same class?
– Jongware, Commented Nov 13, 2018 at 22:08
Learn bedtools or pybedtools. You're trying to do an intersection of .bed files, and that's why (py)bedtools was developed. See stackoverflow.com/questions/52998160/…
– mRotten, Commented Nov 13, 2018 at 23:27

Dalvenjia · Accepted Answer · 2018-11-13 23:24:14Z

Full working solution, no pandas:

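from itertools import product


def str_or_int(item):
    # Convert numeric fields to int so comparisons are numeric, not lexicographic
    try:
        return int(item)
    except ValueError:
        return item

def format_row(line, count):
    # Output: chromosome, start, end, match count, name (as in the expected output)
    cols = line.split()
    return '\t'.join((cols[0], cols[2], cols[3], str(count), cols[5])) + '\n'

def correspond(big, small, output):
    with open(big, 'r') as bigf, open(small, 'r') as smallf, open(output, 'w') as outputf:
        current = None
        count = 0
        # product() pairs every big line with every small line; the big line
        # varies slowest, so all pairs for one big line are consecutive
        for b_line, s_line in product(filter(lambda x: x != '\n', bigf), filter(lambda x: x != '\n', smallf)):
            if b_line != current:
                # A new big line starts: write the previous one if it had matches
                if count > 0:
                    outputf.write(format_row(current, count))
                current = b_line
                count = 0
            b_line = [str_or_int(s) for s in b_line.split()]
            s_line = [str_or_int(s) for s in s_line.split()]
            try:
                if b_line[0] == s_line[0] and b_line[3] >= s_line[1] >= b_line[2] and b_line[3] >= s_line[2] >= b_line[2]:
                    count += 1
            except IndexError:
                continue
        # Write the last big line, which the loop itself never reaches
        if count > 0:
            outputf.write(format_row(current, count))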

Ask in comments if you have questions
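
One latent problem in the question's code is that fields returned by split() are strings, so its range checks compare text rather than numbers. The sample data happens not to expose this, but other coordinates would. The str_or_int helper above avoids it by converting numeric fields first. A minimal illustration, using two values taken from the sample files (the variable names are only for this sketch):

```python
# Fields produced by str.split() are strings, so <= compares them
# character by character (lexicographically), not numerically.
small_start = "100757323"   # a start position from the small file
big_end = "2492123"         # an end position from the big file

as_strings = small_start <= big_end           # True: "1" sorts before "2"
as_ints = int(small_start) <= int(big_end)    # False: 100757323 > 2492123

print(as_strings, as_ints)
```

Converting with int() before comparing gives the numeric result the conditions in the question call for.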

blhsing · Accepted Answer · 2018-11-13 23:03:26Z

Given your sample input like this:

big = '''chr1    transcript      2481359 2483515 -       RP3-395M20.8
chr1    transcript      2487078 2492123 +       TNFRSF14
chr1    transcript      2497849 2501297 +       RP3-395M20.7
chr1    transcript      2512999 2515942 +       RP3-395M20.9
chr1    transcript      2517930 2521041 +       FAM213B
chr1    transcript      2522078 2524087 -       MMEL1'''

small = '''chr1    2487088 2492113 17
chr1    100757323       100757324       19
chr1    2487099 2492023 21
chr1    100758316       100758317       41
chr1    2514000 2515742 14'''

big, small = ([l.split() for l in d.splitlines()] for d in (big, small))

You can use sum with a generator expression to count the number of lines in small matching the criteria, and then use str.join to produce your desired output:

for name_big, _, low, high, _, note in big:
    count = sum(1 for name_small, n1, n2, _ in small if name_big == name_small and all(int(low) <= int(n) <= int(high) for n in (n1, n2)))
    if count:
        print('\t'.join((name_big, low, high, str(count), note)))

This outputs:

chr1    2487078 2492123 2   TNFRSF14
chr1    2512999 2515942 1   RP3-395M20.9
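
To work on files rather than inline strings, the same counting idea could be wrapped in a function. This is a minimal sketch and not part of the answer above: the name count_contained and its parameters are hypothetical, and it assumes whitespace-separated columns laid out as in the question's samples, with a small file that fits in memory.

```python
def count_contained(big_path, small_path, out_path):
    """For each big interval, count the small intervals lying fully inside it."""
    # Read the small file once, keeping chromosome, start and end as (str, int, int)
    with open(small_path) as small_f:
        rows = (line.split() for line in small_f if line.strip())
        small = [(chrom, int(start), int(end)) for chrom, start, end, *_ in rows]

    with open(big_path) as big_f, open(out_path, "w") as out_f:
        for line in big_f:
            if not line.strip():
                continue  # skip blank lines
            chrom, _, low, high, _, name = line.split()
            lo, hi = int(low), int(high)
            # Both ends of the small interval must fall within [lo, hi]
            count = sum(
                1
                for s_chrom, start, end in small
                if s_chrom == chrom and lo <= start <= hi and lo <= end <= hi
            )
            if count:
                out_f.write("\t".join((chrom, low, high, str(count), name)) + "\n")
```

It would be called as count_contained("big", "small", "out.txt"). Each big line is compared against every small line, which is fine for small inputs; for genome-scale files the bedtools suggestion in the comments is likely the better route.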
