I have two files.
One file has two columns (let's call it db), and the other has one column (let's call it in).
The second column in db holds the same kind of value as the column in in, and both files are sorted by that column.
db for example:

RPL24P3 NG_002525   
RPLP1P1 NG_002526   
RPL26P4 NG_002527
VN2R11P NG_006060   
VN2R12P NG_006061   
VN2R13P NG_006062   
VN2R14P NG_006063

in for example:

NG_002527
NG_006062

I want to read through these files and get the output as follows:

NG_002527: RPL26P4
NG_006062: VN2R13P

In other words, I iterate over the lines of in and try to find the matching line in db for each one.
The code I have written for that is:

    with open(db_file, 'r') as db, open(sortIn, 'r') as inF, open(out_file, 'w') as outF:
        for line in inF:
            for dbline in db:
                if len(dbline) > 1:
                    dbline = dbline.split('\t')
                    if line.rstrip('\n') ==  dbline[db_specifications[0]]:
                        outF.write(dbline[db_specifications[0]] + ': ' + dbline[db_specifications[1]] + '\n')
                        break

*db_specifications isn't relevant to this problem, so I didn't copy the code that defines it; the problem doesn't lie there.

The current code finds a match and writes it as planned only for the first line of in; it finds no matches for the remaining lines. I suspect it has to do with break, but I can't figure out what to change.

  • break breaks out of your for loop. Have you tried removing break and seeing if that fixes your code? Commented Aug 9, 2020 at 12:23
  • Shouldn't it break only the inner loop? I did try that, but without the break statement I scan the entire db file for the same in line, and by the time I move on to the next in line I have already read all of db. I don't want to use db.seek(0) and rescan it, because the db file is supposed to be huge. Commented Aug 9, 2020 at 12:27
  • To clarify: in Python, break exits the loop, and execution continues with the first statement after the loop body. In nested loops, a break inside the inner loop exits only the inner loop; the outer loop keeps running. Commented Aug 9, 2020 at 16:28
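The behaviour discussed in these comments can be checked with a small sketch. It uses io.StringIO as a stand-in for an open file, and the helper name find_in_order is made up for this illustration. It shows that break leaves only the inner loop, and that the inner file-like object keeps its position between outer iterations instead of restarting.

```python
import io

def find_in_order(keys, handle):
    """For each key, count how many lines of `handle` are read while searching.

    `handle` is any file-like iterator; it is NOT rewound between keys.
    """
    results = []
    for key in keys:
        consumed = 0
        for line in handle:          # resumes where the previous search stopped
            consumed += 1
            if line.strip() == key:
                break                # leaves only this inner loop
        results.append((key, consumed))
    return results

handle = io.StringIO("a\nb\nc\nd\ne\n")
print(find_in_order(["b", "d"], handle))  # [('b', 2), ('d', 2)]
```

One consequence worth noting: if a key is not present at all, the inner loop runs to the end of the handle, and every later key then reads zero lines. Whether that is what happens in the question's code depends on details not shown there.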

2 Answers

Since the data in db_file is sorted by the second column, you can stop reading it as soon as the last wanted value has been seen:

    with open("xyz.txt", "r") as db_file, open("abc.txt", "r") as sortIn, open("out.txt", "w") as outF:

        # first read the sortIn file as a list, skipping blank lines
        i_list = [line.strip() for line in sortIn if line.strip()]
        wanted = set(i_list)  # a set gives fast membership tests

        # for each record read from the file, split the values into key and value
        for line in db_file:
            fields = line.split()  # splits on any whitespace (spaces or tabs)
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            t_key, t_val = fields[0], fields[1]

            # if the value is one of the wanted values, write it to the output file
            if t_val in wanted:
                outF.write(t_val + ': ' + t_key + '\n')

            # if the value has reached the max value in the sorted list,
            # then you don't need to read the db_file anymore
            if i_list and t_val == i_list[-1]:
                break

The output file will have the following items:

NG_002527: RPL26P4
NG_006062: VN2R13P

In the above code, we read the whole sortIn file into a list first, then read db_file one line at a time. Because sortIn is also sorted in ascending order, i_list[-1] holds its largest value, so nothing after that value in db_file can match.

The above code does less I/O than the earlier version below, because it stops reading db_file once the last value has been matched.
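Because both files are sorted by the same column, a single pass over each is also possible without loading either into memory. The following is only a sketch of that merge-style idea, not the code from this answer. The function name merge_lookup is hypothetical, and it assumes each db line is "name, whitespace, value" and that plain string comparison matches the sort order of the files (true for fixed-width IDs like those in the example).

```python
def merge_lookup(db_lines, key_lines):
    """Yield 'value: name' for each key found, reading both inputs once.

    Assumes both inputs are sorted ascending by value under string comparison.
    """
    db_iter = iter(db_lines)
    fields = None                      # current db row, kept between keys
    for raw_key in key_lines:
        key = raw_key.strip()
        if not key:
            continue                   # skip blank lines in the key file
        while True:
            if fields is None:
                line = next(db_iter, None)
                if line is None:
                    return             # db exhausted: no later key can match
                fields = line.split()
                if len(fields) < 2:
                    fields = None      # skip blank or malformed db lines
                    continue
            name, value = fields[0], fields[1]
            if value < key:
                fields = None          # db is behind the key: advance db
            elif value == key:
                yield f"{key}: {name}"
                break                  # keep the row; the next key is compared to it
            else:
                break                  # db is ahead: this key is missing, try the next key

db = ["RPL24P3 NG_002525   \n", "RPLP1P1 NG_002526   \n", "RPL26P4 NG_002527\n",
      "VN2R11P NG_006060   \n", "VN2R12P NG_006061   \n", "VN2R13P NG_006062   \n",
      "VN2R14P NG_006063\n"]
keys = ["NG_002527\n", "NG_006062\n"]
print(list(merge_lookup(db, keys)))  # ['NG_002527: RPL26P4', 'NG_006062: VN2R13P']
```

Open file objects can be passed directly as db_lines and key_lines. A key that is absent from db is skipped without consuming the rest of the file, but if db contains the same value on several rows, this sketch reports only the first.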

=========== previous answer submission:

Based on how the data is stored in db_file, it looks like we have to read the entire file to check it against the sortIn file. If the values in db_file were sorted by the second column, we could stop reading the file once the last item in sortIn was found.

Assuming that we need to read all records from both files, see if the code below works for you.

    with open("xyz.txt", "r") as db_file, open("abc.txt", "r") as sortIn, open("out.txt", "w") as outF:

        # read the db_file and convert it into a dictionary
        # (split() handles spaces or tabs; lines without two fields are skipped)
        d_list = dict(line.split()[:2] for line in db_file if len(line.split()) >= 2)

        # read the sortIn file as a list, skipping blank lines
        i_list = [line.strip() for line in sortIn if line.strip()]

        # check if the value of each entry in d_list is one of the items in i_list
        out_list = [v + ': ' + k for k, v in d_list.items() if v in i_list]

        # out_list is your final list that needs to be written into a file
        # now read out_list and write each item into the file
        for i in out_list:
            outF.write(i + '\n')

The output file will have the following items:

NG_002527: RPL26P4
NG_006062: VN2R13P

To help you, I have also printed the contents of d_list, i_list, and out_list.

The contents in d_list will look like this:

{'RPL24P3': 'NG_002525', 'RPLP1P1': 'NG_002526', 'RPL26P4': 'NG_002527', 'VN2R11P': 'NG_006060', 'VN2R12P': 'NG_006061', 'VN2R13P': 'NG_006062', 'VN2R14P': 'NG_006063'}

The contents in i_list will look like this:

['NG_002527', 'NG_006062']

The contents that get written into the outF file from out_list will look like this:

['NG_002527: RPL26P4', 'NG_006062: VN2R13P']
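The dictionary above is keyed by the first column, so every entry has to be checked against i_list. As a variation (not part of the original answer), the mapping can be keyed by the second column instead, which lets each wanted value be looked up directly. This is a rough sketch; the function name lookup_by_value is hypothetical, and it still reads all of db into memory, so it only suits a db that fits in RAM.

```python
def lookup_by_value(db_lines, key_lines):
    """Return 'value: name' strings for each key present in db, in key order."""
    by_value = {}
    for line in db_lines:
        fields = line.split()
        if len(fields) >= 2:
            # keep the first name seen if a value appears on several rows
            by_value.setdefault(fields[1], fields[0])

    results = []
    for raw_key in key_lines:
        key = raw_key.strip()
        if key in by_value:            # dictionary lookup instead of a list scan
            results.append(key + ': ' + by_value[key])
    return results

db = ["RPL26P4 NG_002527\n", "VN2R13P NG_006062   \n", "VN2R14P NG_006063\n"]
print(lookup_by_value(db, ["NG_002527\n", "NG_006062\n"]))
# ['NG_002527: RPL26P4', 'NG_006062: VN2R13P']
```

Unlike the sorted-file approaches, this does not depend on either file being sorted.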

2 Comments

The values in the db file are indeed sorted by the second column
Updated the code to reflect the feedback from your comments. In the older code, I assumed that each line in db_file is unique. If db_file contains duplicates, the dictionary entry gets overwritten with the newer value. In the newer code, I write directly to the file, so we don't have that problem.

I was able to solve the problem by inserting the following line:
line = next(inF) before the break statement.
