Loop through pair of files python

Question

I have a script that receives two files as input and creates a dictionary based on lines. Finally, it overwrites the first file.

I am looking for a way to run this script on all file pairs of a folder, choosing as sys.argv[1] and sys.argv[2] based on a pattern in the name.

import re
import sys

datafile = sys.argv[1]
schemaseqs = sys.argv[2]

datafile_lines = []
d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()]=0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i+=1

new_d = {}
with open(schemaseqs, 'r') as f:
    i=0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()]=0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i+=1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]

print(d)

with open(datafile,'w') as filee:
    for k,v in d.items():
        filee.writelines(k)
        filee.writelines('\n')
        filee.writelines(v)
        filee.writelines('\n')

I have hundreds of file pairs all sharing the same pattern proteinXXXX (where XXXX is a number) This number can have up to four digits (e.g. 9,99,999 or 9999). So I have protein 555.txt and protein 555.fasta

I've seen I can use glob or os.listdir to read files from a directory. However, I cannot assign them to a variable and extract the lines one pair at a time in every pair of the directory.

Any help is appreciated.

are you reading files from single folder or multiple folders? — deadshot
– deadshot, Commented May 8, 2020 at 20:37

dlask · Accepted Answer · 2020-05-08 21:39:48Z

2

Just the concept.

Import required libraries.

import glob
import os.path

Define function that extracts only the basename (the part without extension) from filename.

def basename(fn):
    return os.path.splitext(os.path.basename(fn))[0]

Create two sets, one with .txt files, another with .fasta files.

t = {basename(fn) for fn in glob.glob("protein*.txt")}
f = {basename(fn) for fn in glob.glob("protein*.fasta")}

Calculate intersection of these two sets to be sure that both .txt and .fasta files exist with the same basename. Then add the missing suffixes and let them process with the existing code.

for bn in t.intersection(f):
    process(bn + ".txt", bn + ".fasta")

answered May 8, 2020 at 21:39

dlask

9,0421 gold badge29 silver badges31 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Loop through pair of files python

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related