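I have a script that takes two files as input and builds a dictionary from each one by pairing alternating lines (first line as key, next line as value). It then replaces values in the first dictionary with matches from the second and overwrites the first file.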

I am looking for a way to run this script on every pair of files in a folder, choosing sys.argv[1] and sys.argv[2] based on a pattern in the file names.

import re
import sys

datafile = sys.argv[1]
schemaseqs = sys.argv[2]

datafile_lines = []
d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()]=0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i+=1

new_d = {}
with open(schemaseqs, 'r') as f:
    i=0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()]=0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i+=1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]

print(d)

with open(datafile, 'w') as filee:
    for k, v in d.items():
        # write() takes a single string; str() also covers the 0 placeholder
        # left behind when the last key has no value line
        filee.write(k + '\n')
        filee.write(str(v) + '\n')

I have hundreds of file pairs, all sharing the same pattern proteinXXXX, where XXXX is a number of up to four digits (e.g. 9, 99, 999 or 9999). So, for example, I have protein 555.txt and protein 555.fasta.

I've seen that I can use glob or os.listdir to list the files in a directory. However, I cannot work out how to assign the matching files to variables and process the directory one pair at a time.

Any help is appreciated.

  • Are you reading files from a single folder or from multiple folders? Commented May 8, 2020 at 20:37
  • From two different folders, but I can merge them into one. Commented May 9, 2020 at 14:09

1 Answer

Just the concept.

Import required libraries.

import glob
import os.path

Define a function that extracts only the base name (the part without the directory and the extension) from a filename.

def basename(fn):
    return os.path.splitext(os.path.basename(fn))[0]
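
As a quick check of what this helper returns, here is a small sketch. The paths are made up for illustration; note that only the last extension is removed.

```python
import os.path


def basename(fn):
    # Drop the directory part, then drop the final extension.
    return os.path.splitext(os.path.basename(fn))[0]


print(basename("data/protein555.txt"))   # protein555
print(basename("protein 9.fasta"))       # protein 9
print(basename("protein12.tar.gz"))      # protein12.tar
```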

Create two sets of base names, one from the .txt files and one from the .fasta files.

t = {basename(fn) for fn in glob.glob("protein*.txt")}
f = {basename(fn) for fn in glob.glob("protein*.fasta")}
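
The pattern protein* also matches names such as protein_old.txt. If only "protein" followed by one to four digits should count, the base names could be filtered with a regular expression. This is a sketch, not part of the original answer: the helper name is_protein_name is hypothetical, and the optional space is an assumption based on the example name "protein 555.txt" in the question.

```python
import re

# Assumption: valid names are "protein", an optional space, then 1 to 4 digits.
NAME_RE = re.compile(r"protein ?\d{1,4}")


def is_protein_name(bn):
    # fullmatch() requires the whole base name to fit the pattern.
    return NAME_RE.fullmatch(bn) is not None


names = ["protein9", "protein 555", "protein12345", "protein_old", "proteinA1"]
print([n for n in names if is_protein_name(n)])  # ['protein9', 'protein 555']
```

The filter would then go into the set comprehensions as an extra condition, for example `if is_protein_name(basename(fn))`.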

Calculate the intersection of these two sets to make sure that both a .txt and a .fasta file exist for each base name. Then add the extensions back and pass each pair to the existing code, assumed here to be wrapped in a function process(datafile, schemaseqs).

for bn in t.intersection(f):
    process(bn + ".txt", bn + ".fasta")
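
Putting the pieces together, one possible sketch looks like this. It assumes the script from the question is wrapped in a function process(datafile, schemaseqs); the helpers read_pairs and process_folder, and the txt_dir and fasta_dir parameters, are names introduced here for illustration. Unlike the original script, read_pairs ignores a trailing key that has no value line. The two directory parameters cover the case from the comments where the .txt and .fasta files live in different folders.

```python
import glob
import os.path


def basename(fn):
    return os.path.splitext(os.path.basename(fn))[0]


def read_pairs(path):
    """Read alternating key/value lines into a dict (hypothetical helper)."""
    with open(path) as fh:
        lines = [line.strip() for line in fh]
    # zip() stops at the shorter slice, so an unpaired last line is dropped.
    return dict(zip(lines[0::2], lines[1::2]))


def process(datafile, schemaseqs):
    """Replace values of datafile that occur as keys in schemaseqs, in place."""
    d = read_pairs(datafile)
    new_d = read_pairs(schemaseqs)
    for key, value in d.items():
        if value in new_d:
            d[key] = new_d[value]
    with open(datafile, "w") as out:
        for k, v in d.items():
            out.write(f"{k}\n{v}\n")


def process_folder(txt_dir=".", fasta_dir="."):
    """Run process() on every base name that exists in both folders."""
    t = {basename(fn) for fn in glob.glob(os.path.join(txt_dir, "protein*.txt"))}
    f = {basename(fn) for fn in glob.glob(os.path.join(fasta_dir, "protein*.fasta"))}
    for bn in sorted(t & f):
        process(os.path.join(txt_dir, bn + ".txt"),
                os.path.join(fasta_dir, bn + ".fasta"))
```

Calling process_folder() with no arguments handles pairs in the current directory. Since the .txt files are overwritten, it is worth trying this on a copy of the data first.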