I am very new to Python. I have a script that reads one input file (input1.txt) and generates one output file (output1.fasta). I would like to run this script on multiple files, for example input2.txt, input3.txt, and so on, and generate the corresponding outputs: output2.fasta, output3.fasta, etc.

from Bio import SeqIO

fasta_file = "sequences.txt" 
wanted_file = "input1.txt" 
result_file = "output1.fasta" 

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

I tried to add the glob function, but I do not know how to deal with the output file name.

from Bio import SeqIO
import glob

fasta_file = "sequences.txt"

for filename in glob.glob('*.txt'):

    wanted = set()
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line != "":
                wanted.add(line)

    fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")

The error message is: NameError: name 'result_file' is not defined

  • What's "not working" exactly? Can you show your code after you've tried with glob? Commented Aug 28, 2017 at 15:25
  • What isn't working with glob? Be specific so we can help. Commented Aug 28, 2017 at 15:25
  • sorry, I updated my question with the error message, etc. Commented Aug 28, 2017 at 15:49
  • You need to define your result_file variable at some point. Please see my answer for the issue with your current use of glob and how to create the result_file name based on the wanted_file name (as you previously called that variable). Commented Aug 28, 2017 at 15:57

1 Answer

Your glob currently picks up your "sequences" file as well as the inputs, because the pattern *.txt also matches sequences.txt. If the FASTA file is always the same and you only want to iterate over the input files, then you need a narrower pattern:

for filename in glob.glob('input*.txt'):
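
To see the difference between the two patterns without touching the disk, here is a minimal sketch using the standard fnmatch module, which applies the same shell-style wildcard rules that glob relies on. The list of file names is a made-up directory listing based on the names in the question.

```python
import fnmatch

# Hypothetical directory listing, using the file names from the question.
files = ["sequences.txt", "input1.txt", "input2.txt", "input3.txt"]

# '*.txt' matches every .txt file, including the FASTA source itself.
all_txt = fnmatch.filter(files, "*.txt")

# 'input*.txt' only matches names that start with "input".
inputs_only = fnmatch.filter(files, "input*.txt")

print(all_txt)
print(inputs_only)
```

With the narrower pattern, sequences.txt is no longer treated as a list of wanted IDs.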

Also, to repeat the whole process for each file, consider putting it in a function. And if the output filename always corresponds to the input filename, you can derive it dynamically.

from Bio import SeqIO
import glob

def create_fasta_outputs(fasta_file, wanted_file):
    result_file = wanted_file.replace("input","output").replace(".txt",".fasta")

    wanted = set()
    with open(wanted_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                wanted.add(line)
    # SeqIO.parse accepts a filename directly, so no bare open() is needed
    fasta_sequences = SeqIO.parse(fasta_file, 'fasta')
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")

fasta_file = "sequences.txt"
for wanted_file in glob.glob('input*.txt'):
    create_fasta_outputs(fasta_file, wanted_file)
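
One caveat with chained str.replace calls is that they rewrite every occurrence of "input" in the path, including directory names (for example, input_data/input2.txt would become output_data/output2.fasta). If that matters, a possible alternative is to rewrite only the file name with pathlib. The helper name output_name_for below is hypothetical and not part of the original answer; this is a sketch assuming the input files are always named input<something>.txt.

```python
from pathlib import Path
import re


def output_name_for(wanted_file):
    """Map e.g. 'input2.txt' to 'output2.fasta' (hypothetical helper)."""
    path = Path(wanted_file)
    # Rewrite only a leading "input" in the file name, never in a directory name.
    new_stem = re.sub(r"^input", "output", path.stem)
    # with_name() keeps the parent directory and swaps the final component.
    return str(path.with_name(new_stem + ".fasta"))


print(output_name_for("input1.txt"))
print(output_name_for("input_data/input2.txt"))
```

Inside create_fasta_outputs, the result_file line could then call this helper instead of the two replace calls.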

3 Comments

yes my fasta_file = "sequences.txt" is the same for all the input files. Your command is running without any error, but it is not creating any output.
Do you have sample data for the sequences.txt and input1.txt files? Now that the script will run, it is possible that some of your logic within it, such as if line != "": or if seq.id in wanted: is causing a lack of output.
You are correct, my fault. The command is running! thanks a lot.
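
For the "runs but creates no output" symptom discussed in these comments, a quick check is to compare the wanted IDs against the IDs actually present in the FASTA file. The sketch below does this without Biopython, assuming that a record ID is the first whitespace-delimited word after ">" on each header line (which is how Biopython's FASTA parser fills in seq.id). The function names and the sample data are made up for illustration.

```python
def fasta_ids(lines):
    """Collect the ID of every FASTA record: the first word after '>'."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def missing_ids(wanted, available):
    """Return the wanted IDs that do not occur in the FASTA file, sorted."""
    return sorted(set(wanted) - set(available))


# Made-up sample data; in practice, read the lines from sequences.txt and input1.txt.
sample_fasta = [">seq1 first record", "ACGT", ">seq2", "GGCC"]
sample_wanted = ["seq1", "seq3", ">seq2"]

print(missing_ids(sample_wanted, fasta_ids(sample_fasta)))
```

Any ID reported here would never satisfy if seq.id in wanted. A common cause is a stray ">" or trailing description text in the wanted list.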
