I have a fasta file with headers like this:
612407518| Streptomyces sp. MJ635-86F5 DNA, cremimycin biosynthetic gene cluster, complete sequence
84617315| Streptomyces achromogenes subsp. rubradiris complete rubradirin biosynthetic gene cluster, strain NRRL 3061
345134845| Streptomyces sp. SN-593 DNA, reveromycin biosynthetic gene cluster, complete sequence
323700993| Streptomyces autulyticus strain CGMCC 0516 geldanamycin polyketide biosynthetic gene cluster, complete sequence
15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster
1408941746| Streptomyces sp. strain OUC6819 rdm biosynthetic gene cluster, complete sequence
315937014| Uncultured organism CA37 glycopeptide biosynthetic gene cluster, complete sequence
29122977| Streptomyces cinnamonensis polyether antibiotic monensin biosynthetic gene cluster, partial sequence
257129259| Moorea producens 19L curacin A biosynthetic gene cluster, partial sequence
166159347| Streptomyces sahachiroi azinomycin B biosynthetic gene cluster, partial sequence
I want to keep only the word right before "biosynthetic gene cluster" in the header description, so that the results look like this:
612407518|cremimycin
84617315|rubradirin
345134845|reveromycin
323700993|polyketide
15823967|oligomycin
1408941746|rdm
315937014|glycopeptide
29122977|monensin
257129259|curacin A
166159347|azinomycin B
Here's what I've tried on my original files with more than 200 headers:
with open("test.txt") as f:
    for line in f:
        (id, name) = line.strip().split('|')
        term_list = name.split()
        term_index = term_list.index('biosynthetic')
        term = term_list[int(term_index)-1]
        header = id + '|' + term
        print(header)
The result is good, although the last two headers in my example above yield this:
257129259|A
166159347|B
I'll work on the 2nd problem because my original data contain lots of these.
Thank you all for the comments.
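For the headers where the name ends in a single-letter suffix (the "curacin A" and "azinomycin B" cases), one possible approach is to check whether the token before "biosynthetic" is a single uppercase letter and, if so, keep the token before it as well. This is only a minimal sketch: the function name shorten_header is made up for illustration, and the rule "a lone uppercase letter is a variant suffix" is an assumption that may not hold for every header in the real data.

```python
def shorten_header(line):
    """Return 'id|name', where name is the word right before 'biosynthetic'."""
    # Split on the first '|' only, so a '|' in the description cannot break unpacking.
    seq_id, description = line.strip().split('|', 1)
    terms = description.split()
    idx = terms.index('biosynthetic')  # raises ValueError if the word is missing
    name = terms[idx - 1]
    # Assumption: a lone uppercase letter ("A", "B") is a variant suffix,
    # so the word before it is part of the name too.
    if len(name) == 1 and name.isupper() and idx >= 2:
        name = terms[idx - 2] + ' ' + name
    return seq_id + '|' + name
```

Calling shorten_header on each line of the file in place of the loop body should leave the one-word cases unchanged and only affect headers matching the suffix rule.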
A regex is another option: (\d*\|)(?:.*)\s(\w+)\s(?=bio). Group 1 is the number followed by |, and group 2 is the word right before "bio...".
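A short sketch of how that pattern could be applied with Python's re module. The helper name shorten_with_regex is hypothetical. Note that this pattern behaves like the split-based attempt: it captures a single word, so "curacin A" would still come out as "A".

```python
import re

# Pattern from the comment: group 1 is "number|", group 2 is the single
# word immediately before a word that starts with "bio".
PATTERN = re.compile(r'(\d*\|)(?:.*)\s(\w+)\s(?=bio)')

def shorten_with_regex(line):
    """Return 'id|word', or None if the header does not match the pattern."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1) + match.group(2)

# Example usage on one header from the question:
print(shorten_with_regex(
    "15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster"
))  # prints 15823967|oligomycin
```

Because .* is greedy, the match settles on the last word that is followed by a word starting with "bio", which is the intended one in all ten sample headers.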