
I have a tsv file, and I am trying to print the data I acquire to a specific header in my tsv file.

This is what my table looks like. Right now, I am trying to scan through some data and print to the 1-generated, 1-chains, 2-generated, 2-chains, ... columns based on the number of matches I find.

(example image of the tsv table)

The problem is, I need to print data to each generated column without printing to the first 3 columns. Also, I am trying to print to each column in a specific way: for the "generated" columns, I want to print only to the generated column, and not the chains column, when looking for generated data, and the same for the chains columns.

(example image of the pdb file)

In the example, I need to print the first word that comes after REMARK 350 on line 1 (the AUTHOR line), and for line 2 I need to print the chain letter.
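For reference, the REMARK 350 block I am scanning looks roughly like the string below, and the sketch shows the kind of extraction I mean (typed from the standard PDB format, not copied from my actual file or code):

remark = """REMARK 350 BIOMOLECULE: 1
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: TETRAMERIC
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B"""

for line in remark.splitlines(): #iterate over the example lines
  if 'AUTHOR DETERMINED BIOLOGICAL UNIT' in line:
    print(line.rsplit(': ', 1)[1]) #e.g. TETRAMERIC, the value for the "generated" column
  elif 'APPLY THE FOLLOWING TO CHAINS' in line:
    print(line.rsplit(': ', 1)[1]) #e.g. A, B, the value for the "chains" column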

Desired Output:

(image of the desired output)

2 Comments

  • It is difficult to comment upon this issue without knowing what the pdb files look like and how you would like to write the data to the tsv file. It would be helpful if you could update the question with a plaintext (no image) sample of the tsv file, of a pdb file, and an example of the desired output. Also, biopandas might be a relevant module for your work. Commented May 3, 2021 at 18:34
  • Hello, I updated the question. Commented May 3, 2021 at 19:19

1 Answer


I suggest first extracting the relevant data from the files into a list of dictionaries, and then creating the dataframe from that list:

from glob import glob
import pandas as pd

files = glob('./folder_name/*.pdb') #specify the path to your folder with pdb files to create a list of all files

all_data = [] #empty list to populate

for filename in files: #iterate over the files
  with open(filename, "r") as f:
    data = {'FILENAME': filename.split('/')[-1].split('.')[0]} #create dictionary to populate with data; keep only the filename, without the folder or extension
    lines = f.read().splitlines() #create list of lines

    for line in lines: #iterate over lines
      if 'REMARK 350' in line:
        if 'BIOMOLECULE: ' in line:
          nr = int(line.rsplit(': ', 1)[1].strip()) #extract number
        elif 'AUTHOR DETERMINED BIOLOGICAL UNIT: ' in line:
          data[f'{nr}-generated'] = line.rsplit(': ', 1)[1].strip() #populate dict, key is dynamically generated from the number
        elif 'APPLY THE FOLLOWING TO CHAINS: ' in line:
          data[f'{nr}-chains'] = line.rsplit(': ', 1)[1].strip()
    data['BIOMOLECULES'] = list(range(1, nr+1)) #add list of biomolecules
  all_data.append(data) #append dict to list

df = pd.DataFrame(all_data) #create dataframe

Running this on two pdb files I got from GitHub gave me this output:

  FILENAME 1-generated 1-chains 2-generated 2-chains BIOMOLECULES
0     4gjt  TETRAMERIC     A, B   MONOMERIC        C       [1, 2]
1  2n0n_M1   MONOMERIC        A         NaN      NaN          [1]
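Since your question mentions a tsv file, you can also write the resulting dataframe back out as tab-separated text; this is just a one-line sketch, and 'output.tsv' is a placeholder name:

df.to_csv('output.tsv', sep='\t', index=False) #write the dataframe as tab-separated values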

6 Comments

  • I tried this. Is there a way to use pandas to simply search for each keyword in all of my files and print to each column based on the column name?
  • Well, that would be possible, but inefficient, as it would require a lookup (an iteration over the lines) for every single keyword search, roughly as in the first sketch after these comments. In my example I iterate only once per file.
  • How should I specify the location of the files on line number 3? I tried putting the direct location, but then I got an error saying index out of range for nr = int(line.rsplit(': ', 1)[1].strip()).
  • The example assumes all pdb files are in the same folder. glob creates a list of the files in that folder and the loop iterates over them. In glob('./folder_name/*.pdb'), replace ./folder_name/ with the path to the folder with your pdb files; *.pdb indicates that glob should look for all files with the extension .pdb.
  • I got an error on the line nr = int(line.rsplit(': ', 1)[1].strip()): IndexError: list index out of range (see the second sketch after these comments).
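For illustration, the per-keyword pandas lookup mentioned in the comments could look roughly like the sketch below ('some_file.pdb' is a placeholder filename, and this is not the approach I recommend); every keyword triggers another full scan of the lines:

import pandas as pd

lines = pd.Series(open('some_file.pdb').read().splitlines()) #read one file into a Series of lines

keywords = {'generated': 'AUTHOR DETERMINED BIOLOGICAL UNIT: ',
            'chains': 'APPLY THE FOLLOWING TO CHAINS: '}

for column, keyword in keywords.items():
  hits = lines[lines.str.contains(keyword, regex=False)] #full scan of all lines for every keyword
  print(column, hits.str.rsplit(': ', n=1).str[-1].str.strip().tolist()) #values after the colon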
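Regarding the IndexError reported above: one possible cause (an assumption, since the failing file is not shown) is a BIOMOLECULE line that does not contain the exact separator ': '. A more defensive version of the line that extracts the number could be:

parts = line.rsplit(':', 1) #split on ':' alone, tolerating a missing space after the colon
if len(parts) == 2 and parts[1].strip().isdigit():
  nr = int(parts[1].strip()) #only update nr when a number is actually present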
