
I have a tsv file, and I am trying to print the data I acquire to a specific header in my tsv file.

This is what my table looks like. Right now, I am trying to scan through some data and print to the 1-generated, 1-chains, 2-generated, 2-chains, ... columns based on the number of matches I find.

(example image of the tsv table)

The problem is, I need to print data to each generated column without printing to the first 3 columns. Also, I am trying to print to each column in a specific way: for the "generated" columns, I want to print only to the generated column, and not the chains column, when looking for generated data, and the same for the chains columns.

(example image of the pdb file)

In the example, I need to print the first word that comes after REMARK 350 on line 1 (the AUTHOR line), and for line 2 I need to print the chain letter.
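For reference, the REMARK 350 block I am scanning looks roughly like the string below, and the sketch shows the kind of extraction I mean (typed from the standard PDB format, not copied from my actual file or code):

remark = """REMARK 350 BIOMOLECULE: 1
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: TETRAMERIC
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B"""

for line in remark.splitlines(): #iterate over the example lines
  if 'AUTHOR DETERMINED BIOLOGICAL UNIT' in line:
    print(line.rsplit(': ', 1)[1]) #e.g. TETRAMERIC, the value for the "generated" column
  elif 'APPLY THE FOLLOWING TO CHAINS' in line:
    print(line.rsplit(': ', 1)[1]) #e.g. A, B, the value for the "chains" column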

Desired Output:

(image of the desired output)

2 Comments

  • It is difficult to comment upon this issue without knowing what the pdb files look like and how you would like to write the data to the tsv file. It would be helpful if you could update the question with a plaintext (no image) sample of the tsv file, of a pdb file, and an example of the desired output. Also, biopandas might be a relevant module for your work. Commented May 3, 2021 at 18:34
  • Hello, I updated the question. Commented May 3, 2021 at 19:19

1 Answer


I suggest first extracting the relevant data from the files into a list of dictionaries, and then creating the dataframe from that list:

from glob import glob
import pandas as pd

files = glob('./folder_name/*.pdb') #specify the path to your folder with pdb files to create a list of all files

all_data = [] #empty list to populate

for filename in files: #iterate over the files
  with open(filename, "r") as f:
    data = {'FILENAME': filename.split('/')[-1].split('.')[0]} #create dictionary to populate with data; keep only the filename, without the folder or extension
    lines = f.read().splitlines() #create list of lines

    for line in lines: #iterate over lines
      if 'REMARK 350' in line:
        if 'BIOMOLECULE: ' in line:
          nr = int(line.rsplit(': ', 1)[1].strip()) #extract number
        elif 'AUTHOR DETERMINED BIOLOGICAL UNIT: ' in line:
          data[f'{nr}-generated'] = line.rsplit(': ', 1)[1].strip() #populate dict, key is dynamically generated from the number
        elif 'APPLY THE FOLLOWING TO CHAINS: ' in line:
          data[f'{nr}-chains'] = line.rsplit(': ', 1)[1].strip()
    data['BIOMOLECULES'] = list(range(1, nr+1)) #add list of biomolecules
  all_data.append(data) #append dict to list

df = pd.DataFrame(all_data) #create dataframe

Running this on two pdb files I got from GitHub gave me this output:

  FILENAME 1-generated 1-chains 2-generated 2-chains BIOMOLECULES
0     4gjt  TETRAMERIC     A, B   MONOMERIC        C       [1, 2]
1  2n0n_M1   MONOMERIC        A         NaN      NaN          [1]
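Since your question mentions a tsv file, you can also write the resulting dataframe back out as tab-separated text; this is just a one-line sketch, and 'output.tsv' is a placeholder name:

df.to_csv('output.tsv', sep='\t', index=False) #write the dataframe as tab-separated values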

6 Comments

  • I tried this. Is there a way to use pandas to simply search for each keyword in all of my files and print to each column based on the column name?
  • Well, that would be possible, but inefficient, as it would require a lookup (an iteration over the lines) for every single keyword search, roughly as in the first sketch after these comments. In my example I iterate only once per file.
  • How should I specify the location of the files on line number 3? I tried putting the direct location, but then I got an error saying index out of range for nr = int(line.rsplit(': ', 1)[1].strip()).
  • The example assumes all pdb files are in the same folder. glob creates a list of the files in that folder and the loop iterates over them. In glob('./folder_name/*.pdb'), replace ./folder_name/ with the path to the folder with your pdb files; *.pdb indicates that glob should look for all files with the extension .pdb.
  • I got an error on the line nr = int(line.rsplit(': ', 1)[1].strip()): IndexError: list index out of range (see the second sketch after these comments).
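For illustration, the per-keyword pandas lookup mentioned in the comments could look roughly like the sketch below ('some_file.pdb' is a placeholder filename, and this is not the approach I recommend); every keyword triggers another full scan of the lines:

import pandas as pd

lines = pd.Series(open('some_file.pdb').read().splitlines()) #read one file into a Series of lines

keywords = {'generated': 'AUTHOR DETERMINED BIOLOGICAL UNIT: ',
            'chains': 'APPLY THE FOLLOWING TO CHAINS: '}

for column, keyword in keywords.items():
  hits = lines[lines.str.contains(keyword, regex=False)] #full scan of all lines for every keyword
  print(column, hits.str.rsplit(': ', n=1).str[-1].str.strip().tolist()) #values after the colon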
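Regarding the IndexError reported above: one possible cause (an assumption, since the failing file is not shown) is a BIOMOLECULE line that does not contain the exact separator ': '. A more defensive version of the line that extracts the number could be:

parts = line.rsplit(':', 1) #split on ':' alone, tolerating a missing space after the colon
if len(parts) == 2 and parts[1].strip().isdigit():
  nr = int(parts[1].strip()) #only update nr when a number is actually present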
