Python: replace only one occurrence in a string

Question

I have some sample data which looks like:

ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A  61     -21.610   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A  61     -21.047   7.452  67.937  1.00 12.13           N

I want to replace the value in the 6th column, and only the 6th column, with that value plus an offset. In the case above the offset is 308.

Since 61 + 308 = 369, the 61 in the 6th column should be replaced by 369.

I can't str.split() the line as the line spacing is very important.

I have tried using str.replace(), but the values in column 2 can also overlap with column 6.

I did try reversing the line and using str.replace(), but the values in columns 7, 8, 9, 10 and 11 can overlap with the string to be replaced.

The ugly code I have so far is below. It works except when the value also occurs in columns 7, 8, 9, 10 and/or 11:

with open('2kqx.pdb', 'r') as inf, open('2kqx_renumbered.pdb', 'w') as outf:
    for line in inf:
        if line.startswith('ATOM'):
            segs = line.split()
            if segs[4] == 'A':
                offset = 308
                number = segs[5][::-1]
                replacement = str((int(segs[5])+offset))[::-1]
                print number[::-1],replacement
                line_rev = line[::-1]
                replaced_line = line_rev.replace(number,replacement,1)
                print line
                print replaced_line[::-1]
                outf.write(replaced_line[::-1])

The code above produced the output below. As you can see, in the second line the 6th column is not changed; the matching digits in column 7 are replaced instead. I thought that by reversing the string I could bypass the potential overlap with column 2, but I forgot about the other columns and I don't really know how to get around it.

ATOM    973  CG  ARG A  369     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A  61     -21.3690   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A  369     -21.047   7.452  67.937  1.00 12.13           N
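The failure in the second line can be reproduced in isolation. The sketch below (Python 3 syntax, with the line copied from the sample data above) shows that replacing the first match in the reversed string is the same as replacing the last match in the original string, and that the last "61" on this line sits inside -21.610 rather than in column 6.

```python
line = "ATOM    974  CD  ARG A  61     -21.610   7.433  69.314  1.00 23.44           C"

# Every index at which "61" occurs in the line.
positions = [i for i in range(len(line)) if line.startswith("61", i)]
print(positions)  # one hit is the residue number, the other is inside -21.610

# Replacing the first match in the reversed line is equivalent to
# replacing the LAST match in the original line.
result = line[::-1].replace("16", "963", 1)[::-1]
print(result)
```

Any approach that searches for the digits, in either direction, has this problem; only the position of the field identifies it reliably.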

Detail, but as the numbers in the 6th column will probably gain an extra digit, do you want that digit to 1/ shift all further columns to the right, 2/ eat up a space on the right side, or 3/ eat up a space on the left side?
– user707650, Commented Feb 22, 2013 at 12:20
You should edit your question (heading) so that it better shows that you are looking for solutions to shift residue IDs in PDB files programmatically from within Python. Currently it is not likely to be found by other users with a similar problem.
– tzelleke, Commented Feb 22, 2013 at 15:27

user707650 · Accepted Answer · 2013-02-22 12:29:01Z

2

data = """\
ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A  61     -21.610   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A  61     -21.047   7.452  67.937  1.00 12.13           N"""

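offset = 308
for line in data.split('\n'):
    # Characters 22..30 hold the residue number plus its padding (9 characters).
    # Rewrite them as 2 spaces + left-aligned 5-wide number + 2 spaces, so the
    # line length and all later columns stay unchanged.
    line = line[:22] + "  {:<5d}  ".format(int(line[22:31]) + offset) + line[31:]
    print(line)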

I haven't counted the whitespace exactly; the indices are a rough estimate. If you want more flexibility than having the numbers 22 and 31 scattered through your code, you'll need a way to determine the start and end index (but that contradicts my assumption that the data is in fixed-column format).
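One way to determine the start and end index, sketched here under the assumption that the fields on each line really are separated by whitespace (real PDB files do not guarantee this, since adjacent fixed-width fields can run together). The helper name shift_field is hypothetical. It locates the n-th field with a regular expression, keeps the field's right edge in place, and lets a longer number grow to the left into the preceding spaces.

```python
import re


def shift_field(line, field_index, offset):
    """Add `offset` to the integer in a whitespace-delimited field.

    `field_index` is 0-based. The field's right edge stays where it was;
    a longer number grows to the left into the preceding spaces.
    """
    matches = list(re.finditer(r"\S+", line))
    start, end = matches[field_index].span()
    new_text = str(int(line[start:end]) + offset)

    width = max(end - start, len(new_text))
    new_start = end - width
    # Refuse to overwrite anything other than spaces.
    if new_start < 0 or line[new_start:start].strip():
        raise ValueError("no room to widen field %d" % field_index)
    return line[:new_start] + new_text.rjust(width) + line[end:]


line = "ATOM    974  CD  ARG A  61     -21.610   7.433  69.314  1.00 23.44           C"
shifted = shift_field(line, 5, 308)
print(shifted)
```

Because the replacement is done by position rather than by searching for the digits, the "61" inside -21.610 is never touched, and the line keeps its original length.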


3 Comments

user707650 Over a year ago

Note: you can play around with the formatter {:3d} for the output to be left or right aligned, with {:<3d} or {:>3d}. See docs.python.org/2/library/….

user707650 Over a year ago

Minor edit: I've changed the formatting so that new number eats up space to the right, and should allow for larger numbers as well (hence a change from {:3d} to {:<5d}).
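A small illustration of the alignment options mentioned in these comments, using only str.format. The values 369 and 1234 are arbitrary examples.

```python
# With a width and no alignment flag, integers are right-aligned.
default = "{:5d}".format(369)
left = "{:<5d}".format(369)    # pad on the right
right = "{:>5d}".format(369)   # pad on the left
# The width is a minimum: a longer number is not truncated.
wide = "{:3d}".format(1234)

print(repr(default), repr(left), repr(right), repr(wide))
```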

Harpal Over a year ago

Thanks, this worked great. I'm familiar with string formatting, so I'll have a play around with it.

tzelleke · Answer · 2013-02-22 15:21:15Z

1

You'd better not try to parse PDB files on your own.

Use a PDB parser. There are many freely available in various bio/computational chemistry packages, for instance

biopython

Here's how to do it with biopython, assuming your input is raw.pdb:

from Bio.PDB import PDBParser, PDBIO

parser = PDBParser()
structure = parser.get_structure('some_id', 'raw.pdb')
for r in structure.get_residues():
    # r.id is (hetero flag, residue number, insertion code); shift only the number
    r.id = (r.id[0], r.id[1] + 308, r.id[2])
io = PDBIO()
io.set_structure(structure)
io.save('shifted.pdb')

I googled a bit and found a quick solution to your specific problem (without third-party dependencies) here:

http://code.google.com/p/pdb-tools/

Among many other useful PDB Python script tools, there is the script pdb_offset.py.

It is a standalone script, and I just copied its pdbOffset function to show it working; your three-line example data is in raw.pdb:

def pdbOffset(pdb_file, offset):
    """
    Adds an offset to the residue column of a pdb file without touching anything
    else.
    """

    # Read in the pdb file
    f = open(pdb_file,'r')
    pdb = f.readlines()
    f.close()

    out = []
    for line in pdb:
        # For an ATOM or TER record, update the residue number
        if line[0:6] == "ATOM  " or line[0:6] == "TER   ":
            num = offset + int(line[22:26])
            out.append("%s%4i%s" % (line[0:22],num,line[26:]))
        else:
            out.append(line)

    return "".join(out)


print(pdbOffset('raw.pdb', 308))

which prints

ATOM    973  CG  ARG A 369     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A 369     -21.610   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A 369     -21.047   7.452  67.937  1.00 12.13           N
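The slice 22:26 in pdbOffset is not arbitrary: in the PDB format's ATOM record the residue sequence number occupies columns 23-26 (1-based), right-aligned, with the insertion code in column 27. As a sketch of what reading by fixed column looks like, the hypothetical helper below pulls out a few neighbouring fields. The column ranges come from the PDB format description, not from this thread, and the choice of fields is only illustrative.

```python
def parse_atom_fields(line):
    """Read a few fixed-column fields from a PDB ATOM line (0-based slices)."""
    return {
        "serial": int(line[6:11]),        # atom serial number
        "name": line[12:16].strip(),      # atom name
        "res_name": line[17:20].strip(),  # residue name
        "chain_id": line[21],             # chain identifier
        "res_seq": int(line[22:26]),      # residue sequence number
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


line = "ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C"
fields = parse_atom_fields(line)
print(fields)
```

Slicing by column keeps working when neighbouring fields touch each other, which is exactly the case where str.split() breaks down.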

edited Feb 22, 2013 at 15:21

answered Feb 22, 2013 at 13:24


2 Comments

user707650 Over a year ago

Can you explain why one should not do it as in the accepted answer, but use a PDB parser instead? Also, what is a PDB parser? (pdb to me means Python debugger.)

tzelleke Over a year ago

@Evert the question concerns protein structure data that is stored in a file format called PDB, which stands for Protein Data Bank. There are many subtle issues with this file format, so it is better not to attempt to write your own parser but to rely on well-established tools to do the job. Your answer is certainly the best fit for how the question is presented here, but I am referring to the OP's way of parsing a PDB file on his own. I'll make that clear in my answer.
