2

This question is a little difficult to articulate with my inadequate English but I will do my best.

I have a directory of xml files, each file contains xml such as:

<root>
    <fields>
        <field>
            <description/>
            <region id="Number.T2S366_R_487" page="1"/>
        </field>
        <field>
            <description/>
            <region id="Number.T2S366_R_488.`0" page="1"/>
            <region id="String.T2S366_R_488.`1" page="1"/>
        </field>
    </fields>
</root>

I'd like to do a String replacement on the lines which contain the dot, tick, number notation such as .`0 with an index notation like [0],[1], [2], ... and so forth.

So the transformed xml payload should look like something below:

<root>
    <fields>
        <field>
            <description/>
            <region id="Number.T2S366_R_487" page="1"/>
        </field>
        <field>
            <description/>
            <region id="Number.T2S366_R_488[0]" page="1"/>
            <region id="String.T2S366_R_488[1]" page="1"/>
        </field>
    </fields>
</root>

How can I accomplish this using python? This seems fairly straight forward to do using regex but that would be difficult to do for a directory of files containing multiple files. I'd like to see an implementation using python 3.x, as I am learning it.

1

3 Answers 3

3

In Python you can loop over all files in your directory with os.listdir and make substitutions in-place with fileinput:

import os
import fileinput

path = '/home/arabian_albert/'
for f in os.listdir(path):
    with fileinput.FileInput(f, inplace=True, backup='.bak') as file:
        for line in file:
            print(re.sub(r'\.`(\d+)', r'\[\1\]', line), end='')

However, you should consider doing this from the command line with sed:

find . -type f -exec sed -i.bak -E "s/\.`([0-9]+)/[\1]/g" {} \;

The above will make the substitution for all files in the current directory, and backup with old files with .bak.

Sign up to request clarification or add additional context in comments.

2 Comments

Oh nice implementation, I especially like the way of backing up old files with the .bak extension. I guess there is no way around regex
Yes, I left that in there in case I screwed up :)
2

You can use a simple regex to do that:

import re
sample_str = """
<root>
    <fields>
        <field id="S366/487" type="xs:int" bind="T2S366/487">
            <description/>
            <region id="WholeNumberWithSeparator.T2S366_R_487" page="1"/>
        </field>
        <field id="S366/488" type="xs:int" bind="T2S366/488">
            <description/>
            <region id="Number.T2S366_R_488.`0" page="1"/>
            <region id="String.T2S366_R_488.`1" page="1"/>
        </field>
    </fields>
</root>
"""
pattern = "\.`(\d+)"
result = re.sub(pattern, lambda x: "[{}]".format(x.groups()[0]), sample_str)
print result

yields

<root>
    <fields>
        <field id="S366/487" type="xs:int" bind="T2S366/487">
            <description/>
            <region id="WholeNumberWithSeparator.T2S366_R_487" page="1"/>
        </field>
        <field id="S366/488" type="xs:int" bind="T2S366/488">
            <description/>
            <region id="Number.T2S366_R_488[0]" page="1"/>
            <region id="String.T2S366_R_488[1]" page="1"/>
        </field>
    </fields>
</root>

1 Comment

Thanks, very straightforward implementation
1

How about this:

wholefile = ''

with open(r'xml_input.xml', 'r+') as f:
    lines = f.readlines()
    for line in lines:
        split_line = line.split('.')  # split at periods
        end_point = split_line.pop(-1)  # get and remove existing endpoint
        if end_point[0] == '`':  # if it matches tick notation
            idx_after_num = end_point.find('"')  # get the first index that matches a double quote
            the_int = end_point[1:idx_after_num]  # slice from after the tick to the end of the int
            end_point = list(end_point)  # convert to list
            del(end_point[:idx_after_num])  # delete up to the double quote
            end_point = ''.join(end_point)  # reconstruct string
            new_endpoint = '[{}]'.format(the_int) + end_point  # create new endpoint
            split_line += [new_endpoint]  # append new endpoint to end of list of split strs
            new_line = ''  # new empty string
            for n, segment in enumerate(split_line):
                if n >= len(split_line) - 2:  # if we're at or beyond the endpoint
                    new_line += segment  # concatenate the new endpoint
                else:
                    new_line += segment + '.'  # concatenate, replacing the needed '.'s
            wholefile += new_line  # replace, with changes
        else:
            wholefile += line  # replace, with no changes

with open('xml_out.xml', 'w+') as f:
    f.write(wholefile)

my output:

<root>
    <fields>
        <field id="S366/487" type="xs:int" bind="T2S366/487">
            <description/>
            <region id="WholeNumberWithSeparator.T2S366_R_487" page="1"/>
        </field>
        <field id="S366/488" type="xs:int" bind="T2S366/488">
            <description/>
            <region id="Number.T2S366_R_488[0]" page="1"/>
            <region id="String.T2S366_R_488[1]" page="1"/>
        </field>
    </fields>
</root>

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.