2

I apply regular expressions to an XML file to find and replace values. Normally it works. (I can hear the voices saying "use an XML parser", but for now I cannot.) But if a value contains a special character, it ruins everything.

Suppose I have an XML file like the one below:

<fieldset>
  <idle1>
     <value>something\\n</value>
  </idle1>
  <idle2>
    <value>blabla</value>
  </idle2>
</fieldset>

If I try to replace the value in the "<idle2><value>" node, the value of the "<idle1><value>" node becomes "something\n". And when it is written to the file, the XML becomes:

    <fieldset>
      <idle1>
         <value>something
</value>
      </idle1>
      <idle2>
        <value>blabla</value>
      </idle2>
    </fieldset>

In both the search and the replace I use the "r" string literal prefix, but it does not seem to work. I have worked around the problem: for every search and replace, I replace each "\n" with "\\n" and then write the result to the file. But that is not an efficient approach.

Is there something I am missing? I just want to write "\\n" to the file. Is that too much to ask?

Edit: here are my regexes.

For the search:

self.searchPattern = r'(<fieldset>)(.*?)(<idle2>)(.*?)(<value>)(.*?)(</value>)(.*?)(</idle2>)(.*?)(</fieldset>)'

For the replacement:

self.replacePattern = r'\g<1>\g<2>\g<3>\g<4><value>denemeasdasd\\\\n</value>\g<8>\g<9>\g<10>\g<11>'

This is the Python code for the search:

self.pattern = re.compile(r''''''+self.searchPattern+'''''', flags = re.S | re.U)

And this is the code for the replacement:

outtext = self.pattern.sub(r''''''+self.replacePattern+'''''',r''''''+self.match.group(0)+'''''')
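Two behaviours of Python seem relevant to the code above. The sketch below is a minimal illustration, assuming Python 3; the names `value`, `wrapped`, `as_template` and `as_function` are made up for the example and do not come from the original code. It shows that an empty raw literal concatenated with a variable does not make the variable "raw", and that `re.sub` expands backslash escapes in a replacement string but not in the return value of a replacement function. Whether this is the actual cause of the lost backslash cannot be confirmed from the code shown.

```python
import re

value = r"something\\n"  # two backslashes followed by the letter n

# The r prefix only affects the literal it is attached to. Here that literal
# is empty, so the concatenation returns the variable unchanged.
wrapped = r'''''' + value + ''''''

# A replacement *string* is a template: re.sub expands escapes such as \n in it.
as_template = re.sub("X", r"a\nb", "X")

# A replacement *function* returns text that is inserted verbatim.
as_function = re.sub("X", lambda m: r"a\nb", "X")

print(wrapped == value)   # True
print(repr(as_template))  # 'a\nb'  (a real newline)
print(repr(as_function))  # 'a\\nb' (backslash kept)
```

So if text that already contains backslashes is later reused as a replacement string, each pass through `re.sub` consumes one level of escaping.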

2 Answers

1

I don't understand your explanations.

Personally, I wrote this:

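import re

# ch was not defined in the original post; reconstructed from the repr shown in the result.
ch = ('<fieldset>\n  <idle1>\n    <value>something\\n</value>\n  </idle1>\n'
      '  <idle2>\n    <value>blabla</value>\n  </idle2>\n</fieldset>')

# Group 1 captures everything up to <value> inside <idle2>; the lookahead checks the closing tags.
RE = ('(^([ \t]+)<(idle2)>(?:\n|\r\n?)[ \t]+<value>)'
      '(.*?)'
      '(?=</value>(?:\n|\r\n?)\\2</\\3>)')

print(repr(ch), '\n')
print(ch)
print('\n-------------------------------------------------')
print(repr(re.sub(RE, '\\1AAA', ch, flags=re.M)), '\n')
print(re.sub(RE, '\\1-----HHHHHHXXXXXXX-------', ch, flags=re.M))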

Result:

'<fieldset>\n  <idle1>\n    <value>something\\n</value>\n  </idle1>\n  <idle2>\n    <value>blabla</value>\n  </idle2>\n</fieldset>'

<fieldset>
  <idle1>
    <value>something\n</value>
  </idle1>
  <idle2>
    <value>blabla</value>
  </idle2>
</fieldset>

-------------------------------------------------
'<fieldset>\n  <idle1>\n    <value>something\\n</value>\n  </idle1>\n  <idle2>\n    <value>AAA</value>\n  </idle2>\n</fieldset>'

<fieldset>
  <idle1>
    <value>something\n</value>
  </idle1>
  <idle2>
    <value>-----HHHHHHXXXXXXX-------</value>
  </idle2>
</fieldset>

Is this what you want?
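If the new value itself contains backslashes, as in the question, one option is to pass a function as the replacement so that the value is inserted verbatim. This is a minimal sketch, assuming Python 3. `IDLE2_VALUE`, `set_idle2_value`, `sample` and `result` are hypothetical names, and the sample assumes the file holds two literal backslashes before the n.

```python
import re

# Same pattern as in the answer, written with raw string literals.
IDLE2_VALUE = re.compile(
    r'(^([ \t]+)<(idle2)>(?:\n|\r\n?)[ \t]+<value>)'
    r'(.*?)'
    r'(?=</value>(?:\n|\r\n?)\2</\3>)',
    flags=re.M)


def set_idle2_value(xml_text, new_value):
    """Hypothetical helper: put new_value, unchanged, into <idle2><value>."""
    # A callable replacement is not treated as a template, so backslashes
    # in new_value are written exactly as given.
    return IDLE2_VALUE.sub(lambda m: m.group(1) + new_value, xml_text)


sample = ('<fieldset>\n  <idle1>\n'
          r'    <value>something\\n</value>'
          '\n  </idle1>\n  <idle2>\n    <value>blabla</value>\n  </idle2>\n'
          '</fieldset>')

result = set_idle2_value(sample, r'denemeasdasd\\n')
print(result)
```

Only the text between `<value>` and `</value>` inside `<idle2>` is touched; the `<idle1>` value is never part of the replaced span, so its backslashes are left alone.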

2 Comments

@savruk Thank you. If it's really a good answer and the best one, you could accept it by clicking on the white chevron button under the triangular downvoting button. It gives the answer 25 points instead of 10.
@savruk Thanks. I am not obsessed with points, but having some allows one to be downvoted on other answers without worrying too much about one's own stupidity (yes, I have written stupid answers).
0

When dealing with unpredictable data sources, I find it best to whitelist valid characters. So along with whatever other regular expression replacements you are doing, remove anything that is not whitelisted, e.g. anything other than a-z 0-9 : , . -

Look at your data and determine the appropriate whitelist for your task.
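A minimal sketch of the whitelist idea, assuming Python 3 and taking the example characters above literally (lowercase letters, digits, colon, comma, period, hyphen). `NOT_ALLOWED` and `keep_whitelisted` are hypothetical names.

```python
import re

# Anything that is NOT one of: a-z, 0-9, colon, comma, period, hyphen.
NOT_ALLOWED = re.compile(r'[^a-z0-9:,.\-]')


def keep_whitelisted(value):
    """Hypothetical helper: drop every character outside the whitelist."""
    return NOT_ALLOWED.sub('', value)


print(keep_whitelisted('something\\n'))  # somethingn
print(keep_whitelisted('a-b:c,d.e'))     # a-b:c,d.e
```

Note that this discards the backslash rather than preserving it, so it only fits cases where such characters carry no meaning in the data.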

6 Comments

Well, what I do is similar to what you describe. But there must be a way to handle it in the regex.
So, to use my method: before your replacement, do a regex replace of all non-whitelisted characters with ''. Then you won't have to worry about handling any hidden or special characters in your current code. With any luck you won't have to change anything you already have.
@savruk What's the connection between <fieldset>, <idle2>, <value> and <name>, <lastname>, <adress>, <workaddress>? What's the meaning of \g<1> etc.? What is this: denemeasdasd? What is self.searchPattern? ..... ??
@eyquem I edited the XML; you can see the connection now. \g<1> means group 1, which stays the same. And "denemeasdasd\\\n" is the replacement value (new value) of "<idle2><value>".
@savruk "If I try to replace value in "<idle2><value>" node, value of "<idle1><value>" node becomes "something\n"". Why? Show the code, please. It shouldn't happen. In fact, do you replace in all the <idle.><value> nodes indiscriminately, regardless of the number?