
I have this partial XML:

   string = ''' 
   <x:root>
       <x:tag1 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue" />
       <x:tag2 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue">
          someValue
       </x:tag2>
       <x:tag3> someValue
    '''

Now I would like to "stupidly" repair it. My idea is to use a regex to collect all of the start elements and end elements, check which closing element is missing, and just add it, without going into too much detail. What I've come up with so far is this (and it does not work):

import re
starts = re.compile(r'(?<=<)x:\w+(?=>)|(?<=<)x:\w+(?! .+ />)')
print(starts.findall(string))

What I expect is a list of x:root, x:tag2, x:tag3.

I've been googling and trying a lot but could not find an answer. The only thing I get from this expression is x:root, x:tag1, x:tag3.

Please help

Thanks


3 Answers


BeautifulSoup might be able to repair it:

import BeautifulSoup

content = ''' 
<x:root>
   <x:tag1 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue" />
   <x:tag2 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue">
      someValue
   </x:tag2>
   <x:tag3> someValue
'''

soup = BeautifulSoup.BeautifulStoneSoup(content)
print(soup.prettify())

yields

<x:root>
 <x:tag1 x:anyattrib="anyValue" x:anyattrib="anyValue" x:anyattrib="anyValue">
  <x:tag2 x:anyattrib="anyValue" x:anyattrib="anyValue" x:anyattrib="anyValue">
   someValue
  </x:tag2>
  <x:tag3>
   someValue
  </x:tag3>
 </x:tag1>
</x:root>

1 Comment

Hey, and thanks. My first issue is that I can't download any modules to the machine I work on. My second issue is that, when using BS, tag1 gets opened and is only closed later. I need to leave tag1 as it was. Any thoughts?

Thanks to alexis for helping me.

The correct expression is:

re.findall(r'<\s*(x:\w+)[^>]*(?<!/)>', string)

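This expression handles both cases correctly:

first, a plain opening tag such as <tag> is matched;

second, a self-closing tag such as <tag attrib1="value" attrib2="value" attribN="value"/> is skipped, because the (?<!/) lookbehind rejects a "/" directly before the closing ">".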

I tried some of Python's built-in parsers with no luck, and also BeautifulSoup, which unfortunately did not fix the XML exactly the way I expected it to.
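
As a rough sketch of the "find what is missing and add it" idea from the question, the same lookbehind can be combined with a stack. This is only an illustration under assumptions: the names close_open_tags, TAG and broken are made up here, the x: prefix is taken from the question, and a closing tag that does not match the innermost open element is simply ignored.

```python
import re

# Sample modelled on the question: the closing tags for x:tag3 and
# x:root are missing. Attribute names are shortened for readability.
broken = '''
<x:root>
    <x:tag1 x:a="1" x:b="2" />
    <x:tag2 x:a="1">
        someValue
    </x:tag2>
    <x:tag3> someValue
'''

# One pattern for opening and closing tags. Self-closing tags are
# rejected by the (?<!/) lookbehind just before the final '>'.
TAG = re.compile(r'<\s*(/?)\s*(x:\w+)[^>]*(?<!/)>')


def close_open_tags(text):
    """Append closing tags for elements that were opened but never closed."""
    stack = []
    for match in TAG.finditer(text):
        is_closing, name = match.group(1), match.group(2)
        if not is_closing:
            stack.append(name)       # opening tag: remember it
        elif stack and stack[-1] == name:
            stack.pop()              # matching closing tag: forget it
        # Any other closing tag is ignored in this sketch.
    # Close the innermost element first.
    closers = ''.join('</%s>' % name for name in reversed(stack))
    return text.rstrip() + closers + '\n'


# The expression from this answer, applied to the sample.
open_tags = re.findall(r'<\s*(x:\w+)[^>]*(?<!/)>', broken)
repaired = close_open_tags(broken)
```

With this sample, open_tags holds x:root, x:tag2 and x:tag3, and repaired ends with the two missing closing tags. Like any regex approach, this can be fooled by a ">" inside an attribute value, a comment or a CDATA section.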

Have a good one! :)


Use sgmllib, which ships with the Python 2 standard library. Input 1:

string1 = '''
   <root xmlns:x='www.test.com'>
       <x:tag1 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue" />
       <x:tag2 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue">
          someValue
       </x:tag2>
       <x:tag3> someValue
    '''

import re
import sgmllib
sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
starts = re.findall(sgmllib.tagfind, string1)
print starts

Output 1:

['root', 'xmlns:x', 'www.test.com', 'x:tag1', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'x:tag2', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'someValue', 'x:tag2', 'x:tag3', 'someValue']

Or, input 2:

starts1 = re.finditer(sgmllib.tagfind, string1)
for x in starts1:
    print x.start(), x.end(), x.group(0)

Output 2:

5 9 root
10 17 xmlns:x
19 31 www.test.com
42 48 x:tag1
49 60 x:anyAttrib
62 70 anyValue
72 83 x:anyAttrib
85 93 anyValue
95 106 x:anyAttrib
108 116 anyValue
129 135 x:tag2
136 147 x:anyAttrib
149 157 anyValue
159 170 x:anyAttrib
172 180 anyValue
182 193 x:anyAttrib
195 203 anyValue
216 225 someValue
235 241 x:tag2
251 257 x:tag3
259 268 someValue

Or use ElementTree, which also ships with the Python standard library: http://docs.python.org/2/library/xml.etree.elementtree.html
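
Note that ElementTree is a strict parser, so it will reject the partial XML rather than repair it. One way it might still help is as a check after a repair step. The following is a minimal sketch; the helper name is_well_formed and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if ElementTree can parse text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The x: prefix must be declared, as in string1 above, or parsing
# fails with an unbound-prefix error.
partial = "<root xmlns:x='www.test.com'><x:tag3> someValue"
fixed = partial + "</x:tag3></root>"

partial_ok = is_well_formed(partial)
fixed_ok = is_well_formed(fixed)
```

Here partial_ok is False and fixed_ok is True, so the check confirms a repair but does not perform one.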
