Python XML reconstruction using regex

Question

I have this partial XML

   string = ''' 
   <x:root>
       <x:tag1 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue" />
       <x:tag2 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue">
          someValue
       </x:tag2>
       <x:tag3> someValue
    '''

Now I would like to "stupidly" repair it. My idea is to use a regex to collect all of the start elements and end elements, check which closing elements are missing, and simply add them, without getting into too much detail, of course. What I've come up with so far is below (and it does not work):

import re
starts = re.compile('(?<=<)x:\w+(?=>)|(?<=<)x:\w+(?! .+ />)')
print(starts.findall(string))

What I expect is the list x:root, x:tag2, x:tag3.

I've been googling and trying a lot but could not find an answer. The only thing I get from this expression is x:root, x:tag1, x:tag3.

Please help

Thanks

You do realize that welbog.homeip.net/glue/53/XML-is-not-regular, right?
– abarnert, Commented Oct 25, 2012 at 23:41

unutbu · Accepted Answer · 2012-10-25 21:56:48Z

1

BeautifulSoup might be able to repair it:

import BeautifulSoup

content = ''' 
<x:root>
   <x:tag1 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue" />
   <x:tag2 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue">
      someValue
   </x:tag2>
   <x:tag3> someValue
'''

soup = BeautifulSoup.BeautifulStoneSoup(content)
print(soup.prettify())

yields

<x:root>
 <x:tag1 x:anyattrib="anyValue" x:anyattrib="anyValue" x:anyattrib="anyValue">
  <x:tag2 x:anyattrib="anyValue" x:anyattrib="anyValue" x:anyattrib="anyValue">
   someValue
  </x:tag2>
  <x:tag3>
   someValue
  </x:tag3>
 </x:tag1>
</x:root>


1 Comment

devdc Over a year ago

Hey, and thanks. My first issue is that I can't download any modules onto the machine I work on. My second issue is that, when using BS, tag1 gets opened and only later gets closed. I need to leave tag1 as it was. Any thoughts?

devdc · Accepted Answer · 2012-11-24 09:56:30Z

0

Thanks to alexis for helping me.

The correct expression is:

re.findall(r'<\s*(x:\w+)[^>]*(?<!/)>', string)

This expression tells the two cases apart:

an opening tag such as <tag> or <tag attrib1="value"> is matched

a self-closing tag such as <tag attrib1="value" attrib2="value" attribN="value"/> is skipped
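
A minimal sketch of how such an expression could drive the "stupid" repair described in the question, written with the `x:` prefix from the question's sample. The helper `append_missing_closers` and the second pattern `ANY_TAG` are hypothetical names introduced here for illustration. The sketch assumes that closing tags already present are correctly nested and that attribute values never contain `>`; it is not a general XML repair tool.

```python
import re

# Shortened version of the question's sample: x:root and x:tag3 are never closed.
broken = '''
<x:root>
    <x:tag1 x:anyAttrib="anyValue" />
    <x:tag2 x:anyAttrib="anyValue">
        someValue
    </x:tag2>
    <x:tag3> someValue
'''

# Opening tags that are not self-closing.
OPEN_TAG = re.compile(r'<\s*(x:\w+)[^>]*(?<!/)>')

# Assumption: the same idea extended with an optional "/" so that
# closing tags are captured too, in document order.
ANY_TAG = re.compile(r'<\s*(/?)\s*(x:\w+)[^>]*(?<!/)>')


def append_missing_closers(text):
    """Hypothetical helper: append a closing tag for every tag still open."""
    open_tags = []
    for match in ANY_TAG.finditer(text):
        is_closing = match.group(1) == '/'
        name = match.group(2)
        if not is_closing:
            open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            # A closer that matches the innermost open tag; others are ignored.
            open_tags.pop()
    # Close the innermost tag first.
    closers = ''.join('</%s>\n' % name for name in reversed(open_tags))
    return text.rstrip() + '\n' + closers


print(OPEN_TAG.findall(broken))
print(append_missing_closers(broken))
```

Self-closing tags such as x:tag1 never reach the stack, so they are left exactly as they were in the input.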

I tried some of the built-in Python parsers with no luck, and also BeautifulSoup, which unfortunately did not fix the XML the way I expected it to.

Have a good one! :)


namit · Accepted Answer · 2012-11-24 10:48:24Z

Use sgmllib, which comes with the default Python 2 installation.

input1

string1 = '''
   <root xmlns:x='www.test.com'>
       <x:tag1 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue" />
       <x:tag2 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue">
          someValue
       </x:tag2>
       <x:tag3> someValue
    '''

import re
import sgmllib
sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
starts = re.findall(sgmllib.tagfind, string1)
print starts

output1

['root', 'xmlns:x', 'www.test.com', 'x:tag1', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'x:tag2', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'x:anyAttrib', 'anyValue', 'someValue', 'x:tag2', 'x:tag3', 'someValue']

or input2

starts1 = re.finditer(sgmllib.tagfind, string1)
for x in starts1:
    print x.start(), x.end(), x.group(0)

output2:

5 9 root
10 17 xmlns:x
19 31 www.test.com
42 48 x:tag1
49 60 x:anyAttrib
62 70 anyValue
72 83 x:anyAttrib
85 93 anyValue
95 106 x:anyAttrib
108 116 anyValue
129 135 x:tag2
136 147 x:anyAttrib
149 157 anyValue
159 170 x:anyAttrib
172 180 anyValue
182 193 x:anyAttrib
195 203 anyValue
216 225 someValue
235 241 x:tag2
251 257 x:tag3
259 268 someValue
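
sgmllib is not available in Python 3, but the pattern used above only needs the `re` module. A minimal sketch, assuming Python 3 and a shortened sample string; the names `NAME` and `sample` are introduced here for illustration. As in the output above, the pattern picks up every name-like token (attribute names, values, and text), not only tag names.

```python
import re

# Same pattern the answer assigns to sgmllib.tagfind.
NAME = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*')

# Shortened sample, assumed for illustration.
sample = "<root xmlns:x='www.test.com'><x:tag3> someValue"

# All name-like tokens, in order of appearance.
names = NAME.findall(sample)
print(names)

# The same tokens with their start and end offsets.
for match in NAME.finditer(sample):
    print(match.start(), match.end(), match.group(0))
```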

Or use ElementTree, which also comes with the default Python installation: http://docs.python.org/2/library/xml.etree.elementtree.html
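
ElementTree is a strict parser, so on its own it rejects truncated XML instead of repairing it. One way it could still help is as a check that a repair produced well-formed output. A minimal sketch, where `is_well_formed` is a hypothetical helper and the sample drops the `x:` prefix, because a prefix without a matching xmlns declaration is itself a parse error:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Hypothetical helper: True if ElementTree can parse the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Truncated document: tag3 and root are never closed.
truncated = '<root><tag1 a="1" /><tag2>someValue</tag2><tag3> someValue'

# The same document with the two missing closing tags appended.
repaired = truncated + '</tag3></root>'

print(is_well_formed(truncated))
print(is_well_formed(repaired))
```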
