Parse error with same name nested tags using python xml to json

Question

I have run into a small setback. I have a large XML file in the following format:

<doc id="1">Some text</doc>
<doc id="2">more text</doc>

I'm using the following Python script to convert it to JSON:

from sys import stdout

import xmltodict
import gzip
import json

count = 0
xmlSrc = 'text.xml.gz'
jsDest = 'js/cd.js'

def parseNode(_, node):
    global count
    count += 1
    stdout.write("\r%d" % count)

    jsonNode = json.dumps(node)
    f.write(jsonNode + '\n')
    return True

f = open(jsDest, 'w')

xmltodict.parse(gzip.open(xmlSrc), item_depth=2, item_callback=parseNode)

f.close()

stdout.write("\n") # move the cursor to the next line

Is it possible to detect the closing </doc> tag, break there, and then continue converting? I've looked at other Stack Overflow questions, but none helped. How do you parse nested XML tags with Python?

Hi Anzel, my tags are <doc id="12" url="example.com" title="Anarchism"> Anarchism .... </doc> <doc id="123" url="example2" title="Laptop"> Laptop .... </doc> inside one large XML file that I wish to parse or break up into JSON to import into Mongo
– user2650420, Commented Nov 9, 2014 at 18:25
I want to extract <doc> and get a format that is MongoDB-friendly: { doc:[ { id:307, url:'en.wikipedia.org/wiki?curid=307', title:'Abraham Lincoln' },
– user2650420, Commented Nov 9, 2014 at 18:33

Anzel · Accepted Answer · 2014-11-09 19:06:14Z

As your <doc> tag isn't itself nested, you can iterate over the document, build each object manually, and dump the result to JSON. Here is an example:

import xml.etree.ElementTree as ET
import json

s= '''
<root>
    <abc>
        <doc id="12" url="example.com" title="Anarchism"> Anarchism .... </doc>
    </abc>
    <doc id="123" url="example2" title="Laptop"> Laptop .... </doc>
    <def>
        <doc id="3">Final text</doc>
    </def>
</root>
'''

tree = ET.fromstring(s)
j = []
# iterfind walks the element tree and yields every <doc> element, at any depth
for node in tree.iterfind('.//doc'):
    # manually build the dict from the doc attributes and text
    attrib = {}
    attrib.update(node.attrib)
    attrib.update({'text': node.text})
    d = {'doc': [ attrib ] }
    j.append(d)

json.dumps(j)
'[{"doc": [{"url": "example.com", "text": " Anarchism .... ", "id": "12", "title": "Anarchism"}]}, {"doc": [{"url": "example2", "text": " Laptop .... ", "id": "123", "title": "Laptop"}]}, {"doc": [{"text": "Final text", "id": "3"}]}]'

# to write to a json file
with open('yourjsonfile', 'w') as f:
    f.write(json.dumps(j))
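
Since the question mentions a large file, the example above may need adapting, because it holds the whole document and the whole result list in memory. Below is a minimal sketch of a streaming variant using ET.iterparse, which writes one JSON object per line as each </doc> closing tag is read. It assumes the file is well-formed XML with a single root element; the function name docs_to_json_lines and the flat record layout (attributes plus a 'text' key, without the 'doc' wrapper) are illustrative choices, not something from the question.

```python
import io
import json
import xml.etree.ElementTree as ET


def docs_to_json_lines(xml_stream, out_stream):
    """Write one JSON object per <doc> element; return how many were written.

    Hypothetical helper: xml_stream is a binary file-like object holding
    well-formed XML, out_stream is a text file-like object.
    """
    count = 0
    # An "end" event fires when an element's closing tag has been parsed,
    # so elem.text and elem.attrib are complete at that point.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "doc":
            continue
        record = dict(elem.attrib)
        record["text"] = elem.text
        out_stream.write(json.dumps(record) + "\n")
        count += 1
        # Free the element's text and attributes; the emptied element
        # itself stays attached to its parent.
        elem.clear()
    return count


# Small in-memory demonstration using two of the sample documents above.
sample = b"""<root>
    <abc>
        <doc id="12" url="example.com" title="Anarchism"> Anarchism .... </doc>
    </abc>
    <doc id="123" url="example2" title="Laptop"> Laptop .... </doc>
</root>"""

out = io.StringIO()
n = docs_to_json_lines(io.BytesIO(sample), out)
lines = out.getvalue().splitlines()
```

For the compressed file from the question, the object returned by gzip.open('text.xml.gz', 'rb') could be passed as xml_stream, with an ordinary file opened for writing as out_stream. One object per line is also the layout the question's own script produces.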
