I've hit a small setback. I have a large XML file in the following format:

<doc id="1">Some text</doc>
<doc id="2">more text</doc>

I'm using the following Python script to convert it into JSON format:

from sys import stdout

import xmltodict
import gzip
import json

count = 0
xmlSrc = 'text.xml.gz'
jsDest = 'js/cd.js'

def parseNode(_, node):
    global count
    count += 1
    stdout.write("\r%d" % count)

    jsonNode = json.dumps(node)
    f.write(jsonNode + '\n')
    return True

f = open(jsDest, 'w')

xmltodict.parse(gzip.open(xmlSrc), item_depth=2, item_callback=parseNode)

f.close()

stdout.write("\n") # move the cursor to the next line

Is it possible to detect the closing </doc> tag, break there, and then continue converting? I've looked at other Stack Overflow questions, but none of them helped. How do you parse nested XML tags with Python?

  • Are your <doc> tags flat and not nested? Commented Nov 9, 2014 at 18:20
  • Hi Anzel, my tags are <doc id="12" url="example.com" title="Anarchism"> Anarchism .... </doc> <doc id="123" url="example2" title="Laptop"> Laptop .... </doc> inside one large XML file that I wish to parse or break up into JSON to import into Mongo. Commented Nov 9, 2014 at 18:25
  • OK, do you want to extract <doc> elements or EXCLUDE them? Commented Nov 9, 2014 at 18:26
  • I want to extract <doc> and get a format that's MongoDB-friendly: { doc:[ { id:307, url:'en.wikipedia.org/wiki?curid=307', title:'Abraham Lincoln' }, Commented Nov 9, 2014 at 18:33
  • thomasfrank.se/xml_to_json.html Commented Nov 9, 2014 at 18:33

1 Answer


Since your <doc> tags aren't nested inside each other, you can iterate over the document, build a dict for each element manually, and dump the result to JSON. Here is an example:

import xml.etree.ElementTree as ET
import json

s = '''
<root>
    <abc>
        <doc id="12" url="example.com" title="Anarchism"> Anarchism .... </doc>
    </abc>
    <doc id="123" url="example2" title="Laptop"> Laptop .... </doc>
    <def>
        <doc id="3">Final text</doc>
    </def>
</root>
'''

tree = ET.fromstring(s)
j = []
# iterfind walks the element tree and yields every doc element at any depth
for node in tree.iterfind('.//doc'):
    # manually build the dict with doc attribute and text
    attrib = {}
    attrib.update(node.attrib)
    attrib.update({'text': node.text})
    d = {'doc': [ attrib ] }
    j.append(d)

print(json.dumps(j))
# Example output (key order may differ between Python versions):
# [{"doc": [{"url": "example.com", "text": " Anarchism .... ", "id": "12", "title": "Anarchism"}]}, {"doc": [{"url": "example2", "text": " Laptop .... ", "id": "123", "title": "Laptop"}]}, {"doc": [{"text": "Final text", "id": "3"}]}]

# to write to a json file
with open('yourjsonfile', 'w') as f:
    f.write(json.dumps(j))
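
If the real file is too large to load into memory with ET.fromstring, a streaming variant may work better. The following is only a minimal sketch. It assumes that the <doc> elements sit inside a single root element and are not nested in each other. The function name docs_to_json_lines is hypothetical, and writing one JSON object per line is a choice made for this sketch, not something the question requires.

```python
import io
import json
import xml.etree.ElementTree as ET


def docs_to_json_lines(xml_stream, out_stream):
    """Write one JSON object per <doc> element and return the number written.

    Assumption: xml_stream is a binary file-like object holding well-formed
    XML with a single root element.
    """
    count = 0
    # iterparse yields each element once its closing tag has been read
    for _event, node in ET.iterparse(xml_stream, events=('end',)):
        if node.tag != 'doc':
            continue
        record = dict(node.attrib)   # copy the attributes (id, url, title, ...)
        record['text'] = node.text   # add the element's text content
        out_stream.write(json.dumps(record) + '\n')
        count += 1
        # Release the element's text and attributes; the emptied element
        # itself stays attached to the root.
        node.clear()
    return count


# Small in-memory demonstration using the sample from the question,
# wrapped in a hypothetical <root> element.
sample = b'<root><doc id="1">Some text</doc><doc id="2">more text</doc></root>'
out = io.StringIO()
n = docs_to_json_lines(io.BytesIO(sample), out)
print(out.getvalue())
```

For the gzipped file from the question, the same function could be called with gzip.open('text.xml.gz', 'rb') as the input stream and open('js/cd.js', 'w') as the output stream.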