Parsing different XML files with Python

Question

I have two kinds of XML files: word files and a topic file.

I need to parse the word files based on the topic file. The files are shown below.

File 1: topic

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.topic" 
xmlns:nite="http://nite.sourceforge.net/">
<topic nite:id="ES2002a.topic.vkaraisk.1" other_description="introduction of participants and their roles">
      <nite:pointer role="scenario_topic_type"  href="default-topics.xml#id(top.4)"/>
      <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>
      <nite:child href="ES2002a.D.words.xml#id(ES2002a.D.words0)..id(ES2002a.D.words3)"/>
      <nite:child

File 2 word (ES2002a.B.words.xml)

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.B.words" xmlns:nite="http://nite.sourceforge.net/">
   <w nite:id="ES2002a.B.words0" starttime="50.42" endtime="50.99">Okay</w>
   <w nite:id="ES2002a.B.words1" starttime="50.99" endtime="50.99" punc="true">.</w>
   <w nite:id="ES2002a.B.words2" starttime="53.56" endtime="53.96">Right</w>
   <w nite:id="ES2002a.B.words3" starttime="53.96" endtime="53.96" punc="true">.</w>
   <vocalsound nite:id="ES2002a.B.words4" starttime="55.415" endtime="55.415" type="other"/>
   <w nite:id="ES2002a.B.words5" starttime="55.98" endtime="56.53">Um</w>

File 2 word (ES2002a.D.words.xml)

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.D.words" xmlns:nite="http://nite.sourceforge.net/">
   <w nite:id="ES2002a.D.words0" starttime="67.21" endtime="67.45">Mm-hmm</w>
   <w nite:id="ES2002a.D.words1" starttime="67.45" endtime="67.45" punc="true">.</w>
   <w nite:id="ES2002a.D.words2" starttime="74.89" endtime="75.24">Great</w>
   <w nite:id="ES2002a.D.words3" starttime="75.24" endtime="75.24" punc="true">.</w>
   <w nite:id="ES2002a.D.words4" starttime="82.08" endtime="82.25">And</w>
   <w nite:id="ES2002a.D.words5" starttime="82.25" endtime="82.43">I&#39;m</w>

There are multiple word files that need to be parsed based on the topic file.

  <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>

Here the topic file says to take words 0 to 5 from the file ES2002a.B.words.xml.
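The href format described above can be split into a file name and a first and last id with a regular expression. This is only a minimal sketch: `HREF_RE` and `parse_href` are hypothetical names, and treating the `..id(...)` part as optional is an assumption based on the `default-topics.xml#id(top.4)` pointer in the topic file.

```python
import re

# Matches hrefs such as
# "ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)".
# The "..id(...)" part is optional (assumption: an href may name one element).
HREF_RE = re.compile(
    r'^(?P<file>[^#]+)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<stop>[^)]+)\))?$'
)


def parse_href(href):
    """Split an href into (file name, first id, last id)."""
    match = HREF_RE.match(href)
    if match is None:
        raise ValueError(f"unrecognised href: {href!r}")
    start = match.group("start")
    # A single-element href has no stop id, so reuse the start id.
    stop = match.group("stop") or start
    return match.group("file"), start, stop


print(parse_href("ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"))
# -> ('ES2002a.B.words.xml', 'ES2002a.B.words0', 'ES2002a.B.words5')
```

Keeping the full ids rather than just the trailing numbers avoids having to rebuild them when looking up elements in the word file.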

The desired output is: Okay . Right . Um Mm-hmm . Great .

I have parsed the topic file, although the code is clunky:

import numpy as np
import pandas as pd
from lxml import etree
tree = etree.parse("./ES2013a.topic.xml") 
root = tree.getroot() 
childA = []
elementT = []
ElementA = []
for child in root:
    elementT.append(str(child.tag))
    ElementA.append(str(child.attrib))
    childA.append(str(child.attrib))
    for element in child:
        elementT.append(str(element.tag))
        #childA.append(child.attrib)
        ElementA.append(str(element.attrib))
        childA.append(str(child.attrib))
        for sub in element:
            #print('***', child.attrib , ':' , element.tag, ':' , element.attrib, '***')
            #childA.append(child.attrib)
            elementT.append(str(sub.tag))
            ElementA.append(str(sub.attrib))
            childA.append(str(child.attrib))

df = pd.DataFrame()
df['c'] = np.array (childA)
df['t'] = np.array(ElementA)
df['a'] = np.array(elementT)

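# Pull the word file name, first/last word index, tag name and topic number
# out of the stringified tags and attributes. Literal dots are escaped, and
# (topic|pointer|child) is an alternation; [topic] would match a single letter.
file = df['t'].str.extract(r'([A-Z][A-Z].*words\.xml)#')
start = df['t'].str.extract(r'words([0-9]+)')
stop = df['t'].str.extract(r'.*words([0-9]+)')
tags = df['a'].str.extract(r'(topic|pointer|child)$')
rootTopic = df['c'].str.extract(r'ES2013a\.topic\.rdhillon\.(\d+)')
df['f'] = file
df['start'] = start
df['stop'] = stop
# tags holds the element name: topic, pointer or child
df['tags'] = tags
df['topicID'] = rootTopic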

df = df.iloc[:,3:]

I am thinking of getting a list of the word files used and then iterating over each word file based on the start and stop conditions.
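The iteration idea above might look roughly like the sketch below. It uses the standard-library `xml.etree.ElementTree` rather than lxml, `words_between` is a hypothetical helper, and the inline `SAMPLE` string stands in for ES2002a.B.words.xml (closed by hand so that it parses). It assumes the word elements are direct children of the root, as in the excerpts shown.

```python
import xml.etree.ElementTree as ET

# Namespaced form of the nite:id attribute.
NITE_ID = "{http://nite.sourceforge.net/}id"


def words_between(root, start_id, stop_id):
    """Return the text of child elements from start_id to stop_id, inclusive.

    Elements without text (for example <vocalsound/>) are skipped.
    """
    collected = []
    inside = False
    for element in root:
        if element.get(NITE_ID) == start_id:
            inside = True
        if inside and element.text and element.text.strip():
            collected.append(element.text.strip())
        if inside and element.get(NITE_ID) == stop_id:
            break
    return collected


# Inline stand-in for ES2002a.B.words.xml.
SAMPLE = """<nite:root nite:id="ES2002a.B.words" xmlns:nite="http://nite.sourceforge.net/">
   <w nite:id="ES2002a.B.words0" starttime="50.42" endtime="50.99">Okay</w>
   <w nite:id="ES2002a.B.words1" starttime="50.99" endtime="50.99" punc="true">.</w>
   <w nite:id="ES2002a.B.words2" starttime="53.56" endtime="53.96">Right</w>
   <w nite:id="ES2002a.B.words3" starttime="53.96" endtime="53.96" punc="true">.</w>
   <vocalsound nite:id="ES2002a.B.words4" starttime="55.415" endtime="55.415" type="other"/>
   <w nite:id="ES2002a.B.words5" starttime="55.98" endtime="56.53">Um</w>
</nite:root>"""

root = ET.fromstring(SAMPLE)
print(" ".join(words_between(root, "ES2002a.B.words0", "ES2002a.B.words5")))
# -> Okay . Right . Um
```

For real files, `ET.parse(file_name).getroot()` would replace `ET.fromstring(SAMPLE)`, and the call would be repeated for each `nite:child` entry in the topic file, in order.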

The question is unclear. You have tagged the question with "beautifulsoup", "lxml", and "elementtree", but you have not shown us any code. What have you tried?
– mzjn, Commented May 29, 2018 at 6:50

Sijan Bhandari · Accepted Answer · 2018-06-01 03:52:03Z


I once used jxmlease to parse XML. It simply converts an XML string into a Python dictionary.

topic.xml

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.topic" 
xmlns:nite="http://nite.sourceforge.net/">
<topic nite:id="ES2002a.topic.vkaraisk.1" other_description="introduction of participants and their roles">
      <nite:pointer role="scenario_topic_type"  href="default-topics.xml#id(top.4)"/>
      <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>
      <nite:child href="ES2002a.D.words.xml#id(ES2002a.D.words0)..id(ES2002a.D.words3)"/>
</topic>
</nite:root>




import jxmlease

with open('topic.xml') as topic:
    topic_content = topic.read()

root = jxmlease.parse(topic_content)
first_word_selection = root['nite:root']['topic']['nite:child'][0].get_xml_attr("href")

print(first_word_selection)
# output: ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)
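One possible reason for a "list indices must be integers" error with this approach is a topic file that contains more than one topic element: dictionary-style parsers typically return a list for a repeated element and a single node otherwise. The sketch below shows one way to handle both shapes. Plain dicts and lists stand in for the parsed structure, `as_list` and `child_entries` are hypothetical helpers, and whether jxmlease nodes pass the `isinstance(node, list)` check is an assumption to verify.

```python
def as_list(node):
    """Return list nodes as a plain list; wrap any other node in a list."""
    return list(node) if isinstance(node, list) else [node]


def child_entries(parsed_root):
    """Collect every nite:child entry across one or many topic elements."""
    entries = []
    for topic in as_list(parsed_root["topic"]):
        entries.extend(as_list(topic["nite:child"]))
    return entries


# Stand-ins for parsed data: one topic with two children, then two topics.
one_topic = {"topic": {"nite:child": ["href-1", "href-2"]}}
many_topics = {"topic": [{"nite:child": ["href-1"]}, {"nite:child": "href-2"}]}

print(child_entries(one_topic))
print(child_entries(many_topics))
```

Both calls yield the same flat list of entries, so the code that follows does not need to know how many topics the file had.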

edited Jun 1, 2018 at 3:52

answered May 30, 2018 at 1:32


4 Comments

Pythonuser Over a year ago

Hi Sijan. I just tried running that code and I got the following error: TypeError: list indices must be integers or slices, not str

Sijan Bhandari Over a year ago

Hi @Pythonuser, I have added my topic.xml content above. Can you run that once? It is working fine at my end.

Pythonuser Over a year ago

Hi @Sijan Bhandari, I ran the code using that XML content and got the following error: KeyError: 'nite:root'

Sijan Bhandari Over a year ago

You have installed jxmlease, right? You can try printing the root variable to check whether it is parsing correctly.
