
I have two kinds of XML files: word files and a topic file.

I need to extract words from the word files based on the ranges listed in the topic file. The files look like this:

File 1: topic

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.topic" 
xmlns:nite="http://nite.sourceforge.net/">
<topic nite:id="ES2002a.topic.vkaraisk.1" other_description="introduction of participants and their roles">
      <nite:pointer role="scenario_topic_type"  href="default-topics.xml#id(top.4)"/>
      <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>
      <nite:child href="ES2002a.D.words.xml#id(ES2002a.D.words0)..id(ES2002a.D.words3)"/>
      <nite:child 

File 2 word (ES2002a.B.words.xml)

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.B.words" xmlns:nite="http://nite.sourceforge.net/">
   <w nite:id="ES2002a.B.words0" starttime="50.42" endtime="50.99">Okay</w>
   <w nite:id="ES2002a.B.words1" starttime="50.99" endtime="50.99" punc="true">.</w>
   <w nite:id="ES2002a.B.words2" starttime="53.56" endtime="53.96">Right</w>
   <w nite:id="ES2002a.B.words3" starttime="53.96" endtime="53.96" punc="true">.</w>
   <vocalsound nite:id="ES2002a.B.words4" starttime="55.415" endtime="55.415" type="other"/>
   <w nite:id="ES2002a.B.words5" starttime="55.98" endtime="56.53">Um</w>

File 2 word (ES2002a.D.words.xml)

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.D.words" xmlns:nite="http://nite.sourceforge.net/">
   <w nite:id="ES2002a.D.words0" starttime="67.21" endtime="67.45">Mm-hmm</w>
   <w nite:id="ES2002a.D.words1" starttime="67.45" endtime="67.45" punc="true">.</w>
   <w nite:id="ES2002a.D.words2" starttime="74.89" endtime="75.24">Great</w>
   <w nite:id="ES2002a.D.words3" starttime="75.24" endtime="75.24" punc="true">.</w>
   <w nite:id="ES2002a.D.words4" starttime="82.08" endtime="82.25">And</w>
   <w nite:id="ES2002a.D.words5" starttime="82.25" endtime="82.43">I&#39;m</w>

There are multiple word files that need to be parsed based on the topic file.

  <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>

Here the topic file says to take the elements words0 through words5 from the file ES2002a.B.words.xml.

The desired output is: Okay . Right . Um Mm-hmm . Great .
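Each href can be split into a file name, a start id and an end id before any word file is opened. A minimal sketch of that step follows; `HREF_RE` and `parse_href` are hypothetical names, and treating an href with a single `id(...)` (like the `nite:pointer` one) as a one-element range is an assumption, not something the files above confirm.

```python
import re

# Matches "file.xml#id(START)..id(END)"; the "..id(END)" part is optional.
HREF_RE = re.compile(
    r'^(?P<file>[^#]+)#id\((?P<start>[^)]+)\)'
    r'(?:\.\.id\((?P<end>[^)]+)\))?$'
)

def parse_href(href):
    """Return (file_name, start_id, end_id) for a NITE-style href string."""
    match = HREF_RE.match(href)
    if match is None:
        raise ValueError('unrecognised href: %r' % href)
    start = match.group('start')
    # Assumption: a lone id(...) means the range is that single element.
    end = match.group('end') or start
    return match.group('file'), start, end

print(parse_href(
    'ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)'
))
```

Working with the full ids rather than only the trailing numbers avoids having to rebuild the id string later when matching against the word file.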

I have parsed the topic file, although the code is clunky:

import numpy as np
import pandas as pd
from lxml import etree
tree = etree.parse("./ES2013a.topic.xml") 
root = tree.getroot() 
childA = []
elementT = []
ElementA = []
for child in root:
    elementT.append(str(child.tag))
    ElementA.append(str(child.attrib))
    childA.append(str(child.attrib))
    for element in child:
        elementT.append(str(element.tag))
        #childA.append(child.attrib)
        ElementA.append(str(element.attrib))
        childA.append(str(child.attrib))
        for sub in element:
            #print('***', child.attrib , ':' , element.tag, ':' , element.attrib, '***')
            #childA.append(child.attrib)
            elementT.append(str(sub.tag))
            ElementA.append(str(sub.attrib))
            childA.append(str(child.attrib))

df = pd.DataFrame()
df['c'] = np.array (childA)
df['t'] = np.array(ElementA)
df['a'] = np.array(elementT)

# Pull the word file name and the first/last word index out of each href
file = df['t'].str.extract(r'([A-Z]{2}[^#]*words\.xml)#', expand=False)
start = df['t'].str.extract(r'words(\d+)', expand=False)
stop = df['t'].str.extract(r'.*words(\d+)', expand=False)
# Element kind: topic, pointer or child (the tag may carry a namespace prefix)
tags = df['a'].str.extract(r'(topic|pointer|child)$', expand=False)
rootTopic = df['c'].str.extract(r'ES2013a\.topic\.rdhillon\.(\d+)', expand=False)
df['f'] = file
df['start'] = start
df['stop'] = stop
df['tags'] = tags
df['topicID'] = rootTopic

df = df.iloc[:,3:]

I am thinking of getting a list of the word files used and then iterating over each word file based on the start and stop conditions.
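The iteration over a word file could look roughly like the sketch below. It uses the standard library's `xml.etree.ElementTree` and an inline copy of the ES2002a.B.words.xml sample so it runs on its own; `words_in_range` is a hypothetical helper name, and skipping elements without text (such as `vocalsound`) is an assumption based on the desired output above.

```python
import xml.etree.ElementTree as ET

# nite:id is a namespaced attribute, so ElementTree exposes it in {uri}name form.
NITE_ID = '{http://nite.sourceforge.net/}id'

B_WORDS = """<nite:root nite:id="ES2002a.B.words" xmlns:nite="http://nite.sourceforge.net/">
   <w nite:id="ES2002a.B.words0" starttime="50.42" endtime="50.99">Okay</w>
   <w nite:id="ES2002a.B.words1" starttime="50.99" endtime="50.99" punc="true">.</w>
   <w nite:id="ES2002a.B.words2" starttime="53.56" endtime="53.96">Right</w>
   <w nite:id="ES2002a.B.words3" starttime="53.96" endtime="53.96" punc="true">.</w>
   <vocalsound nite:id="ES2002a.B.words4" starttime="55.415" endtime="55.415" type="other"/>
   <w nite:id="ES2002a.B.words5" starttime="55.98" endtime="56.53">Um</w>
</nite:root>"""

def words_in_range(root, start_id, end_id):
    """Collect element text from start_id to end_id (inclusive), in document order."""
    words = []
    inside = False
    for elem in root:
        elem_id = elem.get(NITE_ID)
        if elem_id == start_id:
            inside = True
        # Elements without text (e.g. vocalsound) are skipped.
        if inside and elem.text:
            words.append(elem.text)
        if inside and elem_id == end_id:
            break
    return words

root = ET.fromstring(B_WORDS)
print(' '.join(words_in_range(root, 'ES2002a.B.words0', 'ES2002a.B.words5')))
```

For files on disk, `ET.parse(file_name).getroot()` would replace `ET.fromstring(...)`. Repeating the call for each `nite:child` href of a topic and joining the results gives the text for that topic.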

2 Comments

  • The question is unclear. You have tagged the question with "beautifulsoup", "lxml", and "elementree", but you have not shown us any code. What have you tried? Commented May 29, 2018 at 6:50
  • @mzjn I will add what I have tried to the main post. Commented May 30, 2018 at 0:44

1 Answer

I once used jxmlease to parse XML. It converts an XML string into a Python dictionary-like object.

topic.xml

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<nite:root nite:id="ES2002a.topic" 
xmlns:nite="http://nite.sourceforge.net/">
<topic nite:id="ES2002a.topic.vkaraisk.1" other_description="introduction of participants and their roles">
      <nite:pointer role="scenario_topic_type"  href="default-topics.xml#id(top.4)"/>
      <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>
      <nite:child href="ES2002a.D.words.xml#id(ES2002a.D.words0)..id(ES2002a.D.words3)"/>
</topic>
</nite:root>




import jxmlease

with open('topic.xml') as topic:
    topic_content = topic.read()

root = jxmlease.parse(topic_content)
first_word_selection = root['nite:root']['topic']['nite:child'][0].get_xml_attr("href")

print(first_word_selection)
output : ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)
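
For comparison, the same href values can be read without a third-party package. The sketch below uses the standard library's `xml.etree.ElementTree` on an inline copy of the topic.xml content; `child_hrefs` is a hypothetical helper name, and grouping the hrefs by the topic's `nite:id` is one possible design, not part of the original answer.

```python
import xml.etree.ElementTree as ET

NITE_NS = 'http://nite.sourceforge.net/'

TOPIC_XML = """<nite:root nite:id="ES2002a.topic" xmlns:nite="http://nite.sourceforge.net/">
<topic nite:id="ES2002a.topic.vkaraisk.1" other_description="introduction of participants and their roles">
      <nite:pointer role="scenario_topic_type"  href="default-topics.xml#id(top.4)"/>
      <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>
      <nite:child href="ES2002a.D.words.xml#id(ES2002a.D.words0)..id(ES2002a.D.words3)"/>
</topic>
</nite:root>"""

def child_hrefs(root):
    """Map each topic's nite:id to the href values of its nite:child elements."""
    result = {}
    # <topic> has no prefix and no default namespace, so its tag is plain 'topic'.
    for topic in root.iter('topic'):
        topic_id = topic.get('{%s}id' % NITE_NS)
        children = topic.findall('{%s}child' % NITE_NS)
        result[topic_id] = [child.get('href') for child in children]
    return result

hrefs = child_hrefs(ET.fromstring(TOPIC_XML))
for topic_id, values in hrefs.items():
    print(topic_id, values)
```

The `nite:pointer` element is left out on purpose, since only `nite:child` elements point into the word files.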

4 Comments

Hi Sijan. I just tried running that code and got the following error: TypeError: list indices must be integers or slices, not str
Hi @Pythonuser, I have added my topic.xml content above. Can you run that once? It is working fine at my end.
Hi @Sijan Bhandari, I ran the code using that XML content and got the following error: KeyError: 'nite:root'
You have installed jxmlease, right? You can try printing the root variable to check whether it is being parsed correctly.
