
EDIT: XML File

<corpus lang="en" id="subtask2-heterographic">
  <text id="het_1">
    <word id="het_1_1">'</word>
    <word id="het_1_2">'</word>
    <word id="het_1_3">I</word>
    <word id="het_1_4">'</word>
    <word id="het_1_5">m</word>
    <word id="het_1_6">halfway</word>
    <word id="het_1_7">up</word>
    <word id="het_1_8">a</word>
    <word id="het_1_9">mountain</word>
    <word id="het_1_10">,</word>
    <word id="het_1_11">'</word>
    <word id="het_1_12">'</word>
    <word id="het_1_13">Tom</word>
    <word id="het_1_14">alleged</word>
    <word id="het_1_15">.</word>
  </text>
  <text id="het_2">
    <word id="het_2_1">I</word>
    <word id="het_2_2">'</word>
    <word id="het_2_3">d</word>
    <word id="het_2_4">like</word>
    <word id="het_2_5">to</word>
    <word id="het_2_6">be</word>
    <word id="het_2_7">a</word>
    <word id="het_2_8">Chinese</word>
    <word id="het_2_9">laborer</word>
    <word id="het_2_10">,</word>
    <word id="het_2_11">said</word>
    <word id="het_2_12">Tom</word>
    <word id="het_2_13">coolly</word>
    <word id="het_2_14">.</word>
  </text>
</corpus>

I am parsing an XML file in Python and extracting the text I want. Each text tag represents one sentence in the XML file, and I want to store each sentence as a separate element of a list.

import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='subtask2-heterographic-test.xml')
root = tree.getroot()

lst = []

for elem in root:
    for w in elem:
        lst.append(w.text)

>> ["'", "'", 'I', "'", 'm', 'halfway', 'up', 'a', 'mountain', ',', "'", "'", 'Tom', 'alleged', '.', 'I', "'", 'd', 'like', 'to', 'be', 'a', 'Chinese', 'laborer', ',', 'said', 'Tom', 'coolly', '.', 'Dentists', ...]

This returns every word in the XML file as one flat list, without separating the sentences. How can I change it so that each sentence goes into the list as its own list of strings?

Final expected output:

>> [["'", "'", 'I', "'", 'm', 'halfway', 'up', 'a', 'mountain', ',', "'", "'", 'Tom', 'alleged', '.'], ['I', "'", 'd', 'like', 'to', 'be', 'a', 'Chinese', 'laborer', ',', 'said', 'Tom', 'coolly', '.'], ['Dentists', ...]]
  • post your xml fragment at start Commented Oct 18, 2017 at 18:02
  • @RomanPerekhrest Sorry. Edited. Commented Oct 18, 2017 at 18:05
  • ok, we got the input. Now, post the final expected output please Commented Oct 18, 2017 at 18:06
  • @RomanPerekhrest Done. Thanks Commented Oct 18, 2017 at 18:09

1 Answer

You have to create a new list for each sentence:

sentences = []
for elem in root:
    sentence = []
    for w in elem:
        sentence.append(w.text)
    sentences.append(sentence)
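
The same grouping can also be written as a nested list comprehension. The following is a minimal, self-contained sketch, not the asker's actual script: the `SAMPLE` string is a shortened, hypothetical stand-in for the corpus file, and it assumes every sentence is a `text` element directly under the root, with its tokens as direct `word` children.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the corpus file from the question.
SAMPLE = """<corpus lang="en" id="subtask2-heterographic">
  <text id="het_1">
    <word id="het_1_1">Tom</word>
    <word id="het_1_2">alleged</word>
    <word id="het_1_3">.</word>
  </text>
  <text id="het_2">
    <word id="het_2_1">said</word>
    <word id="het_2_2">Tom</word>
    <word id="het_2_3">coolly</word>
    <word id="het_2_4">.</word>
  </text>
</corpus>"""

root = ET.fromstring(SAMPLE)

# Outer loop: one inner list per <text> element (one sentence).
# Inner loop: one string per <word> child of that sentence.
sentences = [
    [word.text for word in text.findall("word")]
    for text in root.findall("text")
]

print(sentences)
# [['Tom', 'alleged', '.'], ['said', 'Tom', 'coolly', '.']]
```

Using `findall` with explicit tag names, rather than iterating over all children, skips any other element types that might appear in the real file. To read from disk instead, `ET.parse('subtask2-heterographic-test.xml').getroot()` should work in place of `ET.fromstring(SAMPLE)`.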