I have an XML file with a structure like this:

<?xml version="1.0"?>
<entries>
  <entry accente="one">
    <list>Word</list>
    <sense class="0" value="B">
      <definition>
        <MorfDef>s. m.</MorfDef>
        <RegDef>This <i>text</i> have i node.</RegDef>
        <ItalMarker>Text.</ItalMarker>
      </definition>
    </sense>
   </entry>
  <entry accente="two">
    <list>B  n-1</list>
    <sense class="0" value="B">
      <definition>
        <MorfDef>s. m.</MorfDef>
        <RegDef>This text doesn't have i atribute.</RegDef>
        <ItalMarker>Word.</ItalMarker>
      </definition>
    </sense>
   </entry>
</entries>

I want to wrap each word in the RegDef element in a new node, so that the result looks like this:

<?xml version="1.0"?>
<entries>
  <entry accente="one">
    <list>Word</list>
    <sense class="0" value="B">
      <definition>
        <MorfDef>s. m.</MorfDef>
        <RegDef><w lemma="A1">This</w> <i><w lemma="A2">text</w></i> <w lemma="A3">have</w> <w lemma="A4">i</w> <w lemma="A5">node</w> <w lemma="A6">.</w></RegDef>
        <ItalMarker>Text.</ItalMarker>
      </definition>
    </sense>
   </entry>
  <entry accente="two">
    <list>B  n-1</list>
    <sense class="0" value="B">
      <definition>
        <MorfDef>s. m.</MorfDef>
        <RegDef><w lemma="A7">This</w> <w lemma="A8">text</w> <w lemma="A9">doesn't</w> <w lemma="A10">have</w> <w lemma="A11">i</w> <w lemma="A12">atribute</w> <w lemma="A13">.</w></RegDef>
        <ItalMarker>Word.</ItalMarker>
      </definition>
    </sense>
   </entry>
</entries>

If the RegDef node has an <i> child node, I want to read the text from the <i> node as well and write a <w> node for each word. I tried the following DOM code:

Element rootElement = document.getDocumentElement();
Element element = document.createElement("w");
rootElement.appendChild(element);

but it appends the new node at the end of the root node. How can I write a node for each word in the RegDef tag and then add an attribute to that node? Thank you.

  • I added a solution based on a fragment of your file. I hope you can use it as a starting point. Commented Jun 14, 2014 at 3:05

1 Answer

You selected the root element of your file, <entries>. If you call appendChild on that node, the new node is appended as the last child of the root, which is the expected behaviour.
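To illustrate the point, here is a minimal sketch that calls appendChild on a RegDef element instead of on the root. The class and method names (AppendChildDemo, appendToFirstRegDef) are made up for this example, and the lemma value "A1" is only a placeholder. It still does not wrap any words; it only shows that the new node lands under whichever node you call appendChild on, and that setAttribute adds the attribute.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class AppendChildDemo {
    /**
     * Appends an empty w element to the first RegDef of the document and
     * returns the name of the element that now contains it.
     */
    static String appendToFirstRegDef(Document document) {
        // Select the target element instead of the document root.
        Element regDef = (Element) document.getElementsByTagName("RegDef").item(0);

        Element w = document.createElement("w");
        w.setAttribute("lemma", "A1"); // placeholder attribute value

        // appendChild always adds the node as the LAST child of its receiver.
        regDef.appendChild(w);
        return w.getParentNode().getNodeName();
    }
}
```

The empty w ends up after the existing text of RegDef, which is why wrapping each word needs the extra steps described below.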

You actually want to wrap words inside the RegDef node with the w element, which is not a task as simple as the three lines of code you included in your example.

For that you will need to:

  1. Select that node. There are many ways to do this in the DOM: document.getElementsByTagName("RegDef") will give you a NodeList containing all of them. You can also use XPath.
  2. For each RegDef you will need to select all its descendant text nodes. If you use XPath, an expression such as .//text() evaluated in the context of each RegDef will give you a list of those nodes. Each one may contain one or more "words", or only whitespace and newlines.
  3. You can extract the words by splitting on spaces, punctuation marks or any other characters that can delimit a word. There are several tools for that in Java, including regular expressions.
  4. Finally, when you have isolated each individual "word" and eliminated the nodes you want to ignore, you can create a w element for each one, create a new text node containing the word, and append the text node as a child of that element. You will also have to set the attributes.
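Steps 1 to 3 could be sketched as follows. The class name RegDefTokens and its methods are invented for this example, and the regular expression encodes one assumption about what a "word" is: a run of letters, digits and apostrophes is one token (so "doesn't" stays whole), and any other non-whitespace character such as "." is a token of its own. Adjust it to your own definition.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RegDefTokens {
    // Assumed token definition: word characters grouped together,
    // every other non-whitespace character on its own.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}']+|[^\\s\\p{L}\\p{N}']");

    /** Step 3: split one piece of text into tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Steps 1 and 2: find every RegDef, then every text node below it. */
    static List<String> regDefTokens(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        List<String> result = new ArrayList<>();
        NodeList regDefs = doc.getElementsByTagName("RegDef");
        for (int i = 0; i < regDefs.getLength(); i++) {
            // .//text() is evaluated relative to the current RegDef element.
            NodeList texts = (NodeList) xpath.evaluate(
                    ".//text()", regDefs.item(i), XPathConstants.NODESET);
            for (int j = 0; j < texts.getLength(); j++) {
                result.addAll(tokenize(texts.item(j).getNodeValue()));
            }
        }
        return result;
    }
}
```

This only lists the tokens; it does not modify the document. Whitespace-only text nodes simply produce no tokens.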

Perhaps you should use a smaller XML file to focus on your specific problem, and later adapt it to your real world example. You could start with something like this:

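import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Reduced test document: only the RegDef elements matter here.
String xml = "<nodes>\n"
        + "    <RegDef>This <i>text</i> have i node.</RegDef>\n"
        + "    <RegDef>This text doesn't have i atribute.</RegDef>\n"
        + "</nodes>";
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbf.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(xml)));

// Replace each RegDef with a copy whose words are wrapped in w elements.
NodeList regDefNodes = document.getElementsByTagName("RegDef");
int size = regDefNodes.getLength();
for(int i = 0; i < size; i++) {
    Element regDef = (Element)regDefNodes.item(i);
    // wrapWordsInContents is the method you still have to write.
    Element newRegDef = wrapWordsInContents(regDef, document);
    Element parent = (Element)regDef.getParentNode();
    parent.replaceChild(newRegDef, regDef);
}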

Now you can use the steps above as a guide and write the wrapWordsInContents(Element e, Document doc) method.
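One possible shape for that method is sketched below; treat it as a starting point under stated assumptions rather than the definitive version. The enclosing class WordWrapper, the helper methods and the token pattern are invented for this example, and the lemma values A1, A2, ... are produced by a simple running counter because that is what the expected output in the question suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class WordWrapper {
    // Assumed token definition: letters, digits and apostrophes form a word;
    // any other non-whitespace character is a token of its own.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}']+|[^\\s\\p{L}\\p{N}']");

    // Running number behind the lemma values A1, A2, ... (assumed scheme).
    private int counter = 0;

    /** Returns a deep copy of e in which every token is wrapped in a w element. */
    Element wrapWordsInContents(Element e, Document doc) {
        Element copy = (Element) e.cloneNode(true);
        wrapTextNodes(copy, doc);
        return copy;
    }

    private void wrapTextNodes(Node parent, Document doc) {
        // Snapshot the children first, because the loop replaces some of them.
        List<Node> children = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
        }
        for (Node child : children) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                parent.replaceChild(wrapText(child.getNodeValue(), doc), child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                wrapTextNodes(child, doc); // descend into nested elements such as i
            }
        }
    }

    private DocumentFragment wrapText(String text, Document doc) {
        DocumentFragment fragment = doc.createDocumentFragment();
        Matcher m = TOKEN.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                // Keep the original whitespace between tokens.
                fragment.appendChild(doc.createTextNode(text.substring(last, m.start())));
            }
            Element w = doc.createElement("w");
            w.setAttribute("lemma", "A" + (++counter));
            w.appendChild(doc.createTextNode(m.group()));
            fragment.appendChild(w);
            last = m.end();
        }
        if (last < text.length()) {
            fragment.appendChild(doc.createTextNode(text.substring(last)));
        }
        return fragment;
    }
}
```

To keep the numbering continuous across the whole document, create one WordWrapper before the loop and call wrapper.wrapWordsInContents(regDef, document) inside it. Note that this sketch preserves the original whitespace, so no space is inserted between the last word and the final "." as in your expected output.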

UPDATE: You asked about tokenizing the content in a followup question, which contains the wrapWordsInContents(Element e, Document doc) method. After you call that method from the code above and serialize the resulting document with:

Transformer t = TransformerFactory.newInstance().newTransformer();
t.transform(new DOMSource(document), new StreamResult(System.out));

you will have a result similar to the one you expect. See your followup question: Modify the text content of XML tag
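If you would rather capture the serialized result in a String (for instance to compare it with the expected output in a test), a small helper along these lines may be useful. The class name XmlSerializer and the method toXml are invented for this example; the Transformer, DOMSource, StreamResult and OutputKeys classes are the standard javax.xml.transform ones.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;

final class XmlSerializer {
    /** Serializes a DOM node (or a whole document) to a String. */
    static String toXml(Node node) throws TransformerException {
        // An identity transformer copies the source to the result unchanged.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> line so the output starts at the root element.
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }
}
```

Calling XmlSerializer.toXml(document) after the replacement loop gives you the modified XML as text.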
