2

I need to extract the text content of every <Item Name="CanonicalSmiles"> element from the following XML file (only part of it is shown).

I tried getElementsByTagName("Item").item(12).getTextContent(), but the index of that item differs between <DocSum>s (i.e. it is not always 12).

How do I do this?

  <?xml version="1.0"?>
    <!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 29 October 2004//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
    <eSummaryResult>
    <DocSum>
        <Id>53359352</Id>
        <Item Name="CID" Type="Integer">53359352</Item>
        <Item Name="SourceNameList" Type="List"></Item>
        <Item Name="SourceIDList" Type="List"></Item>
        <Item Name="SourceCategoryList" Type="List">
            <Item Name="string" Type="String">Journal Publishers</Item>
        </Item>
        <Item Name="CreateDate" Type="Date">2011/09/19 00:00</Item>
        <Item Name="SynonymList" Type="List"></Item>
        <Item Name="MeSHHeadingList" Type="List"></Item>
        <Item Name="MeSHTermList" Type="List"></Item>
        <Item Name="PharmActionList" Type="List"></Item>
        <Item Name="CommentList" Type="List"></Item>
        <Item Name="IUPACName" Type="String">2-hydroxy-6-[2-(4-hydroxyphenyl)-2-oxoethyl]benzoic acid</Item>
        <Item Name="CanonicalSmiles" Type="String">C1=CC(=C(C(=C1)O)C(=O)O)CC(=O)C2=CC=C(C=C2)O</Item>
        <Item Name="RotatableBondCount" Type="Integer">4</Item>
        <Item Name="MolecularFormula" Type="String">C15H12O5</Item>
        <Item Name="MolecularWeight" Type="String">272.252780</Item>
        <Item Name="TotalFormalCharge" Type="Integer">0</Item>
        <Item Name="XLogP" Type="String"></Item>
        <Item Name="HydrogenBondDonorCount" Type="Integer">3</Item>
        <Item Name="HydrogenBondAcceptorCount" Type="Integer">5</Item>
        <Item Name="Complexity" Type="String">359.000000</Item>
        <Item Name="HeavyAtomCount" Type="Integer">20</Item>
        <Item Name="AtomChiralCount" Type="Integer">0</Item>
        <Item Name="AtomChiralDefCount" Type="Integer">0</Item>
        <Item Name="AtomChiralUndefCount" Type="Integer">0</Item>
        <Item Name="BondChiralCount" Type="Integer">0</Item>
        <Item Name="BondChiralDefCount" Type="Integer">0</Item>
        <Item Name="BondChiralUndefCount" Type="Integer">0</Item>
        <Item Name="IsotopeAtomCount" Type="Integer">0</Item>
        <Item Name="CovalentUnitCount" Type="Integer">1</Item>
        <Item Name="TautomerCount" Type="Integer">67</Item>
        <Item Name="SubstanceIDList" Type="List"></Item>
        <Item Name="TPSA" Type="String">94.8</Item>
        <Item Name="AssaySourceNameList" Type="List"></Item>
        <Item Name="MinAC" Type="String"></Item>
        <Item Name="MaxAC" Type="String"></Item>
        <Item Name="MinTC" Type="String"></Item>
        <Item Name="MaxTC" Type="String"></Item>
        <Item Name="ActiveAidCount" Type="Integer">0</Item>
        <Item Name="InactiveAidCount" Type="Integer">0</Item>
        <Item Name="TotalAidCount" Type="Integer">0</Item>
        <Item Name="InChIKey" Type="String">YIGHIFUVVSYMFG-UHFFFAOYSA-N</Item>
        <Item Name="InChI" Type="String">InChI=1S/C15H12O5/c16-11-6-4-9(5-7-11)13(18)8-10-2-1-3-12(17)14(10)15(19)20/h1-7,16-17H,8H2,(H,19,20)</Item>
    </DocSum>

    <DocSum>
        <Id>53346823</Id>
        <Item Name="CID" Type="Integer">53346823</Item>
        <Item Name="SourceNameList" Type="List"></Item>
        <Item Name="SourceIDList" Type="List"></Item>
        <Item Name="SourceCategoryList" Type="List">
            <Item Name="string" Type="String">Biological Properties</Item>
        </Item>
        <Item Name="CreateDate" Type="Date">2011/09/01 00:00</Item>
        <Item Name="SynonymList" Type="List">
            <Item Name="string" Type="String">HMS2478O14</Item>
        </Item>
        <Item Name="MeSHHeadingList" Type="List"></Item>
        <Item Name="MeSHTermList" Type="List"></Item>
        <Item Name="PharmActionList" Type="List"></Item>
        <Item Name="CommentList" Type="List">
            <Item Name="string" Type="String">Asinex Ltd.:BAS 02768155</Item>
        </Item>
        <Item Name="IUPACName" Type="String">ethyl 3-amino-3-(1,3-benzodioxol-5-yl)propanoate chloride</Item>
        <Item Name="CanonicalSmiles" Type="String">CCOC(=O)CC(C1=CC2=C(C=C1)OCO2)N.[Cl-]</Item>
        <Item Name="RotatableBondCount" Type="Integer">5</Item>
        <Item Name="MolecularFormula" Type="String">C12H15ClNO4-</Item>
        <Item Name="MolecularWeight" Type="String">272.704800</Item>
        <Item Name="TotalFormalCharge" Type="Integer">-1</Item>
        <Item Name="XLogP" Type="String"></Item>
        <Item Name="HydrogenBondDonorCount" Type="Integer">1</Item>
        <Item Name="HydrogenBondAcceptorCount" Type="Integer">6</Item>
        <Item Name="Complexity" Type="String">271.000000</Item>
        <Item Name="HeavyAtomCount" Type="Integer">18</Item>
        <Item Name="AtomChiralCount" Type="Integer">1</Item>
        <Item Name="AtomChiralDefCount" Type="Integer">0</Item>
        <Item Name="AtomChiralUndefCount" Type="Integer">1</Item>
        <Item Name="BondChiralCount" Type="Integer">0</Item>
        <Item Name="BondChiralDefCount" Type="Integer">0</Item>
        <Item Name="BondChiralUndefCount" Type="Integer">0</Item>
        <Item Name="IsotopeAtomCount" Type="Integer">0</Item>
        <Item Name="CovalentUnitCount" Type="Integer">2</Item>
        <Item Name="TautomerCount" Type="Integer">1</Item>
        <Item Name="SubstanceIDList" Type="List"></Item>
        <Item Name="TPSA" Type="String">70.8</Item>
        <Item Name="AssaySourceNameList" Type="List"></Item>
        <Item Name="MinAC" Type="String"></Item>
        <Item Name="MaxAC" Type="String"></Item>
        <Item Name="MinTC" Type="String"></Item>
        <Item Name="MaxTC" Type="String"></Item>
        <Item Name="ActiveAidCount" Type="Integer">0</Item>
        <Item Name="InactiveAidCount" Type="Integer">0</Item>
        <Item Name="TotalAidCount" Type="Integer">0</Item>
        <Item Name="InChIKey" Type="String">NKQHQIJWIYNEIX-UHFFFAOYSA-M</Item>
        <Item Name="InChI" Type="String">InChI=1S/C12H15NO4.ClH/c1-2-15-12(14)6-9(13)8-3-4-10-11(5-8)17-7-16-10;/h3-5,9H,2,6-7,13H2,1H3;1H/p-1</Item>
    </DocSum>
4
  • 1
    You might want to reduce the sample XML and convert it to a canonical form. You will get better attention that way. Commented Sep 21, 2011 at 21:50
  • I would use an xpath compatible dom parser such as dom4j API. Commented Sep 21, 2011 at 21:51
  • 1
    @Usman: Standard Java does XPath, so why need dom4j? Commented Sep 21, 2011 at 22:02
  • @HovercraftFullOfEels I like dom4j :-D Commented Sep 21, 2011 at 22:04

4 Answers 4

3

For what you're doing, XPath is likely easier than DOM. See this Java XPath tutorial.
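
For comparison, a DOM-only version is also possible. The index in the question shifts because getElementsByTagName("Item") also returns the nested <Item Name="string"> entries inside the list items, and their number varies per <DocSum>. Checking each item's Name attribute avoids the fixed index. This is only a minimal sketch; the class name SmilesDom and its two helper methods are hypothetical, and the DOCTYPE line is assumed to be absent from the input string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SmilesDom {

    // Hypothetical helper: parse an XML string into a DOM Document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Collect the text of every Item whose Name attribute is "CanonicalSmiles",
    // regardless of where it sits in the list of Item elements.
    public static List<String> canonicalSmiles(Document doc) {
        NodeList items = doc.getElementsByTagName("Item");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            // getElementsByTagName only returns elements, so the cast is safe
            Element item = (Element) items.item(i);
            if ("CanonicalSmiles".equals(item.getAttribute("Name"))) {
                result.add(item.getTextContent());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<eSummaryResult><DocSum>"
                + "<Item Name=\"SynonymList\" Type=\"List\">"
                + "<Item Name=\"string\" Type=\"String\">HMS2478O14</Item></Item>"
                + "<Item Name=\"CanonicalSmiles\" Type=\"String\">CCOC(=O)CC(C1=CC2=C(C=C1)OCO2)N.[Cl-]</Item>"
                + "</DocSum></eSummaryResult>";
        canonicalSmiles(parse(xml)).forEach(System.out::println);
    }
}
```

The XPath version needs less code, but this shows that the lookup by Name does not depend on XPath itself.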


1 Comment

Thanks for the link to tutorial
1
    import javax.xml.xpath.*;
    import org.w3c.dom.NodeList;

    // yourdom is the already-parsed org.w3c.dom.Document
    XPathFactory xpf = XPathFactory.newInstance();
    XPath xp = xpf.newXPath();
    XPathExpression xe = xp.compile("//DocSum/Item[@Name='CanonicalSmiles']/text()");
    NodeList nodes = (NodeList) xe.evaluate(yourdom, XPathConstants.NODESET);
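
A fuller, self-contained version of the snippet above might look like the following sketch. The class name SmilesXPath, the method canonicalSmiles, and the shortened sample XML are illustrative assumptions. The sample leaves out the DOCTYPE line, since with it a parser may try to fetch the external DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SmilesXPath {

    // Returns the text of every <Item Name="CanonicalSmiles"> that is a
    // direct child of a <DocSum>, in document order.
    public static List<String> canonicalSmiles(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate(
                "//DocSum/Item[@Name='CanonicalSmiles']", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() on the element gives the same text as text() would select
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the eSummary response in the question
        String xml = "<eSummaryResult>"
                + "<DocSum><Id>53359352</Id>"
                + "<Item Name=\"CanonicalSmiles\" Type=\"String\">C1=CC(=C(C(=C1)O)C(=O)O)CC(=O)C2=CC=C(C=C2)O</Item>"
                + "</DocSum>"
                + "<DocSum><Id>53346823</Id>"
                + "<Item Name=\"CanonicalSmiles\" Type=\"String\">CCOC(=O)CC(C1=CC2=C(C=C1)OCO2)N.[Cl-]</Item>"
                + "</DocSum>"
                + "</eSummaryResult>";
        canonicalSmiles(xml).forEach(System.out::println);
    }
}
```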

1 Comment

Thanks, it did the trick. I then printed the CanonicalSmiles values with for (int i=0;i<nodes.getLength();i++) { System.out.println(nodes.item(i).getNodeValue()); }
0

As others have pointed out, XPath is the standard way to go. If you're using a tool like jOOX, writing XPath is even simpler:

String text = $(document).xpath("//DocSum/Item[@Name='CanonicalSmiles']").text();

With jOOX, you don't need to use XPath, however. You could also use jOOX's jQuery-like API directly, for instance using filters:

String text = $(document).find("Item")
                         .filter(attr("Name", "CanonicalSmiles"))
                         .text();

Or by using CSS-style selectors:

String text = $(document).find("Item[Name='CanonicalSmiles']").text();

2 Comments

Interestingly enough I am/was working an API almost exactly like yours. - github.com/jjnguy/Jinq2XML
@jjnguy: Then stop, and contribute to jOOX! :-) jOOX would probably still be missing a LINQ-style API. I'll check out what you already have. Like your class names: Jode, Jocument ;-)
0

As I see it, the problem of the same element turning up at a different index each time the XML is read is still unanswered.

You cannot rely on an element's position in XML. You can't expect that the element read as no. 12 today will be no. 12 tomorrow. The only way to number your elements is to give them numbers explicitly.

<Item Name="TotalFormalCharge" Type="Integer">-1</Item>

will become:

<Item Name="TotalFormalCharge" Num="6" Type="Integer">-1</Item>

And you can get it by the attribute value.
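
The same kind of lookup also works with the Name attribute that is already in the file, so the source XML does not necessarily have to change. A minimal sketch, assuming each <DocSum> has one <Id> child and one <Item Name="CanonicalSmiles"> child; the class name SmilesById and its method are hypothetical:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SmilesById {

    // Maps each DocSum's Id to its CanonicalSmiles text, selecting the item
    // by attribute value rather than by position.
    public static Map<String, String> smilesById(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList docSums = (NodeList) xp.evaluate("//DocSum", doc, XPathConstants.NODESET);

        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < docSums.getLength(); i++) {
            Node docSum = docSums.item(i);
            // Relative expressions are evaluated against the current DocSum node
            String id = xp.evaluate("Id", docSum);
            String smiles = xp.evaluate("Item[@Name='CanonicalSmiles']", docSum);
            result.put(id, smiles);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<eSummaryResult>"
                + "<DocSum><Id>53346823</Id>"
                + "<Item Name=\"TotalFormalCharge\" Type=\"Integer\">-1</Item>"
                + "<Item Name=\"CanonicalSmiles\" Type=\"String\">CCOC(=O)CC(C1=CC2=C(C=C1)OCO2)N.[Cl-]</Item>"
                + "</DocSum></eSummaryResult>";
        smilesById(xml).forEach((id, smiles) -> System.out.println(id + " " + smiles));
    }
}
```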

2 Comments

This question has been answered. See the answer above. Anyway, it has been solved and implemented in the software 2 years ago.
1. You have more than one problem there: "But for different <DocSum>s item(i) is different (ie not 12 always!)" remained unanswered. I answered it. 2. The posts here are not for the question author only, but for other people as well; for somebody else it could be useful. There are also badges granted to people who usefully re-answer an old question.
