0

I have this XML file, and I need to extract the HTML from each "mono" element with the HTML tags preserved. I need to do this in Groovy.

Everything inside a "mono" element is HTML, including the divs.

Thank you in advance.

<dataset>
    <chapters>
        <chapter id="700" name="Immunology">
            <title>Immunology</title>   
            <monos>
                <mono id="382727">

                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <p>blah blah</p>
                    </div>

            </mono>
            </monos>
        </chapter>  
        <chapter id="701" name="hematology">
            <title>Inmuno Hematology</title>    
            <monos>
                <mono id="blah blah">
                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <div class="class1">blah blah</div>
                    </div>
                </mono>
            </monos>
        </chapter>
    </chapters>
</dataset>

I have tried:

import javax.xml.parsers.*

xml = new XmlParser().parse("languages.xml")

println("There are " + xml.chapters.chapter.size() + " Chapters")

for (int i = 0; i < xml.chapters.chapter.size(); i++) {
    def chapter = xml.chapters.chapter[i]
    def chapterName = chapter.'@name'
    println chapterName

    println("----  Monos List ----\n\n")

    for (int j = 0; j < chapter.monos.mono.size(); j++) {
        def mono = chapter.monos.mono[j]
        println("Mono Content: " + mono.toString())
    }

    println("---- End Monos List ----\n\n")
}

But I just get the following output:

There are 2 Chapters
Immunology
----  Monos List ----

Mono Content: mono[attributes={id=382727}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[p[attributes={}; value=[blah blah]]]]]]
---- End Monos List ----

hematology
----  Monos List ----

Mono Content: mono[attributes={id=blah blah}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[div[attributes={class=class1}; value=[blah blah]]]]]]
---- End Monos List ----

3
  • where's the html at? The mono tag? Commented Apr 30, 2012 at 16:51
  • I have tried the following code, but it gives me this output: There are 2 Chapters Immunology ---- Monos List ---- Mono Content: mono[attributes={id=382727}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[p[attributes={}; value=[blah blah]]]]]] ---- End Monos List ---- hematology ---- Monos List ---- Mono Content: mono[attributes={id=blah blah}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[div[attributes={class=class1}; value=[blah blah]]]]]] ---- End Monos List ---- Commented Apr 30, 2012 at 17:17
  • You're missing a closing chapters tag. Commented Apr 30, 2012 at 17:45

2 Answers

3
import groovy.xml.*

def src="""
<dataset>
    <chapters>
        <chapter id="700" name="Immunology">
            <title>Immunology</title>   
            <monos>
                <mono id="382727">

                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <p>blah blah</p>
                    </div>

            </mono>
            </monos>
        </chapter>  
        <chapter id="701" name="hematology">
            <title>Inmuno Hematology</title>    
            <monos>
                <mono id="blah blah">
                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <div class="class1">blah blah</div>
                    </div>
                </mono>
            </monos>
        </chapter>
    </chapters>
</dataset>
"""

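def parsed = new XmlSlurper().parseText(src)

// '**' walks the tree depth-first; keep only the <mono> elements
parsed.'**'.findAll { it.name() == 'mono' }.each { mono ->
    // Serialize each child of <mono> with its tags intact and no XML declaration
    mono.children().each { htmlElement ->
        println new StreamingMarkupBuilder().bind { out << htmlElement }.toString()
    }
}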

3 Comments

Larry Battle's answer was there first, but I'd still use StreamingMarkupBuilder instead of XmlUtil.serialize, to avoid the XML prolog.
Your answer is better than mine. I had a hard time finding out how to use StreamingMarkupBuilder with XmlSlurper, which is why I used XmlUtil.
@Larry: The pain of parsing XML in Java is such that I couldn't help it. Every time I do this in Groovy, it exorcises years of torture doing it in Java... Jaime: Also, if you have a Writer or another streaming interface, the result of bind() is a Writable, so you can use bind{...}.writeTo(myWriter)
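A minimal sketch of the writeTo() suggestion from the comment above, assuming you want all the children of one mono element collected into a single Writer. The shortened XML and the names writer and html are made up for illustration; a StringWriter stands in for whatever Writer you actually have.

```groovy
import groovy.xml.*

// Shortened, hypothetical sample document (not the full dataset from the question)
def src = '''
<monos>
    <mono id="1">
        <div><h1>blah blah</h1></div>
        <div><p>blah blah</p></div>
    </mono>
</monos>
'''

def parsed = new XmlSlurper().parseText(src)
def mono = parsed.mono[0]

// bind{...} returns a Writable, so each child can be streamed straight
// into an existing Writer instead of being converted to a String first
def writer = new StringWriter()
mono.children().each { child ->
    new StreamingMarkupBuilder().bind { out << child }.writeTo(writer)
}

def html = writer.toString()
println html
```

The markup arrives in the Writer with its tags and without the XML declaration, the same as with toString() in the answer.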
2

You can use XmlSlurper or XmlParser to parse XML content.

http://groovy.codehaus.org/Reading+XML+using+Groovy's+XmlSlurper
http://groovy.codehaus.org/Reading+XML+using+Groovy's+XmlParser

import groovy.xml.*
def RECORDS = '''
        <dataset>
        <chapters>
            <chapter id="700" name="Immunology">
                <title>Immunology</title>   
                <monos>
                    <mono id="382727">

                            <div>
                                <h1>blah blah</h1>
                            </div>
                            <div>
                                <p>blah blah</p>
                            </div>

                    </mono>
                </monos>
                </chapter>    
                <chapter id="701" name="hematology">
                    <title>Inmuno Hematology</title>    
                    <monos>
                        <mono id="blah blah">
                            <div>
                                <h1>blah blah</h1>
                            </div>
                            <div>
                                <div class="class1">blah blah</div>
                            </div>
                        </mono>
                    </monos>
                </chapter>
            </chapters>
        </dataset>    
  '''
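def records = new XmlSlurper().parseText(RECORDS)
// Walk the tree depth-first and keep the <mono> elements
def monos = records.depthFirst().findAll { it.name() == 'mono' }
// toString() returns only the concatenated text, with the tags stripped
assert monos[0].toString() == "blah blahblah blah"
// XmlUtil.serialize keeps the tags, but prepends an XML declaration
println XmlUtil.serialize(monos[0])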

Outputs:

<?xml version="1.0" encoding="UTF-8"?><mono id="382727">
  <div>
    <h1>blah blah</h1>
  </div>
  <div>
    <p>blah blah</p>
  </div>
</mono>

3 Comments

Sorry, almost identical answers, LOL
Thank you both very much for your answers! Is it possible to get only the text, without the <?xml version .... "UTF-8"?> header, while preserving the HTML tags?
loteq's answer does that by using StreamingMarkupBuilder instead of XmlUtil.
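If you would rather stay with XmlParser, as in the question, here is a rough sketch of one more way to keep the tags without the header. It swaps in XmlNodePrinter, which neither answer uses, so treat it as an alternative under that assumption. The shortened XML and the variable names are made up for illustration.

```groovy
import groovy.xml.*

// Shortened, hypothetical sample document (not the full dataset from the question)
def src = '''
<monos>
    <mono id="1">
        <div><h1>blah blah</h1></div>
        <div><p>blah blah</p></div>
    </mono>
</monos>
'''

def root = new XmlParser().parseText(src)

// XmlNodePrinter writes Node trees as markup and does not emit an XML declaration
def writer = new StringWriter()
def printer = new XmlNodePrinter(new PrintWriter(writer))
printer.preserveWhitespace = true   // keep text on the same line as its element

root.mono.each { mono ->
    // Print only the children, so the <mono> wrapper itself is left out
    mono.children().each { child -> printer.print(child) }
}

def html = writer.toString()
println html
```

Unlike StreamingMarkupBuilder, XmlNodePrinter indents nested elements, so the result is pretty-printed rather than written on one line.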
