How to extract HTML code from an XML file using Groovy

Question

I have this XML file, and I need to extract the HTML from each "mono" element with the HTML tags preserved. I need to do this in Groovy.

Everything inside a "mono" element is HTML, including the divs.

Thank you in advance.

<dataset>
    <chapters>
        <chapter id="700" name="Immunology">
            <title>Immunology</title>   
            <monos>
                <mono id="382727">

                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <p>blah blah</p>
                    </div>

            </mono>
            </monos>
        </chapter>  
        <chapter id="701" name="hematology">
            <title>Inmuno Hematology</title>    
            <monos>
                <mono id="blah blah">
                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <div class="class1">blah blah</div>
                    </div>
                </mono>
            </monos>
        </chapter>
    </chapters>
</dataset>

I have tried:

import javax.xml.parsers.*;

xml = new XmlParser().parse("languages.xml")

println("There are " +xml.chapters.chapter.size() +" Chapters")

for (int i = 0; i < xml.chapters.chapter.size(); i++) {

            def chapter = xml.chapters.chapter[i]
            def chapterName = chapter.'@name'
            println chapterName

            println("----  Monos List ----\n\n")


            for (int j = 0; j < chapter.monos.mono.size(); j++) {

                        def mono = chapter.monos.mono[j]
                        println("Mono Content: " + mono.toString());
            }

           println("---- End Monos List ----\n\n")

}

But I just get the following output:

There are 2 Chapters
Immunology
----  Monos List ----


Mono Content: mono[attributes={id=382727}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[p[attributes={}; value=[blah blah]]]]]]
---- End Monos List ----


hematology
----  Monos List ----


Mono Content: mono[attributes={id=blah blah}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[div[attributes={class=class1}; value=[blah blah]]]]]]
---- End Monos List ----

I have tried the following code, but it gives me this output: There are 2 Chapters Immunology ---- Monos List ---- Mono Content: mono[attributes={id=382727}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[p[attributes={}; value=[blah blah]]]]]] ---- End Monos List ---- hematology ---- Monos List ---- Mono Content: mono[attributes={id=blah blah}; value=[div[attributes={}; value=[h1[attributes={}; value=[blah blah]]]], div[attributes={}; value=[div[attributes={class=class1}; value=[blah blah]]]]]] ---- End Monos List ----
– Jaime Alvarez, Commented Apr 30, 2012 at 17:17

Luis Muñiz · Accepted Answer · 2012-04-30 17:57:39Z

3

import groovy.xml.*

def src="""
<dataset>
    <chapters>
        <chapter id="700" name="Immunology">
            <title>Immunology</title>   
            <monos>
                <mono id="382727">

                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <p>blah blah</p>
                    </div>

            </mono>
            </monos>
        </chapter>  
        <chapter id="701" name="hematology">
            <title>Inmuno Hematology</title>    
            <monos>
                <mono id="blah blah">
                    <div>
                        <h1>blah blah</h1>
                    </div>
                    <div>
                        <div class="class1">blah blah</div>
                    </div>
                </mono>
            </monos>
        </chapter>
    </chapters>
</dataset>
"""

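// Parse the XML; XmlSlurper returns a GPathResult for the root element
def parsed = new XmlSlurper().parseText(src)

// '**' is shorthand for depthFirst(): visit every node and keep the <mono> elements
parsed.'**'.findAll { it.name() == 'mono' }.each { mono ->
    // Serialize each child of <mono> back to markup; no XML declaration is emitted
    mono.children().each { htmlElement ->
        println new StreamingMarkupBuilder().bind { out << htmlElement }.toString()
    }
}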
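If the markup is needed per mono rather than printed element by element, the same approach can be sketched as follows. This is only an illustration: the sample XML is a trimmed-down version of the question's data, and `htmlById` is a hypothetical name for a map from each mono id to the joined markup of its children.

```groovy
import groovy.xml.*

// Trimmed-down sample shaped like the question's data (assumption: same element names)
def src = '''
<dataset>
  <chapters>
    <chapter id="700" name="Immunology">
      <monos>
        <mono id="382727">
          <div><h1>blah blah</h1></div>
          <div><p>blah blah</p></div>
        </mono>
      </monos>
    </chapter>
  </chapters>
</dataset>
'''

def parsed = new XmlSlurper().parseText(src)

// htmlById is a hypothetical name: mono id -> markup of all of its children
def htmlById = [:]
parsed.'**'.findAll { it.name() == 'mono' }.each { mono ->
    // Serialize every child element and join the pieces with newlines
    def html = mono.children().collect { child ->
        new StreamingMarkupBuilder().bind { out << child }.toString()
    }.join('\n')
    htmlById[mono.@id.text()] = html
}

htmlById.each { id, html -> println "mono ${id}:\n${html}\n" }
```

Keying by the `id` attribute assumes the ids are unique across the file; if they are not, a list of pairs would be safer than a map.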


3 Comments

Luis Muñiz Over a year ago

Larry Battle's answer was there first, but I'd still use StreamingMarkupBuilder instead of XmlUtil.serialize, to avoid the XML prolog.

Larry Battle Over a year ago

Your answer is better than mine. I had a hard time finding out how to use StreamingMarkupBuilder with XmlSlurper, which is why I used XmlUtil.

Luis Muñiz Over a year ago

@Larry: The pain of parsing XML in Java is such that I couldn't help it. Every time I do this in Groovy it exorcises years of torture doing it in Java... Jaime: Also, if you have a Writer or another streaming interface, the result of bind() is a Writable, so you can use bind{...}.writeTo(myWriter)
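The `writeTo` suggestion in the last comment might look like the sketch below. The one-line XML string is made up for illustration, and the `StringWriter` only stands in for whatever `java.io.Writer` you actually have (a file writer or a response writer, for example).

```groovy
import groovy.xml.*

// Made-up, minimal input: a single <mono> with two HTML children
def parsed = new XmlSlurper().parseText(
    '<mono id="1"><div><h1>blah blah</h1></div><div><p>blah blah</p></div></mono>')

// StringWriter is a stand-in for any java.io.Writer
def writer = new StringWriter()

parsed.children().each { child ->
    // bind() returns a Writable, so the markup is written straight to the writer
    // without building an intermediate String first
    new StreamingMarkupBuilder().bind { out << child }.writeTo(writer)
    writer.write(System.lineSeparator())
}

print writer
```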

Larry Battle · Answer · 2012-04-30 17:44:42Z

2

You can use XmlSlurper or XmlParser to parse XML content.

http://groovy.codehaus.org/Reading+XML+using+Groovy's+XmlSlurper
http://groovy.codehaus.org/Reading+XML+using+Groovy's+XmlParser

import groovy.xml.*
def RECORDS = '''
        <dataset>
        <chapters>
            <chapter id="700" name="Immunology">
                <title>Immunology</title>   
                <monos>
                    <mono id="382727">

                            <div>
                                <h1>blah blah</h1>
                            </div>
                            <div>
                                <p>blah blah</p>
                            </div>

                    </mono>
                </monos>
                </chapter>    
                <chapter id="701" name="hematology">
                    <title>Inmuno Hematology</title>    
                    <monos>
                        <mono id="blah blah">
                            <div>
                                <h1>blah blah</h1>
                            </div>
                            <div>
                                <div class="class1">blah blah</div>
                            </div>
                        </mono>
                    </monos>
                </chapter>
            </chapters>
        </dataset>    
  '''
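def records = new XmlSlurper().parseText(RECORDS)
// Walk the whole tree depth-first and keep only the <mono> elements
def monos = records.depthFirst().findAll { it.name() == 'mono' }
// toString() on a GPathResult returns only the concatenated text, without tags
assert monos[0].toString() == "blah blahblah blah"
// serialize() returns the markup as a String, including the XML declaration
println XmlUtil.serialize(monos[0])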

Outputs:

<?xml version="1.0" encoding="UTF-8"?><mono id="382727">
  <div>
    <h1>blah blah</h1>
  </div>
  <div>
    <p>blah blah</p>
  </div>
</mono>


3 Comments

Luis Muñiz Over a year ago

sorry, almost identical answers, LOL

Jaime Alvarez Over a year ago

Thanks a lot for your answers! Is it possible to get only the text, without the <?xml version .... "UTF-8"?> header, while preserving the HTML tags?

Larry Battle Over a year ago

loteq's answer does that by using StreamingMarkupBuilder instead of XmlUtil.
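For anyone who would rather stay with XmlParser, as in the question's original attempt, another way to avoid the XML declaration is XmlNodePrinter, which prints Node objects as indented markup. This is a sketch under assumptions, not something taken from either answer: the sample XML is shortened, the mono id `m1` and the `results` list are made up for illustration, and it assumes the children of each mono are all elements (no loose text directly inside `<mono>`).

```groovy
import groovy.xml.*

// Shortened sample shaped like the question's data; the id "m1" is made up
def src = '''
<dataset>
  <chapters>
    <chapter id="701" name="hematology">
      <monos>
        <mono id="m1">
          <div><h1>blah blah</h1></div>
          <div><div class="class1">blah blah</div></div>
        </mono>
      </monos>
    </chapter>
  </chapters>
</dataset>
'''

def xml = new XmlParser().parseText(src)

// results is a hypothetical list holding the printed markup of each mono
def results = []
xml.chapters.chapter.each { chapter ->
    println chapter.'@name'
    chapter.monos.mono.each { mono ->
        def sw = new StringWriter()
        def printer = new XmlNodePrinter(new PrintWriter(sw))
        // Print only the child nodes so the <mono> wrapper itself is left out
        mono.children().each { printer.print(it) }
        results << sw.toString()
        println sw
    }
}
```

XmlNodePrinter reformats the markup with its own indentation, so the result will not match the source whitespace exactly.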
