Parse HTML in Scala

Question

Task: an HTML parser in Scala. I'm pretty new to Scala.

So far, I have written a small parser in Scala to parse an arbitrary HTML document:

import scala.xml.Node

object Reader {
  def loadXML = {
    val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
    val parser = parserFactory.newSAXParser()
    val source = new org.xml.sax.InputSource("http://www.randomurl.com")
    val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
    val feed = adapter.loadXML(source, parser)
    feed
  }

  def proc(node: Node): String =
    node match {
      case <body>{ txt }</body> => "Partial content: " + txt
      case _ => "grmpf"
    }

  def main(args: Array[String]): Unit = {
    val content = Reader.loadXML
    Console.println(content)
    Console.println(proc(content))

  }
}

The problem is that proc does not work. Basically, I would like to get exactly the content of one node. Is there another way to achieve that without pattern matching?
Does feed in the loadXML function give me the right format for parsing, or is there a better way to achieve that? feed gives me back the root node, right?

Thanks in advance

Parsing HTML as XML is never a good idea. There are some nice Java libs for that purpose. Jsoup is one of them.
– Nikita Volkov, Commented Aug 22, 2012 at 21:39
@NikitaVolkov: That's why the asker is using the TagSoup parser, which gives you a nice SAX interface to non-XML HTML.
– Travis Brown, Commented Aug 22, 2012 at 21:55

Travis Brown · Accepted Answer · 2012-08-22 21:06:13Z

You're right: adapter.loadXML(source, parser) gives you the root node. The problem is that this root node probably isn't going to match the body case in your proc method, since the root of an HTML document is normally html. Even if the root node were body, it still wouldn't match unless the element had exactly one child, for example nothing but text.
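A minimal sketch of why the match fails, assuming the scala-xml library is on the classpath. The document string is a made-up, well-formed stand-in for what TagSoup would return, and WhyNoMatch, root and body are illustrative names only. It inspects the label of the root and the number of children of body, which are the two things the pattern in proc depends on:

```scala
import scala.xml.{Node, XML}

object WhyNoMatch {
  // Hypothetical stand-in for the parsed page (assumption, not the real URL).
  val root: Node = XML.loadString("<html><body>Hello <b>world</b></body></html>")

  // The body element, selected as a direct child of the root.
  val body: Node = (root \ "body").head

  def main(args: Array[String]): Unit = {
    // The root element is html, so a pattern that expects body cannot match it.
    println(root.label)
    // body holds a text node plus a b element, so a pattern that binds
    // a single child node would not match body here either.
    println(body.child.length)
  }
}
```

With mixed content like this, a pattern that binds one child variable has nothing it can match, which is why proc falls through to the default case.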

You probably want something more like this:

def proc(node: Node): String = (node \\ "body").text

Here \\ is a selector method that's roughly equivalent to XPath's //: it returns all the descendants of node named body. If you know that body is a child (as opposed to a deeper descendant) of the root node, which is probably the case for HTML, you can use \ instead of \\.
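The two selectors can be compared side by side in a small sketch, assuming scala-xml is available. The sample document and the names SelectorDemo, bodyText and bodyTextDirect are hypothetical and only stand in for the TagSoup output from the question:

```scala
import scala.xml.{Node, XML}

object SelectorDemo {
  // Hypothetical, well-formed stand-in for the parsed page (assumption).
  val doc: Node = XML.loadString(
    "<html><head><title>Demo</title></head>" +
    "<body><div>Hello <b>world</b></div><p>Bye</p></body></html>")

  // Descendant selector: finds body elements at any depth below the node.
  def bodyText(node: Node): String = (node \\ "body").text

  // Child selector: only looks at the direct children of the node.
  def bodyTextDirect(node: Node): String = (node \ "body").text

  def main(args: Array[String]): Unit = {
    println(bodyText(doc))
    println(bodyTextDirect(doc))
  }
}
```

Both calls return the same string for this document because body is a direct child of html. Note that text concatenates the text of all nested nodes, so markup such as the b element is dropped from the result.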

answered Aug 22, 2012 at 21:06

Travis Brown


2 Comments

whereismydipp Over a year ago

Thanks a lot, Travis :). May I ask an additional question: is there a way to get all the nodes in a hierarchy, like a tree? Do I have to do this recursively, or is there another way? I mean something like: html -head -body --div1 --div2. Thanks in advance.

Travis Brown Over a year ago

I'm not sure I understand the question. Node (or more specifically Elem) does give you a tree—it has a child method that returns its children, although it's usually easier to use the selectors to navigate the tree.
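If the goal is an indented outline like the one in the comment above, one possible approach is a short recursive walk over child. This is only a sketch under assumptions: scala-xml is on the classpath, the sample document is invented, and TreeOutline and outline are hypothetical names, not part of the library:

```scala
import scala.xml.{Elem, Node, XML}

object TreeOutline {
  // Hypothetical, well-formed stand-in for the parsed page (assumption).
  val doc: Node = XML.loadString(
    "<html><head><title>T</title></head>" +
    "<body><div>a</div><div>b</div></body></html>")

  // Returns one line per element, prefixed with one dash per nesting level.
  def outline(node: Node, depth: Int = 0): List[String] = node match {
    case e: Elem =>
      (("-" * depth) + e.label) :: e.child.toList.flatMap(c => outline(c, depth + 1))
    case _ =>
      Nil // skip text, comments and other non-element nodes
  }

  def main(args: Array[String]): Unit =
    outline(doc).foreach(println)
}
```

For the sample document this yields the lines html, -head, --title, -body, --div, --div. The recursion is needed only because the output shows nesting depth; for simply collecting elements by name, the selectors are usually shorter.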
