I am trying to load valid HTML for processing in Scala. Converting it to XML seems like a good starting point. The somewhat controversial scala.xml.Xhtml object in the Scala core library appears to have very nice code for doing that. Essentially, it should entail 'fixing up' tags that are valid in HTML but not in XML, and which therefore prevent the document from being valid XHTML, plus a little more. Here is the code from there:

def toXhtml(
    x: Node,
    pscope: NamespaceBinding = TopScope,
    sb: StringBuilder = new StringBuilder,
    stripComments: Boolean = false,
    decodeEntities: Boolean = false,
    preserveWhitespace: Boolean = false,
    minimizeTags: Boolean = true): Unit =
  {
    def decode(er: EntityRef) = XhtmlEntities.entMap.get(er.entityName) match {
      case Some(chr) if chr.toInt >= 128  => sb.append(chr)
      case _                              => er.buildString(sb)
    }
    def shortForm =
      minimizeTags &&
      (x.child == null || x.child.length == 0) &&
      (minimizableElements contains x.label)

    x match {
      case c: Comment                       => if (!stripComments) c buildString sb
      case er: EntityRef if decodeEntities  => decode(er)
      case x: SpecialNode                   => x buildString sb
      case g: Group                         =>
        g.nodes foreach { toXhtml(_, x.scope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags) }

      case _  =>
        sb.append('<')
        x.nameToString(sb)
        if (x.attributes ne null) x.attributes.buildString(sb)
        x.scope.buildString(sb, pscope)

        if (shortForm) sb.append(" />")
        else {
          sb.append('>')
          sequenceToXML(x.child, x.scope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags)
          sb.append("</")
          x.nameToString(sb)
          sb.append('>')
        }
    }
  }

What seems to take excessive perseverance is finding out how to use that function on an existing HTML document that has been read with scala.io.Source.fromFile. The meaning of the Node type is a bit elusive in the code base; put differently, I am unsure how to get from the string returned by scala.io.Source.fromFile to something that can be fed into the toXhtml function copied above.
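For reference, here is a minimal sketch of the plumbing in question. It assumes the input is already well-formed XML (every tag closed, attributes quoted) and that the scala-xml module is on the classpath. The names XhtmlExample and htmlToXhtml are made up for illustration. scala.xml.XML.loadString turns a string into a Node (an Elem at the root), and the one-argument Xhtml.toXhtml overload returns the serialized string. Real-world HTML that is not well-formed would make loadString throw a SAX parse exception, so this covers only the simple case.

```scala
import scala.xml.{Node, XML, Xhtml}

// Hypothetical helper object; names are illustrative, not part of scala.xml.
object XhtmlExample {
  // Parse a well-formed markup string into a Node, then serialize it as XHTML.
  // Assumption: `html` is already well-formed XML; otherwise loadString throws.
  def htmlToXhtml(html: String): String = {
    val root: Node = XML.loadString(html)
    Xhtml.toXhtml(root)
  }

  def main(args: Array[String]): Unit = {
    // In practice the string could come from scala.io.Source.fromFile(path).mkString.
    val html = "<html><body><p>Hello<br></br>world</p><div></div></body></html>"
    println(htmlToXhtml(html))
  }
}
```

With minimizeTags left at its default, empty elements listed in minimizableElements (such as br) are written in the short form, while other empty elements (such as div) keep an explicit closing tag.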

The scaladoc for this function doesn't seem to clarify much.

There's also another related library whose scaladoc just has a huge number of entries.

I'd be very happy if anyone could explain how a raw HTML string can be converted to 'clean' XHTML using this library, and walk through how to deduce that from the source code, as my Scala is evidently not that strong.

1 Answer

You might consider using jsoup for this since it excels at dealing with messy, real-world HTML. It can also scrub HTML based on a whitelist of allowed tags. An example:

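import org.jsoup.Jsoup
import org.jsoup.safety.Whitelist // newer jsoup releases rename this class to Safelist
import scala.collection.JavaConverters._ // explicit .asScala; the implicit JavaConversions is gone in Scala 2.13
import scala.io.Source

object JsoupExample extends App {
  // Fetch the raw page as a single string.
  val suspectHtml = Source.fromURL("http://en.wikipedia.org/wiki/Scala_(programming_language)").mkString
  // Keep only the basic set of allowed tags and attributes.
  val cleanHtml = Jsoup.clean(suspectHtml, Whitelist.basic)
  // Parse the cleaned markup into a jsoup Document.
  val doc = Jsoup.parse(cleanHtml)
  // Print the text of every paragraph.
  doc.select("p").asScala.foreach(node => println(node.text))
}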
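If the goal is still to end up with a scala.xml Node, one possible bridge is to let jsoup repair the markup and emit it in XML syntax, then hand that string to scala.xml. The following is only a sketch under assumptions: it needs both jsoup and scala-xml on the classpath, the input carries no doctype, and the names TagSoupToXhtml, toWellFormedXml and htmlToXhtml are made up for illustration. The xhtml escape mode is chosen so that jsoup writes only the entities an XML parser understands.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Entities}
import scala.xml.{Node, XML, Xhtml}

// Hypothetical helper object; names are illustrative only.
object TagSoupToXhtml {
  // Let jsoup repair the markup and serialize it with XML syntax rules.
  def toWellFormedXml(html: String): String = {
    val doc = Jsoup.parse(html)
    doc.outputSettings()
      .syntax(Document.OutputSettings.Syntax.xml) // close void tags such as <br>
      .escapeMode(Entities.EscapeMode.xhtml)      // avoid HTML-only named entities
      .prettyPrint(false)                         // do not add indentation whitespace
    doc.html()
  }

  // Feed the repaired markup to scala.xml and serialize it with Xhtml.toXhtml.
  def htmlToXhtml(html: String): String = {
    val root: Node = XML.loadString(toWellFormedXml(html))
    Xhtml.toXhtml(root)
  }

  def main(args: Array[String]): Unit =
    println(htmlToXhtml("<p>Hello<br>world<p>unclosed paragraph"))
}
```

This keeps jsoup responsible only for tolerant parsing, while the final serialization rules remain those of the short toXhtml function quoted in the question.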

1 Comment

Yep, I was just hoping there'd be a pure Scala solution, where I can quickly understand the code and change it if necessary. I may adopt the Scala function I mentioned, as it's so concise, and I feel I have little idea how liberal or idiosyncratic jsoup might be on inputs I haven't seen yet. The Scala code is very concise and easy to follow; only its input plumbing is a bit dodgy.
