I am developing a web app using Scala and the Lift framework. I have a record in the DB that contains the HTML perex of a page:

<b>Hi all, this is perex</b>

In one scenario I need to show this perex to the user, but without the HTML tags:

Hi all, this is perex

Is it possible to do this in Scala? I searched on Google, but without success.

Thanks for all replies.

3 Answers

If the string is valid XML then you can use:

scala.xml.XML.loadString("<b>Hi all, this is perex</b>").text

If it's not valid XML, then you can use scala.util.matching.Regex or an HTML parsing library like http://jsoup.org/
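For the regex route, a minimal sketch could look like the one below. The names StripTags and stripTags are made up for illustration, and the pattern is deliberately naive: it removes anything that looks like a tag, does not decode entities such as &amp;, and can be fooled by a literal ">" inside an attribute value. For anything beyond simple snippets a real HTML parser is probably the safer choice.

```scala
object StripTags {
  // Naive pattern: '<', then any run of characters other than '>', then '>'.
  private val Tag = "<[^>]*>".r

  // Hypothetical helper: removes everything that matches the tag pattern.
  def stripTags(html: String): String = Tag.replaceAllIn(html, "")
}
```

For example, StripTags.stripTags("<b>Hi</b> name") gives "Hi name", with no wrapping element needed.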

2 Comments

I am using your solution right now and it seems to work. I had to wrap my string in a <span> tag so that it also works with strings like "<b>Hi</b> name". Thanks a lot.
I'd rather use scala.xml.parsing.XhtmlParser to parse the HTML. Better chance of parsing it correctly.

The best solution I've found was to use cyberneko to parse your string and do some "clever" SAX event handling.

cyberneko will parse your HTML even if it's invalid, which is the case for the vast majority of the HTML you're likely to encounter in the wild.

If you register a custom ContentHandler that ignores all but the character events and appends those to a string builder, you'll get a good first approximation, with one annoying flaw: words separated only by a block element end up concatenated (for<br/>example => forexample).

A better solution is to get a list of all block elements, and have your ContentHandler listen to startElement events. If the element is a block one, just append a space character to your string builder.

Note that while this seems to work fine, it might not be perfect for your use case. <br/> is not, for example, turned into a line break. It shouldn't be too much work to add this if it's required, though.
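A rough sketch of such a handler is below. To keep it self-contained it is driven by the JDK's built-in SAX parser, which only accepts well-formed input; with cyberneko you would hand the same handler to its parser instead. TextOnlyHandler, TextExtractor and the Separators set are names and values invented for this example, and the element list is far from complete.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Keeps character data only and adds a space whenever a separating element starts.
class TextOnlyHandler(separators: Set[String]) extends DefaultHandler {
  private val buffer = new java.lang.StringBuilder

  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    buffer.append(ch, start, length)
    ()
  }

  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit = {
    if (separators.contains(qName.toLowerCase)) buffer.append(' ')
    ()
  }

  // Collapse runs of whitespace so the inserted spaces do not pile up.
  def result: String = buffer.toString.trim.replaceAll("\\s+", " ")
}

object TextExtractor {
  // Assumption: a small, incomplete set of elements treated as word separators.
  val Separators: Set[String] =
    Set("p", "div", "br", "li", "h1", "h2", "h3", "tr", "td")

  // Parses well-formed markup with the JDK SAX parser and returns its text.
  def extract(wellFormedMarkup: String): String = {
    val handler = new TextOnlyHandler(Separators)
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(new InputSource(new StringReader(wellFormedMarkup)), handler)
    handler.result
  }
}
```

With this, TextExtractor.extract("<span>for<br/>example</span>") yields "for example" rather than "forexample". Turning <br/> into a line break would only need a different character appended for that element.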

TagSoup should meet your requirement: it can parse real-world HTML, even when it is not well formed.

sbt dependency:

libraryDependencies += "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1"

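Sample code:

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import scala.xml.{Elem, XML}
import scala.xml.factory.XMLLoader

object TagSoupXmlLoader {

  // TagSoup's SAX parser factory, which tolerates malformed HTML
  private val factory = new SAXFactoryImpl()

  // Returns a scala.xml loader backed by a fresh TagSoup parser
  def get(): XMLLoader[Elem] = {
    XML.withSAXParser(factory.newSAXParser())
  }
}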

Usage:

val root = TagSoupXmlLoader.get().load("http://www.google.com")
println(root)

1 Comment

For those who wonder, this example needs a number of imports. I use sbt, so I load TagSoup with libraryDependencies += "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1" and then import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl. You also need Scala XML, e.g. libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.1.1", with import scala.xml.{Elem, XML} and, finally, import scala.xml.factory.XMLLoader.
