
I am scraping a website and receive an HTML page.

The page has some tables whose rows map an actor to a role:

(actor -> role)

For example:

(actor = Jason Priestley -> role = Brandon Walsh)

Sometimes a row is missing the "actor" or the "role" cell (the row has 1 column where 2 are expected).

Example file:

<div id="90210">
      <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      <table class="actors">
        <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        <tr><td class="actor">Shannen Doherty</td></tr>
        <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      </table>
</div>

I am having trouble filtering out the rows that have only 1 column.

My code:

  def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {
    // Attributes are selected with an "@" prefix; a bare name selects child elements instead.
    val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
    beverlyHillsData match {
      case Some(data) => {
        val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
        val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map { _.text }
        val roles  = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role")  map {_.text}
        (actors zip roles).toMap
      }
      case None => Map()
    }
  }

My main concern is this line:

val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )

How can I filter out the bad rows more precisely (without the _.toString())?

Any suggestions?

1 Answer

You can filter on the class attributes of the cells in each row:

def actorWithRole(n: xml.Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

val goodRows = data \\ "tr" filter actorWithRole
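
To make the idea concrete, here is a minimal sketch of the same kind of predicate written with plain attribute lookups instead of xml_sameElements. It assumes the scala-xml module is on the classpath and that the fragment is well-formed XML (scraped HTML often is not). The names ClassAttributeFilter and cellClasses are hypothetical and only serve the illustration.

```scala
import scala.xml.{Elem, Node, NodeSeq, XML}

object ClassAttributeFilter {
  // The table from the question, parsed from a string (assumed to be well-formed XML).
  val table: Elem = XML.loadString(
    """<table class="actors">
      |  <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      |  <tr><td class="actor">Shannen Doherty</td></tr>
      |  <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      |</table>""".stripMargin)

  // Hypothetical helper: the class attribute of each <td> directly under a row, in document order.
  def cellClasses(row: Node): Seq[String] =
    (row \ "td").map(td => (td \ "@class").text)

  // A row is kept only if it has exactly an "actor" cell followed by a "role" cell.
  def actorWithRole(row: Node): Boolean =
    cellClasses(row) == Seq("actor", "role")

  // Rows with a single cell are dropped here.
  val goodRows: NodeSeq = (table \ "tr").filter(actorWithRole)
}
```

With the sample table, the row for Shannen Doherty has only the class list Seq("actor"), so it does not pass the predicate.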

I'd also change the data extraction so that each actor/role pair stays intact, rather than collecting actors and roles separately and zipping them afterwards.

What I suggest is:

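def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {

  // whereId was not defined in the original snippet; this is one plausible definition (an assumption).
  def whereId(id: String)(n: xml.Node) = (n \ "@id").text == id

  // True when the class attributes found in the row are exactly "actor" then "role".
  def actorWithRole(n: xml.Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

  // Turns a two-cell row into an (actor -> role) pair; only called on rows that passed actorWithRole.
  def rowToEntry(r: xml.Node) =
    r \ "td" map (_.text) match {
      case Seq(actor, role) => (actor -> role)
    }

  val beverlyHillsData = page \\ "div" find whereId("90210")

  beverlyHillsData match {
    case Some(data) => {
      val goodRows = data \\ "tr" filter actorWithRole
      val entries = goodRows map rowToEntry
      entries.toMap
    }
    case None => Map()
  }
}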
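
As a variation, the filtering and the pair extraction can be folded into one step by looking up each cell as an Option, so that a row missing either cell simply produces no entry. The following is a sketch under assumptions, not a definitive implementation: it assumes scala-xml is on the classpath and that the page parses as well-formed XML, and the names BeverlyHillsExample and cell are hypothetical.

```scala
import scala.xml.{Node, NodeSeq, XML}

object BeverlyHillsExample {
  // The fragment from the question, parsed from a string (assumed to be well-formed XML).
  val page: NodeSeq = XML.loadString(
    """<div id="90210">
      |  <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      |  <table class="actors">
      |    <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      |    <tr><td class="actor">Shannen Doherty</td></tr>
      |    <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      |  </table>
      |</div>""".stripMargin)

  // Hypothetical helper: text of the first <td> under `row` with the given class, if present.
  def cell(row: Node, cls: String): Option[String] =
    (row \ "td").find(td => (td \ "@class").text == cls).map(_.text)

  def beverlyHillsParser(page: NodeSeq): Map[String, String] = {
    // All rows inside the div whose id attribute is "90210".
    val rows = (page \\ "div").filter(div => (div \ "@id").text == "90210") \\ "tr"
    // A row missing either cell yields None and is dropped by flatMap.
    rows.toList.flatMap { row =>
      for {
        actor <- cell(row, "actor")
        role  <- cell(row, "role")
      } yield actor -> role
    }.toMap
  }
}
```

Calling BeverlyHillsExample.beverlyHillsParser(BeverlyHillsExample.page) on the sample keeps the Jennie Garth and Jason Priestley rows and skips the row that has no role cell. Unlike the order-sensitive class comparison, this version does not care which of the two cells comes first in a row.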