5

I'm trying to parse the <a> links from the main part (the <article> element) of a blog post. I have adapted what I found on FPComplete, but nothing is printed out. (As far as I can see the code does not work at all: running it in the online IDE, even with the Bing target, also produces no links.)

In GHCi I can run the first line of parseAF by hand, and that gets me a large record, which I take to be correct. But cursor $// findNodes &| extractData returns [].

I've also tried a regex, but it wasn't happy trying to match such a long piece of text.

Can anyone help?

{-# LANGUAGE OverloadedStrings #-}

module HtmlParser where

import Network.HTTP.Conduit (simpleHttp)
import Prelude hiding (concat, putStrLn)
import Data.Text (concat)
import Data.Text.IO (putStrLn)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, element, fromDocument, ($//), (&//), (&/), (&|))

-- The URL we're going to search
url = "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "article" &/ element "a"

-- Extract the data from each node in turn
extractData = concat . attribute "href"

cursorFor :: String -> IO Cursor
cursorFor u = do
     page <- simpleHttp u
     return $ fromDocument $ parseLBS page

-- Process the list of data elements
processData = mapM_ putStrLn

-- main = do
parseAF :: IO ()
parseAF = do
     cursor <- cursorFor url
     processData $ cursor $// findNodes &| extractData

UPDATE: After more exploring, it seems that the problem lies with element "article". If I replace that with element "p" (which is OK in this instance, as the only <p> elements are inside the article anyway), then I get my links. Pretty weird!
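
One way to narrow this down in GHCi is to count what each selector matches (a rough sketch using the definitions above; I'm only suggesting what to inspect, not promising the counts):

ghci> :set -XOverloadedStrings
ghci> cursor <- cursorFor url
ghci> length (cursor $// element "article")                  -- is the <article> node matched at all?
ghci> length (cursor $// element "article" &/ element "a")   -- direct <a> children of <article>
ghci> length (cursor $// element "p" &/ element "a")         -- the <p> variant that does return links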

3 Comments
  • If I look at the source for that website, the only <article> tag doesn't actually contain any links in it; it would appear that it's actually populated by JavaScript that runs after the HTML is loaded. Commented Dec 16, 2015 at 15:36
  • Eh, no. <article> runs from line 185 to 267 and includes numerous links: some in the text and some that WordPress wraps around images (which I actually don't need). Commented Dec 16, 2015 at 15:54
  • Ah, yes, you would appear to be correct. Can you verify that when you download it with simpleHttp you get the same content? I would try it myself, but I'm currently behind a corporate proxy that makes it difficult to do that. Commented Dec 16, 2015 at 15:57
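
A quick way to check what simpleHttp actually fetched is to dump the raw bytes to a file and compare them with the browser's "view source" (a minimal sketch; dumpPage and the file name fetched.html are made up for illustration):

import qualified Data.ByteString.Lazy as L
import Network.HTTP.Conduit (simpleHttp)

-- write the downloaded page to disk so it can be inspected by hand
dumpPage :: IO ()
dumpPage =
  simpleHttp "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"
    >>= L.writeFile "fetched.html"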

2 Answers

6

I think you can do this in a very readable way with HXT by composing filters:

{-# LANGUAGE Arrows #-}

import Text.XML.HXT.Core
import Text.XML.HXT.Curl
import Text.XML.HXT.TagSoup

-- fetch the page at the given URL and return the href of every <a> inside <article>
links url = extract (readDocument
  [ withParseHTML yes   -- parse as HTML rather than strict XML
  , withTagSoup         -- use the lenient TagSoup parser
  , withCurl      []    -- fetch the document via libcurl
  , withWarnings  no    -- suppress parser warnings
  ] url)

extract doc = runX $ doc >>> xmlFilter "article" >>> xmlFilter "a" >>> toHref

-- pick out (sub)trees whose root element has the given name
xmlFilter name = deep (hasName name)

-- read the value of an element's href attribute
toHref = proc el -> do
   link    <- getAttrValue "href" -< el
   returnA -< link

You can call this in the following way:

links "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"

6 Comments

Does this assume the HTML is XML compliant?
Since the option "withParseHTML" is set, an HTML parser is used instead of an XML parser. So it is assumed to be valid HTML.
A bit more readable still with the redundant parens in extractLinks removed.
You are right, I removed the unnecessary parentheses.
Might I ask why, in the definition of extractLinks, you use a return at the end? I think a bind followed by a return is the identity, so we can just drop that return?
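
For what it's worth, binding a value and immediately feeding it to returnA is the identity by the arrow laws, so the proc block above should be equivalent to the attribute reader on its own (untested, but a standard simplification):

toHref = getAttrValue "href"
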
2

OK, so the problem was that &/ only looks at immediate children, whereas &// will go through all descendants:

findNodes = element "article" &// element "a"
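
To see the difference on a tiny document, here is a self-contained sketch (the inline HTML snippet is made up purely for illustration):

{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor

demo :: IO ()
demo = do
  let cur = fromDocument (parseLBS "<article><div><a href='/x'>x</a></div></article>")
  print (length (cur $// element "article" &/  element "a"))  -- 0: the <a> is not a direct child of <article>
  print (length (cur $// element "article" &// element "a"))  -- 1: &// searches all descendants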

