5

I'm trying to parse the <a> links from the main part (the <article> element) of a blog post. I have adapted what I found on FPComplete, but nothing is printed out. (As far as I can see the code does not work at all: running it in the online IDE, even with the Bing target, also produces no links.)

In GHCi I can run the first line of parseAF by hand, and that gets me a large record, which I take to be correct. But cursor $// findNodes &| extractData returns [].

I've also tried a regex, but it wasn't happy trying to match such a long piece of text.

Can anyone help?

{-# LANGUAGE OverloadedStrings #-}

module HtmlParser where

import Network.HTTP.Conduit (simpleHttp)
import Prelude hiding (concat, putStrLn)
import Data.Text (concat)
import Data.Text.IO (putStrLn)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, element, fromDocument, ($//), (&//), (&/), (&|))

-- The URL we're going to search
url = "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "article" &/ element "a"

-- Extract the data from each node in turn
extractData = concat . attribute "href"

cursorFor :: String -> IO Cursor
cursorFor u = do
     page <- simpleHttp u
     return $ fromDocument $ parseLBS page

-- Process the list of data elements
processData = mapM_ putStrLn

-- main = do
parseAF :: IO ()
parseAF = do
     cursor <- cursorFor url
     processData $ cursor $// findNodes &| extractData

UPDATE: After more exploring, it seems that the problem lies with element "article". If I replace that with element "p" (which is OK in this instance, as the only <p> elements are inside the article anyway), then I get my links. Pretty weird!
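
One way to narrow this down in GHCi is to count what each selector matches (a rough sketch using the definitions above; I'm only suggesting what to inspect, not promising the counts):

ghci> :set -XOverloadedStrings
ghci> cursor <- cursorFor url
ghci> length (cursor $// element "article")                  -- is the <article> node matched at all?
ghci> length (cursor $// element "article" &/ element "a")   -- direct <a> children of <article>
ghci> length (cursor $// element "p" &/ element "a")         -- the <p> variant that does return links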

3 Comments
  • If I look at the source for that website, the only <article> tag doesn't actually contain any links in it; it would appear that it's actually populated by JavaScript that runs after the HTML is loaded. Commented Dec 16, 2015 at 15:36
  • Eh, no. <article> runs from line 185 to 267 and includes numerous links: some in the text and some that WordPress wraps around images (which I actually don't need). Commented Dec 16, 2015 at 15:54
  • Ah, yes, you would appear to be correct. Can you verify that when you download it with simpleHttp you get the same content? I would try it myself, but I'm currently behind a corporate proxy that makes it difficult to do that. Commented Dec 16, 2015 at 15:57
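
A quick way to check what simpleHttp actually fetched is to dump the raw bytes to a file and compare them with the browser's "view source" (a minimal sketch; dumpPage and the file name fetched.html are made up for illustration):

import qualified Data.ByteString.Lazy as L
import Network.HTTP.Conduit (simpleHttp)

-- write the downloaded page to disk so it can be inspected by hand
dumpPage :: IO ()
dumpPage =
  simpleHttp "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"
    >>= L.writeFile "fetched.html"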

2 Answers

6

I think you can do this in a very readable way with HXT by composing filters:

{-# LANGUAGE Arrows #-}

import Text.XML.HXT.Core
import Text.XML.HXT.Curl
import Text.XML.HXT.TagSoup

-- fetch the page at the given URL and return the href of every <a> inside <article>
links url = extract (readDocument
  [ withParseHTML yes   -- parse as HTML rather than strict XML
  , withTagSoup         -- use the lenient TagSoup parser
  , withCurl      []    -- fetch the document via libcurl
  , withWarnings  no    -- suppress parser warnings
  ] url)

extract doc = runX $ doc >>> xmlFilter "article" >>> xmlFilter "a" >>> toHref

-- pick out (sub)trees whose root element has the given name
xmlFilter name = deep (hasName name)

-- read the value of an element's href attribute
toHref = proc el -> do
   link    <- getAttrValue "href" -< el
   returnA -< link

You can call this in the following way:

links "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"

6 Comments

Does this assume the HTML is XML compliant?
Since the option "withParseHTML" is set, an HTML parser is used instead of an XML parser. So it is assumed to be valid HTML.
A bit more readable still with the redundant parens in extractLinks removed.
You are right, I removed the unnecessary parentheses.
Might I ask why, in the definition of extractLinks, you use a return at the end? I think a bind followed by a return is the identity, so we can just drop that return?
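
For what it's worth, binding a value and immediately feeding it to returnA is the identity by the arrow laws, so the proc block above should be equivalent to the attribute reader on its own (untested, but a standard simplification):

toHref = getAttrValue "href"
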
2

OK, so the problem was that &/ only looks at immediate children, whereas &// will go through all descendants:

findNodes = element "article" &// element "a"
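
To see the difference on a tiny document, here is a self-contained sketch (the inline HTML snippet is made up purely for illustration):

{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor

demo :: IO ()
demo = do
  let cur = fromDocument (parseLBS "<article><div><a href='/x'>x</a></div></article>")
  print (length (cur $// element "article" &/  element "a"))  -- 0: the <a> is not a direct child of <article>
  print (length (cur $// element "article" &// element "a"))  -- 1: &// searches all descendants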

