I'm trying to build a web scraper in Go. I'm fairly new to the language and I'm not sure what I'm doing wrong with the HTML parser. I'm trying to parse the HTML to find anchor tags, but I keep getting html.TokenTypeEnd instead.

package main

import (
    "fmt"
    "golang.org/x/net/html"
    "io/ioutil"
    "net/http"
)

func GetHtml(url string) (text string, resp *http.Response, err error) {
    var bytes []byte
    if url == "https://www.coastal.edu/scs/employee" {
        resp, err = http.Get(url)
        if err != nil {
            fmt.Println("There seems to be an error with the Employee Console.")
        }
        bytes, err = ioutil.ReadAll(resp.Body)
        if err != nil {
            fmt.Println("Cannot read byte response from Employee Console.")
        }
        text = string(bytes)
    } else {
        fmt.Println("Issue with finding URL. Looking for: " + url)
    }

    return text, resp, err
}

func main() {
    htmlSrc, response, err := GetHtml("https://www.coastal.edu/scs/employee")
    if err != nil {
        fmt.Println("Cannot read HTML source code.")
    }
    _ = htmlSrc
    htmlTokens := html.NewTokenizer(response.Body)
    i := 0
    for i < 1 {

        tt := htmlTokens.Next()
        fmt.Printf("%T", tt)
        switch tt {

        case html.ErrorToken:
            fmt.Println("End")
            i++

        case html.TextToken:
            fmt.Println(tt)

        case html.StartTagToken:
            t := htmlTokens.Token()

            isAnchor := t.Data == "a"
            if isAnchor {
                fmt.Println("We found an anchor!")
            }

        }

    }
}

I get html.TokenTypeEnd whenever I print the token type with fmt.Printf("%T", tt).

  • You can only read the response.Body once. It has already been used up in your GetHtml function. Why are you reading the whole html string, then tossing it anyway? Commented Sep 20, 2017 at 1:25
  • I'm used to Python so I thought that I had to read the html and return it as a string. This is the first Go program that I've written and I'm very new to the language so I'm trying to understand it. Commented Sep 20, 2017 at 1:34
  • When you come across io.Readers or io.ReadClosers, you want to avoid reading them entirely into a variable if you can. There are optimizations for these types that can make things more efficient if used properly. This is why html.NewTokenizer takes one in the first place. Just some advice. It's often totally okay to use ioutil.ReadAll if you are sure that the response isn't enormous. Commented Sep 20, 2017 at 1:37
  • Thank you! I'm going to definitely keep your advice in mind moving along to future projects. So io.Reader is more like a buffer? Commented Sep 20, 2017 at 1:41
  • Yes. Depending on the underlying source, it could actually be reading from e.g. a network socket, or some other source not actually in memory at the time. Something like html.NewTokenizer can take advantage of this by reading in just enough data to get a full token without having to have the full input in memory. There is a lot of cool stuff going on behind the scenes with Go. Read the godocs and feel free to delve into the source (which is linked directly from the docs) when you have learned more, or want to know what is really going on. Go is written in Go :) Commented Sep 20, 2017 at 1:51

1 Answer

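GetHtml already reads the response body to the end, so by the time the tokenizer is created there is nothing left to read. The first call to Next hits EOF and returns html.ErrorToken (htmlTokens.Err() returns io.EOF in that case). The output html.TokenTypeEnd is not a token type: it is the %T output, html.TokenType, followed immediately by the "End" printed in the html.ErrorToken case.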

Use this code:

htmlTokens := html.NewTokenizer(strings.NewReader(htmlSrc))

to create the tokenizer.

Also, close the response body in GetHtml to prevent a connection leak.

The code can be simplified to:

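package main

import (
    "fmt"
    "log"
    "net/http"

    "golang.org/x/net/html"
)

func main() {
    response, err := http.Get("https://www.coastal.edu/scs/employee")
    if err != nil {
        log.Fatal(err)
    }
    // Close the body when main returns so the connection is not leaked.
    defer response.Body.Close()

    // The tokenizer streams straight from the body; nothing reads it first.
    htmlTokens := html.NewTokenizer(response.Body)
loop:
    for {
        tt := htmlTokens.Next()
        fmt.Printf("%T", tt)
        switch tt {
        case html.ErrorToken:
            // Returned at the end of the input (io.EOF) or on a read error.
            fmt.Println("End")
            break loop
        case html.TextToken:
            fmt.Println(tt)
        case html.StartTagToken:
            t := htmlTokens.Token()
            if t.Data == "a" {
                fmt.Println("We found an anchor!")
            }
        }
    }
}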

2 Comments

Thank you, this fixed the issue, and I wasn't even aware of the connection leak. I'm obviously very new to Go.
That's actually exactly what I did lol. Thank you though, great advice!
