I'm trying to build a web scraper in Go. I'm fairly new to the language, and I'm not sure what I'm doing wrong with the HTML parser. I'm trying to parse the HTML to find anchor tags, but I keep getting html.TokenTypeEnd instead.
package main

import (
	"fmt"
	"golang.org/x/net/html"
	"io/ioutil"
	"net/http"
)

func GetHtml(url string) (text string, resp *http.Response, err error) {
	var bytes []byte
	if url == "https://www.coastal.edu/scs/employee" {
		resp, err = http.Get(url)
		if err != nil {
			fmt.Println("There seems to be an error with the Employee Console.")
		}
		bytes, err = ioutil.ReadAll(resp.Body)
		if err != nil {
			fmt.Println("Cannot read byte response from Employee Console.")
		}
		text = string(bytes)
	} else {
		fmt.Println("Issue with finding URL. Looking for: " + url)
	}
	return text, resp, err
}

func main() {
	htmlSrc, response, err := GetHtml("https://www.coastal.edu/scs/employee")
	if err != nil {
		fmt.Println("Cannot read HTML source code.")
	}
	_ = htmlSrc
	htmlTokens := html.NewTokenizer(response.Body)
	i := 0
	for i < 1 {
		tt := htmlTokens.Next()
		fmt.Printf("%T", tt)
		switch tt {
		case html.ErrorToken:
			fmt.Println("End")
			i++
		case html.TextToken:
			fmt.Println(tt)
		case html.StartTagToken:
			t := htmlTokens.Token()
			isAnchor := t.Data == "a"
			if isAnchor {
				fmt.Println("We found an anchor!")
			}
		}
	}
}
I get html.TokenTypeEnd whenever I print the token type with:
fmt.Printf("%T", tt)
You can only read response.Body once. It has already been used up in your GetHtml function. Why are you reading the whole HTML string and then tossing it anyway?

When working with io.Readers or io.ReadClosers, you want to avoid reading everything into a variable if you can. There are optimizations for these types that can make things more efficient if used properly. This is why html.NewTokenizer takes one in the first place. Just some advice. It's often totally okay to call ioutil.ReadAll if you are sure that the response isn't enormous.

html.NewTokenizer can take advantage of this by reading in just enough data to get a full token, without needing the full input in memory. There is a lot of cool stuff going on behind the scenes with Go. Read the godocs and feel free to delve into the source (which is linked directly from the docs) when you have learned more or want to know what is really going on. Go is written in Go :)