
I need to find elements in an HTML string. Unfortunately the HTML is pretty much broken (e.g. closing tags without an opening pair).

I tried to use XPath with launchpad.net/xmlpath, but it cannot parse HTML that is this badly broken.

How can I find elements in broken HTML with Go? I would prefer XPath, but I am open to other solutions as long as I can use them to look up tags with a specific id or class.

  • For those stumbling on this issue now, note that the xmlpath project has moved (and improved) to gopkg.in/xmlpath.v1. Commented Apr 16, 2015 at 20:50

1 Answer


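It seems golang.org/x/net/html does the job: it parses the broken HTML leniently, and the repaired markup it renders can then be handed to xmlpath.

This is what I am doing now: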

package main

import (
    "bytes"
    "log"
    "strings"

    "golang.org/x/net/html"
    "gopkg.in/xmlpath.v2"
)

func main() {
    // Sample input: the <p> element is never closed.
    brokenHTML := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`

    // Parse with the lenient HTML parser, which builds a repaired tree.
    root, err := html.Parse(strings.NewReader(brokenHTML))
    if err != nil {
        log.Fatal(err)
    }

    // Serialize the repaired tree back to well-formed markup.
    var b bytes.Buffer
    if err := html.Render(&b, root); err != nil {
        log.Fatal(err)
    }
    fixedHTML := b.String()

    // Hand the repaired markup to xmlpath.
    xmlroot, err := xmlpath.ParseHTML(strings.NewReader(fixedHTML))
    if err != nil {
        log.Fatal(err)
    }

    // Look up the <h1> element by its id and print its text.
    path := xmlpath.MustCompile(`//h1[@id='someid']`)
    if value, ok := path.String(xmlroot); ok {
        log.Println("Found:", value)
    }
}

2 Comments

Do you know how to iterate over all Nodes which match a given XPath? Thanks.
iter := path.Iter(xmlroot)
for iter.Next() {
    log.Println(iter.Node().String())
}
