4

I'm having trouble converting an HTML table into a Go array. I've tried both x/net/html and goquery, without success with either.

Let's say we have this HTML table:

<html>
  <body>
    <table>
      <tr>
        <td>Row 1, Content 1</td>
        <td>Row 1, Content 2</td>
        <td>Row 1, Content 3</td>
        <td>Row 1, Content 4</td>
      </tr>
      <tr>
        <td>Row 2, Content 1</td>
        <td>Row 2, Content 2</td>
        <td>Row 2, Content 3</td>
        <td>Row 2, Content 4</td>
      </tr>
    </table>
  </body>
</html>

And I'd like to end up with this array:

------------------------------------
|Row 1, Content 1| Row 1, Content 2|
------------------------------------
|Row 2, Content 1| Row 2, Content 2|
------------------------------------

As you can see, I'm just ignoring Contents 3 and 4.

My extraction code:

func extractValue(content []byte) {
  doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(content))

  doc.Find("table tr td").Each(func(i int, td *goquery.Selection) {
    // ...
  })
}

I've tried adding a counter that would be responsible for skipping the <td> elements I don't want to convert, and calling

td.NextAll()

but with no luck. Do you have any idea what I should do to accomplish this?

Thanks.

2
  • Can you add the actual code you used? Commented Mar 12, 2016 at 20:48
  • The html table doesn't look valid. There's no closing td tags here if I'm not mistaken. Commented Nov 3, 2017 at 6:48

3 Answers

7

You can get away with package golang.org/x/net/html only.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

var body = strings.NewReader(`
        <html>
        <body>
        <table>
        <tr>
        <td>Row 1, Content 1</td>
        <td>Row 1, Content 2</td>
        <td>Row 1, Content 3</td>
        <td>Row 1, Content 4</td>
        </tr>
        <tr>
        <td>Row 2, Content 1</td>
        <td>Row 2, Content 2</td>
        <td>Row 2, Content 3</td>
        <td>Row 2, Content 4</td>
        </tr>
        </table>
        </body>
        </html>`)

func main() {
    z := html.NewTokenizer(body)
    content := []string{}

    // Read tokens until the tokenizer reports an error (io.EOF at end of input).
    // Looping until a </html> tag would never terminate if that tag were missing.
    for {
        tt := z.Next()
        if tt == html.ErrorToken {
            break
        }
        if tt != html.StartTagToken {
            continue
        }
        // For each <td> start tag, take the text token that directly follows it.
        if t := z.Token(); t.Data == "td" {
            if z.Next() == html.TextToken {
                content = append(content, strings.TrimSpace(string(z.Text())))
            }
        }
    }
    // Print to check the slice's content
    fmt.Println(content)
}

This code is written for this specific HTML pattern only, but refactoring it to be more general wouldn't be hard.

2 Comments

Thank you very much. After changing your code a little, I got the expected result. play.golang.org/p/zykOWwHCcG
@DanilloSouza happy to help!
1

If you need a more structured way of extracting data from HTML tables, https://github.com/nfx/go-htmltable supports rowspans and colspans.

type AM4 struct {
    Model             string `header:"Model"`
    ReleaseDate       string `header:"Release date"`
    PCIeSupport       string `header:"PCIesupport[a]"`
    MultiGpuCrossFire bool   `header:"Multi-GPU CrossFire"`
    MultiGpuSLI       bool   `header:"Multi-GPU SLI"`
    USBSupport        string `header:"USBsupport[b]"`
    SATAPorts         int    `header:"Storage features SATAports"`
    RAID              string `header:"Storage features RAID"`
    AMDStoreMI        bool   `header:"Storage features AMD StoreMI"`
    Overclocking      string `header:"Processoroverclocking"`
    TDP               string `header:"TDP"`
    SupportExcavator  string `header:"CPU support[14] Excavator"`
    SupportZen        string `header:"CPU support[14] Zen"`
    SupportZenPlus    string `header:"CPU support[14] Zen+"`
    SupportZen2       string `header:"CPU support[14] Zen 2"`
    SupportZen3       string `header:"CPU support[14] Zen 3"`
    Architecture      string `header:"Architecture"`
}
am4Chipsets, _ := htmltable.NewSliceFromURL[AM4]("https://en.wikipedia.org/wiki/List_of_AMD_chipsets")
fmt.Println(am4Chipsets[2].Model)
fmt.Println(am4Chipsets[2].SupportZen2)

// Output:
// X370
// Varies[c]

-1

Try an approach like this to make a 2d array and handle variable row sizes:

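    // body is assumed to be an io.Reader holding the HTML document.
    z := html.NewTokenizer(body)
    table := [][]string{}
    row := []string{}

    // Read tokens until the tokenizer reports an error (io.EOF at end of input).
    for {
        tt := z.Next()
        if tt == html.ErrorToken {
            break
        }
        if tt != html.StartTagToken {
            continue
        }
        t := z.Token()

        // A new <tr> closes the previous row, if it collected any cells.
        if t.Data == "tr" && len(row) > 0 {
            table = append(table, row)
            row = []string{}
        }

        // For each <td>, keep the text token that directly follows it.
        if t.Data == "td" {
            if z.Next() == html.TextToken {
                row = append(row, strings.TrimSpace(string(z.Text())))
            }
        }
    }
    // Flush the last row, which no later <tr> closes.
    if len(row) > 0 {
        table = append(table, row)
    }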
