
In this thread (Scraping table from local HTML with unicode characters), QHarr helped me scrape a table from a local HTML file. I have an HTML file at this Link

I used the same code and only changed the values of the variables 'startTableNumber', 'endTableNumber' and 'numColumns':

Public Sub Test()
Dim fStream  As ADODB.Stream, html As HTMLDocument
Set html = New HTMLDocument
Set fStream = New ADODB.Stream
With fStream
    .Charset = "UTF-8"
    .Open
    .LoadFromFile "C:\Users\Future\Desktop\Sample 2.html"
    html.body.innerHTML = .ReadText
    .Close
End With

Dim hTables As Object, startTableNumber As Long, i As Long, r As Long, c As Long
Dim counter As Long, endTableNumber As Long, numColumns As Long

startTableNumber = 91
endTableNumber = 509
numColumns = 14

Set hTables = html.getElementsByTagName("table")
r = 2: c = 1

For i = startTableNumber To endTableNumber Step 2
    counter = counter + 1
    If counter = 10 Then
        c = 1: r = r + 1: counter = 1
    End If
    Cells(r, c) = hTables(i).innerText
    c = c + 1
Next

End Sub

But the table data came out scattered across the sheet. Furthermore, I would like a flexible way for the code to work out those variables without my assigning them manually. I hope to find a solution using Selenium. I also hope not to receive negative rep; I have done my best to clarify the issue. Regards

  • Did I really write Cells(r, c) without a qualifying sheet? Hmm... :-) What might be helpful for people to know is that there are nested table tag elements, many of which contain the same content repeated. Later in the table tag collection, text content appears in individual tables in a manner that would normally be in a td tag, i.e. at that point you end up treating table tag elements as if they were td when writing out. Commented Dec 10, 2018 at 19:57
  • Never mind about that point my tutor .. Commented Dec 10, 2018 at 20:00
  • This doesn't work for your current file in quite the same way, as the ordering is slightly different. You can't simply take one element after the other and assume it is the next one to write out in order to replicate the overarching table appearance in the sheet. Commented Dec 10, 2018 at 20:02
  • How can I adapt that code to be used in Selenium and adjust the results? Commented Dec 10, 2018 at 20:55

1 Answer


So, as I said in my comments, you need to study how the data appears in the later table tags and perform a mapping to get the correct ordering. The following writes out the table. As I also mentioned, this is not robust; only the methodology may be transferable to other documents.

In your case you wouldn't be reading from file but would use

Set tables = driver.FindElementsByCss("table[width='100%'] table:first-child")

You would then For Each over the web elements in the collection, adjusting the syntax as required, e.g. .Text instead of .innerText. There may be a few other adaptations for Selenium due to its indexing of webElements, but everything you need should be evident below.
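As a rough illustration of that adaptation, the following sketch loops over the matched elements with For Each and reads .Text. It assumes the SeleniumBasic library is installed and referenced in the VBA project; the URL and the procedure name ListTableTextsWithSelenium are hypothetical placeholders that do not come from the original post.

```vba
Option Explicit

' Sketch only: assumes a reference to the SeleniumBasic type library.
' The URL is a hypothetical placeholder; replace it with the real page.
Public Sub ListTableTextsWithSelenium()
    Dim driver As New ChromeDriver
    Dim tables As WebElements
    Dim tbl As WebElement
    Dim position As Long

    driver.Get "https://example.com/page.html"   ' hypothetical URL

    ' Same CSS selector as used against the local file
    Set tables = driver.FindElementsByCss("table[width='100%'] table:first-child")

    ' For Each avoids depending on how the collection is indexed
    For Each tbl In tables
        position = position + 1
        Debug.Print position, tbl.Text           ' .Text instead of .innerText
    Next tbl

    driver.Quit
End Sub
```

Printing the position next to each text is one way to inspect which elements hold the values for a record before writing a mapping for them.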

VBA:

Option Explicit
Public Sub ParseInfo()
    Dim html As HTMLDocument, tables As Object, ws As Worksheet, i As Long
    Set ws = ThisWorkbook.Worksheets("Sheet2")
    Dim fStream  As ADODB.Stream
    Set html = New HTMLDocument
    Set fStream = New ADODB.Stream
    With fStream
        .Charset = "UTF-8"
        .Open
        .LoadFromFile "C:\Users\User\Desktop\test.html"
        html.body.innerHTML = .ReadText
        .Close
    End With

    ' Inner tables whose text holds the individual cell values
    Set tables = html.querySelectorAll("table[width='100%'] table:first-child")
    Dim rowCounter As Long: rowCounter = 2   ' first data row, below the headers
    Dim mappings(), j As Long, headers(), arr(13)
    headers = Array("Notes", "Type", "Enrollment status", "Governorate of birth", "Year", "Month", "Day", "Date of Birth", "Religion", _
    "Nationality", "Student Name", "National Number", "Student Code", "M")

    ' mappings(j) is the output column for the j-th table of each record
    mappings = Array(3, 8, 9, 12, 11, 10, 2, 7, 1, 6, 5, 4, 13)
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers

    ' Each record spans 26 tables; every second one carries a value
    For i = 89 To 504 Step 26
        arr(0) = vbNullString   ' "Notes" column has no source table

        For j = 0 To 12
            arr(mappings(j)) = tables.item(2 * j + i).innerText
        Next

        ws.Cells(rowCounter, 1).Resize(1, UBound(arr) + 1) = arr
        rowCounter = rowCounter + 1
    Next
End Sub
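
To see what the mappings array does in isolation, here is a minimal sketch that applies it to placeholder labels instead of scraped text. The names ReorderCells and DemoReorder and the labels "t0" to "t12" are hypothetical; only the mappings values are taken from the code above, and a 0-based array (no Option Base 1) is assumed.

```vba
Option Explicit

' Places the j-th source value into output position mappings(j).
' Position 0 stays empty, mirroring arr(0) = vbNullString in the answer.
' Assumes both arrays are 0-based.
Public Function ReorderCells(ByVal source As Variant, ByVal mappings As Variant) As Variant
    Dim result() As Variant
    Dim j As Long

    ReDim result(0 To UBound(mappings) + 1)
    result(0) = vbNullString

    For j = LBound(mappings) To UBound(mappings)
        result(mappings(j)) = source(j)
    Next j

    ReorderCells = result
End Function

Public Sub DemoReorder()
    Dim mappings As Variant, source As Variant, result As Variant

    mappings = Array(3, 8, 9, 12, 11, 10, 2, 7, 1, 6, 5, 4, 13)
    ' Placeholder labels standing in for the 13 table texts of one record
    source = Array("t0", "t1", "t2", "t3", "t4", "t5", "t6", _
                   "t7", "t8", "t9", "t10", "t11", "t12")

    result = ReorderCells(source, mappings)
    Debug.Print Join(result, "|")
End Sub
```

Running DemoReorder and comparing the printed order with the headers row shows which source table ends up under which column, which is the same check you would do by eye when working out a mapping for another document.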

13 Comments

That's amazing. Thanks a lot for the great help. Can you post a snapshot of how you got the mappings array?
Another point: what about having the code determine the 89 and 504 to make it more flexible? Is that possible or not?
Is there a way to download the HTML page using Selenium so that it is the same as the file attached in my first post? I have about 200 pages.
Should be a .count property I think
Thanks a lot for great and awesome efforts. You have helped me a lot .. Best Regards
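
On the question of the hard-coded 504: one possible approach is to derive the upper loop bound from the size of the collection. The sketch below is only an assumption about how that could look. The function name LastRecordStart and its parameters are hypothetical, and the start index (89 in the answer) would still have to be found by inspecting the document.

```vba
Option Explicit

' Returns the largest record start index for which every cell of the record
' still lies inside a 0-based collection holding tableCount elements.
' Returns -1 when not even one complete record fits.
Public Function LastRecordStart(ByVal firstStart As Long, ByVal recordStride As Long, _
                                ByVal cellsPerRecord As Long, ByVal cellStep As Long, _
                                ByVal tableCount As Long) As Long
    Dim maxStart As Long

    ' The last cell of a record sits cellStep * (cellsPerRecord - 1) after its start
    maxStart = (tableCount - 1) - cellStep * (cellsPerRecord - 1)

    If maxStart < firstStart Then
        LastRecordStart = -1
    Else
        LastRecordStart = firstStart + ((maxStart - firstStart) \ recordStride) * recordStride
    End If
End Function
```

With the values from the answer, the loop header could then read For i = 89 To LastRecordStart(89, 26, 13, 2, tables.Length) Step 26, so that only the start index and the record layout remain fixed by hand.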