2

I'm new to web scraping and the HTML language.

I'm trying to write a code in VBA to extract data from the following website: https://companies.govmu.org:4343/MNSOnlineSearch/

I have an Excel sheet with over 5000 company names and their respective "File No" in columns A and B respectively, and I need to input their "Status" (either "Live" or "Defunct") in column C. This will be done after searching for each company by "File No" and then extracting their status to the Excel sheet.

The issue is that I can't seem to get the element containing the data that I need.

I've already written the bit of code which will extract the "File No" from my Excel sheet, paste it on the webpage in the "File No" search box, and run the search. (You can try searching C5113, as an example).

However, on the resulting webpage, I've tried getting the element containing the data that I need, but it does not work.

For example, I tried to MsgBox (MsgBox is my personal way to check whether my variable contains the data I need) the inner HTML of the tag fieldset (fs) with ID "CompanyList" as shown in the code below, but it returns an error.

I've also tried with another variable named div, of data type HTMLDivElement, and then getting the element by ID "companies".

And finally, I've also tried looping through a variable of type IHTMLElementCollection to look for the element that I need, but it still does not show the element that I need (it shows other elements that I don't need).

Option Explicit

Sub ExtractStatusDetails()
    Dim ie As InternetExplorer
    Dim html As HTMLDocument
    Dim resultHtml As HTMLDocument
    Dim fs As IHTMLElement
    Dim searchBoxes As IHTMLElementCollection
    Dim searchButton As Object
    Dim homePage As String

    homePage = "https://companies.govmu.org:4343/MNSOnlineSearch/"
    Set ie = New InternetExplorer   
    ie.Visible = False
    ie.navigate homePage
    Do While ie.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    Set html = ie.document
    Set searchBoxes = html.getElementsByClassName("col-md-6 col-lg-4")

    searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "placeholder", "value")
    searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "Search company by File No...", "C63")

    Set searchButton = searchBoxes(0).getElementsByClassName("btn btn-large btn-primary btn-raised")
    searchButton(0).Click
    Do While ie.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    Set resultHtml = ie.document
    Set fs = resultHtml.getElementById("CompanyList")

    MsgBox fs.innerHTML

    ie.Quit

End Sub
0

1 Answer 1

1

The page does an xmlhttp POST request which retrieves data from a backend data store (likely Oracle GlassFish > JDBC API > data repository e.g. MySQL) . It returns all similar matches, possibly including exact.

You can find the POST request in the network traffic of browser dev tools after you enter the fileNo and press the search button.

Below is a function you can call in a loop over your fileNos to retrieve the company status

Option Explicit

Public Sub test()
    Dim fileNo As String, xmlhttp As Object
    fileNo = "C5113"
    Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
    MsgBox GetCompanyStatus(fileNo, xmlhttp)
End Sub

Public Function GetCompanyStatus(ByVal fileNo As String, ByVal xmlhttp As Object) As String
    Dim html As HTMLDocument, body As String, fileNos As Object, i As Long

    Set html = New HTMLDocument
    body = "tabs=tab-1&searchByName=&searchByFileNo=PLACEHOLDER&submitCompanies=&searchByBusName=&searchByBRN=&searchByIncDateFrom=&searchByIncDateTo=&doAction=search"

    With xmlhttp
        .Open "POST", "https://companies.govmu.org:4343/MNSOnlineSearch/GetCompanies", False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send Replace$(body, "PLACEHOLDER", fileNo)
        html.body.innerHTML = .responseText
        Set fileNos = html.querySelectorAll("td.tdFileNo")
        If fileNos.Length > 0 Then
            For i = 0 To fileNos.Length - 1
                If fileNos.item(i).innerText = fileNo Then
                    GetCompanyStatus = html.querySelectorAll("td.tdStatus").item(i).innerText
                    Exit Function
                End If
            Next i
        End If
        GetCompanyStatus = "Not found"
    End With
End Function

I would instead consider how you can group your requests. As you can post partial file numbers you could cut down on the number of requests considerably by doing in batches with partial file numbers e.g. search for C5 or C51; the backend then does something like "C5%" to return all matches starting with the specified string, and then loop those results searching for your file numbers of interest that fall within that range.

You could have a dictionary with fileNo as key and status as value and update this as you loop the results returned by a request. I think the number of keys is constrained by Long, so no problem I think for storing all your fileNos at the start, in a dictionary, and updating later during requests. You could even have multiple dictionaries that host ranges of fileNos, like the volumes of the good old Encyclopædia Britannica. This would limit loops to dicts you hope to populate from the same request, for example. Is something to play around with an explore.

Sign up to request clarification or add additional context in comments.

7 Comments

Great code. When trying to study the code, I didn't know how to inspect the URL .. Can you add explanation to that part please?
And why GET fails although the url works on the browser url = "https://companies.govmu.org:4343/MNSOnlineSearch/GetCompanies?tabs=tab-1&searchByName=&searchByFileNo=" & fileNo & "&submitCompanies=&searchByBusName=&searchByBRN=&searchByIncDateFrom=&searchByIncDateTo=&doAction=search"
As to your last point - I don't know. Perhaps the browser goes on to do a POST request in the background.
See edit to answer. You can probably monitor the GET request in browser via dev tools and see what is going on
Thanks a lot for the detailed explanation @QHarr. The code works perfectly. On a side note, I'm trying to understand why it's not possible to use the method I described? Since the body of the response HTML document already contains the status details after I click on search, why can't I extract it? Does it have something to do with the characteristics of POST vs GET requests?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.