4

Please help. I am parsing HTML with MSHTML. My code for getting all attributes of a particular tag looks like this:

void GetAttributes(MSHTML::IHTMLElementPtr pColumnInnerElement)
{
    IHTMLDOMNode *pElemDN = NULL;
    LONG lACLength;
    MSHTML::IHTMLAttributeCollection *pAttrColl;
    IDispatch* pACDisp;
    VARIANT vACIndex;
    IDispatch* pItemDisp;
    IHTMLDOMAttribute* pItem;
    BSTR bstrName;
    VARIANT vValue;
    VARIANT_BOOL vbSpecified;
    pColumnInnerElement->QueryInterface(IID_IHTMLDOMNode, (void**)&pElemDN);
    if (pElemDN != NULL)
    {
        pElemDN->get_attributes(&pACDisp);
        pACDisp->QueryInterface(IID_IHTMLAttributeCollection, (void**)&pAttrColl);
        pAttrColl->get_length(&lACLength);
        vACIndex.vt = VT_I4;
        for (int i = 0; i < lACLength; i++)
        {

            vACIndex.lVal = i;
            pItemDisp = pAttrColl->item(&vACIndex);
            if (pItemDisp != NULL)
            {
                pItemDisp->QueryInterface(IID_IHTMLDOMAttribute, (void**)&pItem);
                pItem->get_specified(&vbSpecified);
                pItem->get_nodeName(&bstrName);
                pItem->get_nodeValue(&vValue);

                if (vbSpecified)
                    cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
                pItem->Release();
                // Release inside the NULL check so a NULL pItemDisp is never dereferenced.
                pItemDisp->Release();
            }

        }
        pElemDN->Release();
        pACDisp->Release();
        pAttrColl->Release();
    }
}

The problem is that for the tag <input id="Switch l_id2" class="pointer" name="Switch" onclick='SetControl("Switch l",1)' type="button" value="OK"> it prints every attribute except value: get_specified returns false for the value attribute.

My output is

id :Switch l_id2
class :pointer
onclick :SetControl("Switch l",1)
type :button
name :Switch

Any idea why? And which other attributes might have this problem?

Note

I tried the following, and it shows the correct result for the value attribute:

        if (strcmp(_com_util::ConvertBSTRToString(bstrName), "value") == 0)
        {
            cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
        }
7
  • What does your Note mean? Is it due to the vbSpecified test? Commented Jun 3, 2013 at 6:42
  • I added the Note to show that the correct value is in vValue.bstrVal, but vbSpecified still returns false. Commented Jun 3, 2013 at 7:05
  • Not sure the specified flag is always meaningful. Have you tried changing the document compatibility mode (msdn.microsoft.com/en-us/library/cc288325.aspx)? For example, specified is always TRUE when IE is in IE9 'Standards mode'. Commented Jun 3, 2013 at 7:12
  • @SimonMourier I want to parse every tag and every attribute in an HTML document. Is there any other way to do this in C++? I have already started parsing HTML with MSHTML. Any advice would be helpful. Commented Jun 3, 2013 at 7:13
  • My web page is in IE8 compatibility mode, and I didn't find any documentation mentioning this kind of behavior for get_specified. For an input of type text, get_specified returns false for the type attribute, but it works for an input of type button. Commented Jun 3, 2013 at 7:20

4 Answers

3
+150

If you are working in managed (C++/CLI) Visual C++, you can consider the HTML Agility Pack, available via NuGet.

If sticking to MSHTML is not necessary, you could parse the HTML documents as XML instead, provided they are well-formed (XHTML). That way you can walk all the tags and attributes with a lot of flexibility. There are plenty of XML parsers available for C++.

This library looks compact, simple, and efficient (available for multiple platforms): https://github.com/leethomason/tinyxml2

Another one is: http://pugixml.org/

This link may help you if you want to get rid of MSHTML dependency: http://www.codeproject.com/Articles/30342/Remove-Microsoft-mshtml-dependency


1 Comment

Thanks for your time and answer. Yes, I know there are many other parsers. After waiting two or three days with no reply here, I picked another HTML parser mentioned in another SO thread.
3

Do you really care about the specified flag? You said you want to process all attributes; if so, you don't need to check the flag at all, just process every attribute.
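To make the idea concrete, here is a rough, portable sketch of "ignore the flag, but only keep attributes that really carry a value". It does not touch MSHTML: AttrRecord and CollectAttributes are hypothetical names, and the hasValue field is an assumed stand-in for whatever check you do on the VARIANT returned by get_nodeValue.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of the attribute collection.
// In real code these fields would be filled from get_nodeName,
// get_nodeValue and get_specified on IHTMLDOMAttribute.
struct AttrRecord {
    std::string name;
    bool hasValue;       // false when the returned VARIANT holds no string
    std::string value;   // only meaningful when hasValue is true
    bool specified;      // the flag that turned out to be unreliable
};

// Collect every attribute that carries a string value,
// deliberately ignoring the specified flag.
std::vector<std::pair<std::string, std::string> >
CollectAttributes(const std::vector<AttrRecord>& attrs)
{
    std::vector<std::pair<std::string, std::string> > result;
    for (const AttrRecord& a : attrs) {
        assert(!a.name.empty());  // every attribute is expected to have a name
        if (a.hasValue) {
            result.emplace_back(a.name, a.value);
        }
    }
    return result;
}
```

In the MSHTML loop, a check such as vValue.vt == VT_BSTR && vValue.bstrVal != NULL before reading bstrVal would probably play the role of hasValue, which also avoids reading a string out of an empty VARIANT.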

Also, if I were you, I'd use CComPtr instead of all the raw COM pointers.
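For anyone unfamiliar with CComPtr, the sketch below shows the idea it is built on, without needing ATL or Windows. It is not the ATL class: FakeComObject, ScopedRef and FetchObject are hypothetical names made up for illustration, and the reference counting is simplified.

```cpp
#include <cassert>

// Hypothetical stand-in for a COM object: it only counts references.
struct FakeComObject {
    int refs = 1;  // the creator holds the first reference
    void AddRef()  { ++refs; }
    void Release() { assert(refs > 0); --refs; }
};

// Minimal owning wrapper in the spirit of CComPtr (not the ATL class itself):
// the destructor releases the held reference, so no code path can forget it.
template <typename T>
class ScopedRef {
public:
    ScopedRef() : p_(nullptr) {}
    ~ScopedRef() { if (p_) p_->Release(); }
    ScopedRef(const ScopedRef&) = delete;
    ScopedRef& operator=(const ScopedRef&) = delete;

    // Hands out the inner pointer's address for QueryInterface-style out-parameters.
    T** Receive() { assert(p_ == nullptr); return &p_; }
    T* operator->() const { assert(p_ != nullptr); return p_; }
    explicit operator bool() const { return p_ != nullptr; }

private:
    T* p_;
};

// Hypothetical out-parameter API, shaped like get_attributes(&pDisp):
// the callee adds a reference that the caller must release.
inline void FetchObject(FakeComObject* source, FakeComObject** out)
{
    source->AddRef();
    *out = source;
}
```

With ATL available, CComPtr<IHTMLDOMNode> (from atlbase.h) would take the place of ScopedRef, and &pElemDN can be passed where the code above uses Receive(). The manual Release() calls at the end of the function then go away.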

1 Comment

I am not very familiar with Visual Studio or more advanced C++ facilities such as CComPtr. I don't know in advance which attributes a tag has, so if I call get_nodeValue() without checking the specified flag, it returns a null pointer and sometimes even a bad pointer.
2

I've never worked with this before, but according to the library docs and DOM specs, it seems that get_nodeValue() does different things depending on the type of "node object". Try calling get_nodeValue() or get_nodeName() on the IHTMLDOMNode object. It seems clear that some properties like "value", "ID" and "Name" are not part of the attribute collection under the DOM.


MSHTML docs:

DOM spec:

3 Comments

Thank you for your time. Actually, get_nodeName() on the node returns the tag name (INPUT in my case), not the attribute name. I have also tried almost all of the IHTMLDOMNode interface methods in my code.
Also, the problem is not in get_nodeValue(). My note shows that this function returns the correct value, but get_specified returns false even though the attribute is specified in the tag.
Sorry, I must have misunderstood the question (never used this library before). Both documents listed in my answer state that the specified flag should be true for the value attribute. This is an old MS library and it may have bugs though. I'd recommend switching to a more generic XML parsing engine like cpz suggested in his answer.
2

Check whether the element is an input, then query it for the IHTMLInputElement interface (IID_IHTMLInputElement) and call get_value.
