
I am currently using the jsoup parser to take in some HTML and change the titles of specific tags. The problem is that my HTML is altered when it goes through the jsoup parser. Is there a way to tell jsoup not to add html/body tags, or not to add missing tags at all?

It seems to be altering my tables.

Original

 <div class="mstrPanelPortrait">
         <table cellpadding="0" class="pane" cellspacing="0">
            <tr>
               <td>
                  <table class="pane" cellspacing="0">
                     <tr>
                        <td class="mstrPanelBody" sty="body">
                           <div>
                              <div class="mstrBrowser">

After being put through jsoup

 <div class="mstrPanelPortrait" title="darrensTest">
         <table cellpadding="0" class="pane" cellspacing="0">
            <tbody>
               <tr>
                  <td>
                     <table class="pane" cellspacing="0">
                        <tbody>
                           <tr>
                              <td class="mstrPanelBody" sty="body">

You can see that tbody was added in a few places. I'm not sure why.

Entire HTML

<div or="2" class="mstrTransform" cx="[0,1,2,3,4]" id="FolderObjectBrowser_display" ty="editor" cxid="FolderObjectBrowser_display_cmm" rsz="0" dg="0" iframe="true" style="display:block;" name="FolderObjectBrowser_display" scriptclass="mstrReportAllObjectsImpl" ors="3">
   <form id="FolderObjectBrowser_display_form" name="FolderObjectBrowser_display_form" target="frameManager" action="mstrWeb" method="post" onsubmit="appendPageState(this);">
      <input id="iframe" name="iframe" value="true" class="mstrHiddenInput" type="hidden"/>
      <input name="evt" value="5005" class="mstrHiddenInput" type="hidden"/>
      <input name="src" value="mstrWeb.report.5005" class="mstrHiddenInput" type="hidden"/>
      <div class="mstrPanelPortrait">
         <table cellpadding="0" class="pane" cellspacing="0">
            <tr>
               <td>
                  <table class="pane" cellspacing="0">
                     <tr>
                        <td class="mstrPanelBody" sty="body">
                           <div>
                              <div class="mstrBrowser">
                                 <div id="folerBoxContainerID" class="folerBoxContainer">
                                    <select name="oeFolderID" class="mstrAncestors" sty="folderList">
                                       <option selected="1" title="MicroStrategy Tutorial" level="0" value="D43364C684E34A5F9B2F9AD7108F7828">MicroStrategy Tutorial</option>
                                       <option islink="true" title="Data Explorer" level="0" value="37ED6C6202E14C3181F1F4A043A1CAA8">Data Explorer</option>
                                       <option islink="true" title="My Personal Objects" level="0" value="8D67908E11D3E4981000E787EC6DE8A4">My Personal Objects</option>
                                       <option islink="true" title="Attributes" level="0" value="6F55FB47F9974EABA18CB0C5FF46785C">Attributes</option>
                                       <option islink="true" title="Metrics" level="0" value="E0CCB9CF22104A489CBE78D974AFD19E">Metrics</option>
                                       <option islink="true" title="Hierarchies" level="0" value="C2A0BB1ACAAD45A18B8CA8AECF0A35EE">Hierarchies</option>
                                    </select>
                                    <a><img id="upFolder" title="Up One Level" alt="Up One Level" name="upFolder" class="mstrIcon-btn mstrIcon-btnUpFolderDisabled" src="../images/1ptrans.gif"/></a><a target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=83005&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.83005"><img id="changeFormat" title="Tree" alt="Tree" name="changeFormat" class="mstrIcon-btn mstrIcon-btnChangeDisplayFormatTree" src="../images/1ptrans.gif"/></a>
                                 </div>
                                 <div class="mstrSearchDiv"><span id="name_label">Find:</span><input id="searchArg" name="name" value="" class="mstrInputText" onkeydown="return microstrategy.bone('FolderObjectBrowser_display').checkForFormSubmit(arguments[0]);" type="text"/><input id="search" title="Find" alt="Find" name="98002" class="mstrIcon-btn mstrIcon-btnFind" src="../images/1ptrans.gif" border="0" type="image"/></div>
                                 <div style="position:relative">
                                    <div sty="fileList">
                                       <div id="list" class="mstrSmallIconView">
                                          <div title="Folder:  Project Builder; Folder for all the objects created by Project Builder" dss_ty="8"><span class="mstrIcon-lv-f mstrIcon-lv"><span></span></span><a title="Folder for all the objects created by Project Builder" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=42EEDD41A6954F7485453C170AA3F8BE">Project Builder</a></div>
                                          <div title="Folder:  Project Objects" dss_ty="8"><span class="mstrIcon-lv-f mstrIcon-lv"><span></span></span><a title="" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=02C37D85EE25483AA5708E2BFE858B92">Project Objects</a></div>
                                          <div title="Folder:  Public Objects; Folder for all public objects" dss_ty="8"><span class="mstrIcon-lv-f mstrIcon-lv"><span></span></span><a title="Folder for all public objects" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=98FE182C2A10427EACE0CD30B6768258">Public Objects</a></div>
                                          <div title="Folder:  Schema Objects; Folder for all schema objects" dss_ty="8"><span class="mstrIcon-lv-f mstrIcon-lv"><span></span></span><a title="Folder for all schema objects" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=95C3B713318B43D490EE789BE27D298C">Schema Objects</a></div>
                                          <div title="Folder:  Data Explorer; Hierarchy groups folder" dss_ty="8"><span class="mstrIcon-lv mstrIcon-lv-fh"><span class="sc"></span></span><a title="Hierarchy groups folder" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=37ED6C6202E14C3181F1F4A043A1CAA8">Data Explorer</a></div>
                                          <div title="Folder:  My Personal Objects" dss_ty="8"><span class="mstrIcon-lv mstrIcon-lv-fmo"><span class="sc"></span></span><a title="" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=8D67908E11D3E4981000E787EC6DE8A4">My Personal Objects</a></div>
                                          <div title="Folder:  Attributes" dss_ty="8"><span class="mstrIcon-lv mstrIcon-lv-fa"><span class="sc"></span></span><a title="" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=6F55FB47F9974EABA18CB0C5FF46785C">Attributes</a></div>
                                          <div title="Folder:  Metrics" dss_ty="8"><span class="mstrIcon-lv mstrIcon-lv-fm"><span class="sc"></span></span><a title="" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=E0CCB9CF22104A489CBE78D974AFD19E">Metrics</a></div>
                                          <div title="Folder:  Hierarchies" dss_ty="8"><span class="mstrIcon-lv mstrIcon-lv-fh"><span class="sc"></span></span><a title="" target="frameManager" class="mstrLink" onclick="return submitLink(this, event);" href="mstrWeb?iframe=true&evt=98001&src=mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001&oeFolderBlockBegin=1&oeFolderID=C2A0BB1ACAAD45A18B8CA8AECF0A35EE">Hierarchies</a></div>
                                       </div>
                                    </div>
                                 </div>
                                 <table id="FolderObjectBrowser_display_oCount" width="100%" name="FolderObjectBrowser_display_oCount" cellpadding="0" border="0" cellspacing="2">
                                    <tr>
                                       <td align="LEFT">&nbsp;4 item(s) found</td>
                                    </tr>
                                 </table>
                              </div>
                              <input name="evt" value="98001" class="mstrHiddenInput" type="hidden"/><input name="src" value="mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98001" class="mstrHiddenInput" type="hidden"/><input name="evt" value="98002" class="mstrHiddenInput" type="hidden"/><input name="src" value="mstrWeb.report.frame.accordion.tbObjBrwsr.pbt.FolderObjectBrowser.98002" class="mstrHiddenInput" type="hidden"/><input id="evtorder" name="evtorder" value="98001,98002" class="mstrHiddenInput" type="hidden"/>
                           </div>
                        </td>
                     </tr>
                  </table>
               </td>
            </tr>
         </TABLE>
      </div>
      <div class="mstrSpaceAfterEditor"><img title="" height="3" alt="" width="1" src="../images/1ptrans.gif" border="0"/></div>
   </form>
</div>
  • Please provide the entire code of the table, until the </table>. Otherwise we can't tell whether Jsoup is trying to correct malformed HTML. Commented Oct 7, 2013 at 20:07
  • Yeah, I assume it's trying to correct malformed HTML. Is there any way to tell it not to do so? :) Commented Oct 7, 2013 at 20:08
  • Dunno about jsoup, but Jericho (jericho.htmlparser.net/docs/index.html) "is a java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML." Commented Oct 7, 2013 at 20:32

1 Answer

As far as I know, there is no way to tell Jsoup to not balance tags.

What you can use instead is the non-default XML parser, which won't add any new tags (such as tbody); it only closes tags that were left open.

So, is it okay if Jsoup doesn't add any tags, but instead only balances the HTML?

If the answer to that question is yes, then you should use the XML parser instead of the default HTML parser.

// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.parser.Parser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());

This parses the HTML and adds any missing closing tags, but it does not insert elements that aren't already there (such as tbody), so the structure is unchanged. You can then select from the document in the normal Jsoup fashion. For the fragment in the question, the output looks like this:

<div class="mstrPanelPortrait"> 
 <table cellpadding="0" class="pane" cellspacing="0"> 
  <tr> 
   <td> 
    <table class="pane" cellspacing="0"> 
     <tr> 
      <td class="mstrPanelBody" sty="body"> 
       <div> 
        <div class="mstrBrowser"></div>
       </div></td>
     </tr>
    </table></td>
  </tr>
 </table>
</div>  
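
To tie this back to the title change in the question, here is a minimal sketch of the whole round trip. The class name `XmlParserSketch`, the helper `retitle`, and the shortened sample markup are made up for illustration; the jsoup calls themselves (`Jsoup.parse` with `Parser.xmlParser()`, `select`, `attr`, `outputSettings().prettyPrint(false)`, `outerHtml`) are existing API. It assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class XmlParserSketch {

    // Hypothetical helper: parse with the XML parser, set a title on every
    // element matching cssQuery, and return the serialized markup.
    static String retitle(String html, String cssQuery, String title) {
        // The XML parser does not build an html/head/body skeleton and
        // does not insert implied elements such as tbody.
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());

        // Turn off pretty-printing so jsoup does not re-indent the markup.
        doc.outputSettings().prettyPrint(false);

        for (Element el : doc.select(cssQuery)) {
            el.attr("title", title);
        }
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the markup in the question.
        String html = "<div class=\"mstrPanelPortrait\">"
                + "<table class=\"pane\"><tr><td>cell</td></tr></table>"
                + "</div>";
        System.out.println(retitle(html, "div.mstrPanelPortrait", "darrensTest"));
    }
}
```

Unclosed tags are still closed on output, so this only helps when balanced markup is acceptable.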

5 Comments

I really would like it to not alter the html at all. The closing tags may be somewhere else on the page. I will give the xml parser a shot and see what happens.
As far as I know, there is at the moment no way to leave the HTML unbalanced after letting Jsoup parse it. I have seen some patch-suggestions for this to be implemented, and I do believe it might be a choice in upcoming versions.
Thanks for the information. I may need to create my own "hacky" parser in this instance to perform this action. Thanks for the help.
I found out that jsoup is replacing something with &amp; when I output the text using doc.toString(). Do you know what this might be?
You'll have to be more specific. &amp; is just the way HTML escapes the & (ampersand) character; it is an HTML character entity.
