1

I'm trying to parse an HTML file which I have converted to a TXT file inside of Automator.

I previously downloaded the HTML file from a website using Automator, and I am now struggling to parse the source code.

Ideally, I want to extract only the contents of the table, and I need to repeat this for 1,800 different HTML files.

Here is an example of the source code:

</head>
<body>
<div id="header">
    <div class="wrapper">
        <span class="access">
        <div id="fb-root"></div>


    <span class="access">
     Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>       Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>

    </span>
                                    </span>
    </div><!-- /wrapper -->
</div><!-- /header -->

<div id="masthead">
    <div class="wrapper">   
        <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
        <div id="navigation">
            <ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>    <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>               
        </div><!-- /navigation -->

    </div><!-- /wrapper -->     
</div><!-- /masthead -->


<div id="content">
    <div class="wrapper">
        <div id="main-content">

 <!-- per Project stuff -->
    <span class="section">
                <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
                <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
                                    <ul class="gbutton-group right">
                    <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
                    <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752"  id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
                </ul>

                <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
                <span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
                <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
                </div>
                                    <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>

            </span>

            <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
                                                        <tr>
                    <th>Role</th>
                    <td>
                    <p>Other</p>                            </td>
                </tr>
                <tr>  
                    <th>Organisation Type</th>
                    <td>
                    <p>Asset Manager</p>                        </td>
                </tr>
                <tr>
                    <th>Email</th>
                    <td><a href="mailto:[email protected]" title="[email protected]" >[email protected]</a></td>
                </tr>
                <tr>
                    <th>Website</th>
                    <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
                </tr>
                <tr>
                    <th>Phone</th>
                    <td>41 78 616 7334</td>
                </tr>
                <tr>
                    <th>Fax</th>
                    <td></td> 
                </tr>
                <tr>
                    <th>Mailing Address</th>
                    <td>Birrenstrasse 30</td>
                </tr>
                <tr>
                    <th>City</th>
                    <td>Schindellegi</td>
                </tr>
                <tr>
                    <th>State</th>
                    <td>CH</td>
                </tr>
                <tr>
                    <th>Country</th>
                    <td>Switzerland</td>
                </tr>
                <tr>
                    <th class="lastrow" >Zip/ Postal Code</th>
                    <td class="lastrow" >8834</td>
                </tr>
        </table>
                </div><!-- /main-content -->
                    <div id="sidebar"  >
                    </div>

            <div id="similar_sidebar" class="similar_refine" >



            </div>
                            </div><!-- /wrapper -->
</div><!-- /content -->

<div id="footer">

</div>

My AppleScript attempt, which uses text item delimiters to extract the table:

set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL

to extractBetween(SearchText, startText, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText
    set endItems to text of text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text of text item 1 of endItems
    set AppleScript's text item delimiters to tid
    return beginningToEnd
end extractBetween

How can I parse the table from the HTML file?

1
  • But you wouldn’t feel that way if AppleScript had saved you from literally years of grunt work, the way it has for millions of people in publishing and other creative fields. Commented Aug 11, 2014 at 19:57

4 Answers

5

Rather than write your own HTML parser, you can use the HTML parser in Safari via the do JavaScript command. JavaScript has built-in functionality for working with HTML elements and data.

This script gets the inner HTML of the first table in the page open in Safari's front document:

tell application "Safari"
    tell document 1
        set theFirstTableHTML to do JavaScript "document.getElementsByTagName('table')[0].innerHTML"
    end tell
end tell

You can use this technique to apply basic DOM scripting to any page and pull out whatever data you need, for example just the values of the table cells.
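For a batch of saved files, the same "values of the table cells" idea can also be sketched without Safari, using Python's standard-library html.parser. This is not the method from the answer above, only a minimal sketch under assumptions: the names TableRows and table_rows are hypothetical, and each page is assumed to hold one non-nested table with id "profile-table" whose rows pair a single th with a single td, as in the question's sample.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect (header, value) pairs from the th/td cells of one table."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id  # assumed id of the table to read
        self.in_table = False
        self.cell = None          # "th" or "td" while inside a cell
        self.header = ""
        self.text = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag in ("th", "td"):
            self.cell, self.text = tag, []

    def handle_data(self, data):
        # Keep text only while inside a th or td of the target table.
        if self.cell:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag == self.cell:
            value = "".join(self.text).strip()
            if tag == "th":
                self.header = value
            else:
                self.rows.append((self.header, value))
            self.cell = None


def table_rows(html, table_id="profile-table"):
    """Return the table's rows as a list of (header, value) tuples."""
    parser = TableRows(table_id)
    parser.feed(html)
    parser.close()
    return parser.rows
```

Reading each saved page into a string and passing it to table_rows would yield pairs such as ("Role", "Other"); an empty cell like Fax comes back as an empty string.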

1

You're really close. The problem is your startText value. The literal string "<table>" never appears in the HTML, because the opening tag carries attributes, so the delimiter is never found. The line that starts the table is actually...

<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">

So I modified your code to look for that tag in 2 steps. First...

<table

And then this separately...

>

This way we can ignore the attributes inside the table tag (width, border, etc.), which I assume vary between files, and get back only the contents of the table. Try this...

set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText1
    set endItems to text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text item 1 of endItems
    set AppleScript's text item delimiters to startText2
    set finalText to (text items 2 thru -1 of beginningToEnd) as text
    set AppleScript's text item delimiters to tid
    return finalText
end extractBetween
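For comparison, the same two-step split can be sketched in Python. This is an illustrative transliteration rather than part of the answer, and the function name extract_between is hypothetical. One difference worth noting: if the closing ">" is missing, this sketch returns the text unchanged, whereas the AppleScript handler would raise an error on text item 2.

```python
def extract_between(search_text, start1, start2, end_text):
    """Return the text after `start1 ... start2` and before `end_text`."""
    # Everything after the last occurrence of start1 (like text item -1).
    after_start = search_text.rsplit(start1, 1)[-1]
    # Everything before the first occurrence of end_text (like text item 1).
    up_to_end = after_start.split(end_text, 1)[0]
    # Drop the rest of the opening tag: keep what follows the first start2.
    return up_to_end.split(start2, 1)[-1]
```

Calling extract_between(html, "<table", ">", "</table>") mirrors the AppleScript call above: it skips the attributes of the opening tag and returns only what lies between the table tags.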

4 Comments

However, with a simple "get paragraphs" I was able to select the content I wanted: set sourceFile to (choose file) // set newContent to read sourceFile // get paragraphs 247 thru 306 of newContent
That is a single file solution.
Not sure why it didn't work for you. It works... believe me. There must be something else you're doing. As adayzdone mentions, your solution won't work for more than the one file. The paragraph numbers will obviously change with each file.
@regulus6633 You were right, sorry. I was trying to modify the HTML tags to better suit what I was parsing and it wasn't working (not sure why, still trying). But then I tried the AppleScript with the source code you provided and it worked perfectly. Thanks
0

Try:

-- grep matches one line at a time, so flatten the newlines first
set xxx to read alias "Mac OS X:Users:paolo:Desktop:paolo.html"
set yyy to do shell script "echo " & quoted form of xxx & " | tr '\\n' ' ' | grep -o '<table.*</table>'"
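Because grep works one line at a time, a pattern that runs from the opening table tag to the closing one only matches when the whole table sits on a single line. A regular expression that lets "." cross newlines avoids that limitation. The following is a minimal Python sketch of the idea, not a full HTML parser: the names TABLE_RE and first_table are hypothetical, and it assumes tables are not nested.

```python
import re

# DOTALL lets "." match newlines, so the pattern can span many lines.
# The non-greedy ".*?" stops at the first closing tag after the match start.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def first_table(html):
    """Return the first <table>...</table> block, or None if there is none."""
    match = TABLE_RE.search(html)
    return match.group(0) if match else None
```

Unlike the greedy grep pattern, this returns only the first table even when the page contains several.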

0

One-line wonder that works:

tell application "Safari" to set sourceCode to characters (offset of ¬
    "<table" in (source of document 1 as string)) thru ((offset of ¬
    "/table" in (source of document 1 as string)) + (count of "/table")) ¬
    of (source of document 1 as string) as string

Note: the script retrieves only the first table.
