0

I am doing some web scraping and come across several tables of data that I want to query against. Currently I'm up to:

$url = 'http://finance.yahoo.com/q/op?s=QQQQ&m=2012-04';
$html = @DOMDocument::loadHTMLFile($url); 
$xml = simplexml_import_dom($html); 
$results = $xml->xpath('//table[@class="yfnc_datamodoutline1"]');
var_dump($results);

Produces results: http://pastebin.com/6p3L2Kcc

This is well-ordered HTML table data, with TH and TD and everything. I'd like to use it like this:

$sql = 'SELECT Last,Open_Int FROM TABLE1 WHERE Last>25 AND Symbol LIKE "%C%"';
$results = $xmltable->sql($sql);
while($result = $results->fetch_assoc())
  echo $result['Last'] . " -- " . $result['Open_Int'] . "\n";

Without any creativity, I can write classes to parse that HTML table, take the first row, create a table in sqlite, select other rows and turn them into insert statements. But, do you know a better way to do this, or is there some powerful PHP function that I'm not seeing?

Update: Perhaps the scope here is too big. I'd be happy with a link to a library or advice on getting an HTML table in to a (proper) XML table.

4
  • Is there a good reason why you're loading the document with BOTH DOMDocument and SimpleXML? Commented Mar 5, 2011 at 19:44
  • The "simple" approach I'm referring to is: get data using this method phpro.org/examples/Parse-HTML-With-PHP-And-DOM.html and then insert data in a database. The question is: is there a better way to get it done than that? Commented Mar 5, 2011 at 19:46
  • @Mark: nope, I didn't know simplexml accepted html directly, thank you Commented Mar 5, 2011 at 19:48
  • Yeah, try simplexml_load_file() or simplexml_load_string(). For your actual question, if you really need full SQL ability, then you need to store it in a database. However, do you really need all that SQL provides? If not, you may do well to write your own function to do the queries using XPath. Commented Mar 5, 2011 at 19:54

1 Answer 1

1

The answer depends on your larger needs. Here are three questions that can flesh those out:

1) How often is the data read vs. written?

2) Do you keep old versions or is only the latest required?

3) Will the data be compared to other data?

In one case let's say the answer to #1 is "many more reads" and the answer to #3 is "yes". In this case it might be well worthwhile to put the XML results into a SQL table for frequent and flexible querying.

However, in another case, let's say the answer to #2 is "no" and the answer to #3 is "no" -- you just keep the latest retrieval and don't compare it to anything. In this case you can just stick into a file and retrieve it as needed for display (#1 becomes kind of irrelevant).

EDIT in response to question in comment: Assuming you want to put it into a database, the display you link to shows a nested set of objects/arrays. You "walk the tree" to peel out the nested objects, strip off their properties and issue individual inserts to the particular tables.

Sign up to request clarification or add additional context in comments.

1 Comment

I agree with your response. However, what you are describing is using the data after it is in a unable format. My question is asking how to get the data (currently in HTML, scraped from the web) into a useful format.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.