2

I am screen scraping using Watir, and I download an XLS file. When I open this file in Notepad, I find that it's just a bunch of HTML tables. Is there any function or gem that will convert this page into a bunch of arrays? Any ideas are appreciated.

1
  • Show us the code: what you have and what you would like to get from it. Commented Oct 18, 2010 at 9:58

3 Answers

1
  1. Narrow it down to ...
  2. Clear out the whitespace
  3. Replace the tabs with "
  4. Replace the tags between adjacent cells with ",
  5. Replace the remaining opening and closing cell tags with nothing
  6. Replace the tags between adjacent rows with |
  7. Split the rows on |
  8. Split the fields on ,

You can simplify it a little bit more, but that's the gist of it.
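As a rough illustration of the steps above, here is a minimal plain-Ruby sketch. It is a variant rather than a literal transcription: instead of inserting `,` and `|` delimiters and splitting on them, it scans for rows and cells directly, which avoids trouble when a cell itself contains a comma or a pipe. The helper name `html_table_to_arrays` is made up for this example, and the sketch assumes a single, non-nested table built from ordinary `<tr>`, `<td>` and `<th>` tags.

```ruby
# Hypothetical helper: turn the first HTML table in a string
# into an array of rows, each row being an array of cell strings.
def html_table_to_arrays(html)
  # Narrow the input down to the first <table>...</table> block.
  table = html[/<table\b.*?<\/table>/mi]
  return [] if table.nil?

  # Every <tr>...</tr> becomes one row.
  table.scan(/<tr\b[^>]*>(.*?)<\/tr>/mi).map do |(row_html)|
    # Every <td> or <th> becomes one field.
    row_html.scan(/<t[dh]\b[^>]*>(.*?)<\/t[dh]>/mi).map do |(cell_html)|
      # Drop nested tags, then collapse tabs, newlines and runs of spaces.
      cell_html.gsub(/<[^>]+>/, "").gsub(/\s+/, " ").strip
    end
  end
end
```

Regular expressions are fragile on real-world HTML (nested tables, unclosed tags), so this is only reasonable for simple, machine-generated exports.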


1

In general, it is a simple exercise to walk through an HTML file with a table and extract the rows and columns, as long as the table doesn't use colspan or rowspan attributes. Those disrupt the logical flow: you have to detect the gaps they cause and fill them in with the repeated value from the spanning cell. The question "How do I parse an HTML table with Nokogiri?" might help.
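The gap filling for colspan and rowspan can be sketched roughly as follows. This assumes the cells have already been pulled out of the HTML by whatever parser you use; `Cell` and `expand_spans` are hypothetical names for this example, not part of Nokogiri or any other library, and every cell is assumed to carry a non-nil text plus its span counts (1 when the attribute is absent).

```ruby
# Hypothetical container: the cell text plus the span attributes read from its tag.
Cell = Struct.new(:text, :colspan, :rowspan)

# Expand rows of Cell objects into a rectangular grid,
# repeating each spanned value into every position it covers.
def expand_spans(rows)
  grid = []
  rows.each_with_index do |cells, r|
    grid[r] ||= []
    col = 0
    cells.each do |cell|
      # Skip columns already filled by a rowspan from an earlier row.
      col += 1 until grid[r][col].nil?
      cell.rowspan.times do |dr|
        grid[r + dr] ||= []
        cell.colspan.times { |dc| grid[r + dr][col + dc] = cell.text }
      end
      col += cell.colspan
    end
  end
  grid
end
```

For a first row holding "A" (rowspan 2) and "B" (colspan 2), followed by a second row holding "C" and "D", this gives two rows of three fields each, with "A" and "B" repeated where they span.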

From looking at XLS files on my desktop I don't think they're XML or HTML. I'm not sure what you downloaded. I did a quick search and roo (http://roo.rubyforge.org/) appears to be a good starting point.


1

XLS is a binary format. If you are seeing HTML tables in the file contents, it probably means the file was not downloaded correctly.

How is the XLS file being downloaded through Watir? Are you having to automate the File Download window, or did you just follow a link to the XLS file and write the contents to a file?
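One way to check what was actually saved is to look at the first bytes of the file. Binary XLS files are stored in an OLE2 compound file container, which starts with the eight-byte signature D0 CF 11 E0 A1 B1 1A E1. The sketch below is only a rough heuristic: `sniff_spreadsheet` is a hypothetical helper name, and the HTML check simply looks for a few common opening tags near the start of the data.

```ruby
# Signature of the OLE2 compound file container used by binary .xls files.
OLE2_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# Hypothetical helper: classify file contents by their leading bytes.
def sniff_spreadsheet(data)
  # Work on a binary copy of the first 512 bytes so encodings don't interfere.
  head = data.byteslice(0, 512).to_s.b
  return :binary_xls if head.start_with?(OLE2_MAGIC)
  return :html if head =~ /<(!doctype|html|table)\b/i
  :unknown
end
```

You could call it as `sniff_spreadsheet(File.binread("report.xls"))`, where `report.xls` stands in for whatever path Watir saved the download to. An `:html` result means the content really is an HTML page saved under an .xls name, so it should be parsed as HTML rather than as a spreadsheet.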
