
I have written a simple screen scraping script, and at the end of the script I am attempting to create an array of arrays in preparation for an ActiveRecord insert. The structure I am trying to achieve is as follows:

Array b holds a series of 10 element arrays

b = [[0,1,2,3,4,5,6,7,8,9],[0,1,2,3,4,5,6,7,8,9],[0,1,2,3,4,5,6,7,8,9]]

Currently, when I try to print out array b, it is empty. I'm still fairly new to Ruby, and to programming for that matter, and would appreciate any feedback on how to get values into array b and how to improve the overall script. The script follows:

require "rubygems"  
require "celerity"
t = 0
r = 0
c = 0
a = Array.new(10)
b = Array.new

  #initialize Browser
  browser = Celerity::IE.new
  #goto Login Page
  browser.goto('http://www1.drf.com/drfLogin.do?type=membership')
  #input UserId and Password
  browser.text_field(:name, 'p_full_name').value = 'username'
  browser.text_field(:name, 'p_password').value = 'password'
  browser.button(:index, 2).click
  #goto DRF Frontpage
  browser.goto('http://www.drf.com/frontpage')
  #goto DRF Entries
  browser.goto('http://www1.drf.com/static/indexMenus/eindex.html')
  #click the link to access the entries
  browser.link(:text, '09').click

  browser.tables.each do |table|
    t = t + 1
      browser.table(:index, t).rows.each do |row| 
        r = r + 1
          browser.table(:index, t).row(:index, r).cells.each do |cell|
            a << cell.text
          end
          b << a
          a.clear         
      end
      r = 0
  end
  puts b
  browser.close
    Note that instead of posting the bulk of the script, you could easily replace the lines dealing with browser, thus producing a smaller, more self-contained example. As is, the sample isn't runnable without a username and password, and at least half of the code is extraneous, at least as far as this question is concerned. Commented Dec 15, 2010 at 3:24

3 Answers


This is a minor rewrite of your main loop in a more Ruby-like style.

b = Array.new
browser.tables.each_with_index do |table, t|
  browser.table(:index, 1 + t).rows.each_with_index do |row, r|
    a = [] # start each row empty; Array.new(10) would prefill ten nils
    browser.table(:index, 1 + t).row(:index, 1 + r).cells.each do |cell|
      a << cell.text
    end
    b << a
  end
end
puts b

I moved the array initializations to immediately above where they'll be needed. That's a programmer-choice thing of course.

Rather than create two counter variables up above, I switched to using each_with_index which adds an index variable, starting at 0. To get your 1-offsets I add 1.
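As a small illustration of the index handling described above, here is a sketch using a plain array of strings (hypothetical data, not the browser tables). It also shows each.with_index(1), which in Ruby 1.9 and later starts the counter at 1 so the manual offset is not needed:

```ruby
rows = %w[first second third]

# each_with_index counts from 0, so add 1 to get a 1-based position
offset_by_hand = []
rows.each_with_index { |row, i| offset_by_hand << [i + 1, row] }

# each.with_index(1) starts the counter at 1 directly (Ruby 1.9+)
offset_by_arg = []
rows.each.with_index(1) { |row, i| offset_by_arg << [i, row] }

p offset_by_hand
p offset_by_arg
```

Both arrays pair each string with the same 1-based position, so either form should work for the table and row lookups.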

They're not big changes but they add up to a more cohesive app.

Back to the original code: one issue I see is that you create your a array outside the loops and then reuse it each time you append to b. Every element of b is therefore a reference to the same array object, so clearing a and refilling it changes every entry already stored in b, and b ends up holding duplicates of whatever a contains last.

require 'pp'

a = []
b = []

puts a.object_id

a[0] = 1
b << a
a.clear

a[0] = 2
b << a

puts 
pp b
b.each { |ary| puts ary.object_id }
# >> 2151839900
# >> 
# >> [[2], [2]]
# >> 2151839900
# >> 2151839900

Notice that the a array gets reused repeatedly.

If I assign a new array to a instead of clearing it, b holds two different values, because its two elements are now separate objects:

require 'pp'

a = []
b = []

puts a.object_id

a[0] = 1
b << a
a = []

a[0] = 2
b << a

puts 
pp b
b.each { |ary| puts ary.object_id }
# >> 2151839920
# >> 
# >> [[1], [2]]
# >> 2151839920
# >> 2151839780

Hopefully that'll help you avoid the problem in the future.


2 Comments

No problem. Dealing with pointers, AKA references to objects or structures, is one of the first hurdles we have to clear when learning to program. They'll make sense once you grok them and are really powerful, but you'll get bitten by them in the meantime. Ruby insulates us from them; I don't think I could program in Perl or C without them. :-)
@Mutuelinvestor, did you get your array to populate correctly? If not, update your original question with your current state of things and we'll keep working on it.

Your problem is there at the end:

b << a # push a *reference to* a onto b
a.clear # clear a; the reference in b now points to an empty array!

If you remove the reference to a.clear and start that loop with:

browser.tables.each do |table|
  t = t + 1
  a = []

...you'll be golden (at least as far as your array-building goes)
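If the goal is one array per row, the fresh array probably needs to be created inside the row loop rather than the table loop, with a.clear removed. Here is a minimal sketch of that idea; the nested arrays are a hypothetical stand-in for the Celerity tables, rows and cell texts:

```ruby
# Hypothetical stand-in for the scraped page: tables -> rows -> cell texts
tables = [
  [%w[a b c], %w[d e f]],
  [%w[g h i]]
]

b = []
tables.each do |table|
  table.each do |row|
    a = []                        # a new array object for every row
    row.each { |cell| a << cell } # collect this row's cells
    b << a                        # b keeps this row; nothing clears it later
  end
end

p b
```

Because each row gets its own array, b ends up with one entry per row instead of repeated references to a single array.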

1 Comment

Bill thanks for the response. I tried this and I'm now getting data in the array. but unfortunately instead of getting 1 array with 10 elements (i.e. the row of data) I'm getting 10 arrays with 10 elements. Therefore I moved the a =[] down under the r = r + 1 thinking this would give me what I'm looking for and now I'm back to getting no data in the array.

I can't tell from your question whether you have multiple tables or not. Maybe just one? In which case:

b = browser.tables.first.rows.map {|row| row.cells.map(&:text)}

If you have multiple tables, and really want an array (tables) of arrays (rows) of arrays (cells), that would be

b = browser.tables.map {|t| t.rows.map {|row| row.cells.map(&:text)}}

And if the tables all have the same structure and you just want all the rows as if they were in one big table, you can do:

b = browser.tables.map {|t| t.rows.map {|row| row.cells.map(&:text)}}.flatten(1)
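To see the difference between the nested and flattened results without a browser, here is a sketch that uses Struct objects as hypothetical stand-ins. They only mimic the rows, cells and text methods used above and are not the Celerity classes:

```ruby
# Hypothetical stand-ins that respond to #rows, #cells and #text
Cell  = Struct.new(:text)
Row   = Struct.new(:cells)
Table = Struct.new(:rows)

tables = [
  Table.new([Row.new([Cell.new("a"), Cell.new("b")]),
             Row.new([Cell.new("c"), Cell.new("d")])]),
  Table.new([Row.new([Cell.new("e"), Cell.new("f")])])
]

# Three levels deep: tables -> rows -> cell texts
nested = tables.map { |t| t.rows.map { |row| row.cells.map(&:text) } }

# flatten(1) removes only the table level, leaving rows -> cell texts
flat = nested.flatten(1)

p nested
p flat
```

The flattened form is the array-of-rows shape shown in the question, which is likely what a bulk insert expects.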

4 Comments

Glen, thanks for the reply. I actually have multiple tables. I'm unfamiliar with the map command, so once I do a little research I'll fire up your suggestion. How does your response change under the multiple table scenario?
Do all the tables have the same row structure? Or do you really want an array (tables) of arrays (rows) of arrays (cells)?
The tables have the same row and cell structure. I'm inserting data into a database using a bulk insert gem: rorstuff.blogspot.com/2010/05/… My data will typically have 8 to 10 tables, with each table having somewhere between 5 and 12 rows. Each row will have 10 cells (td). This being the case, am I correct in assuming that your second option is consistent with what I'm trying to achieve?
As I said above, use the second version if you want 3-deep arrays (table/row/cell), the third version if you want to essentially make all the tables into one long 2-deep array.
