I am working on a web scraping problem in Ruby. I have seen multiple related questions and answers, but none of them deal with HTML generated by a JavaScript framework, and I cannot figure out how to handle it. I just want to select values from the HTML and return an array of objects. My script and the HTML are below. The elements holding the name, currency, and balance have similar HTML classes, so how can this be done?

content = document.css("div.acc-list").map do |parameters| 
    name = parameters.at_css("p.s3.bold.row.acDesc").text.strip, # argument?
    currency = parameters.at_css(".row.ccy").text.strip, # argument?
    balance = parameters.at_css(".row.acyOpeningBal").text.strip # argument?
    Account.new name, currency, balance
end
pp content

These HTML paragraphs are nested inside several other elements, which I think is due to the framework. However, they are all inside a <div class="acc-list">...</div>, so I think I was correct to pass "div.acc-list" to document.css when building the "content" variable.

<!-- HTML for name -->

<td bindonce="" ng-repeat="col in gridOptions.columns" sg-bind-html-compile="col.cellTemplate" bo-class="col.className" bo-style="{width: col.remWidth }" 
class="ng-scope icon-two-line-col" style="width: 17.3333rem;">
  <div style="width: 17.333333333333332rem" class="first-cell cellText ng-scope">
    <i bo-class="{'active':row.selected }" class="i-32 active icon i-circle-account"></i>
    <div class="info-wrapper" style="">
      <p class="s3 bold" bo-bind="row.acDesc">Name_value</p>   # value
      <a ui-sref="app.layout.ACCOUNTS.DETAILS.{ID}({id:'091601003439274'})" href="/Bank/accounts/details/BG37FINV91503006938102">
        <span bo-bind="row.iban">BG37FINV91503006938102</span>
        <i class="i-arrow-right-5x8"></i>
      </a>
    </div>
  </div>
</td>


<!-- HTML for currency -->

<td bindonce="" ng-repeat="col in gridOptions.columns" sg-bind-html-compile="col.cellTemplate" bo-class="col.className" bo-style="{width: col.remWidth }" 
class="ng-scope" style="width: 4.4rem;">
    <div style="width: 4.4rem" class="text-center cellText ng-scope">
        <span bo-bind="row.ccy">EUR</span>   # value
    </div>
</td>


<!-- HTML for balance -->

<td bindonce="" ng-repeat="col in gridOptions.columns" sg-bind-html-compile="col.cellTemplate" bo-class="col.className" bo-style="{width: col.remWidth }" 
class="ng-scope" style="width: 8.73333rem;">
    <div style="width: 8.733333333333333rem" class="text-right cellText ng-scope">
        <span bo-bind="row.acyAvlBal | sgCurrency">1 523.08</span>   # value
    </div>
</td>

Comments

  • Your approach won't work because row.acDesc is not a CSS class, so at_css won't be able to match it as one. I would try Nokogiri's XPath syntax, which can do more complex selection based on arbitrary attributes such as bo-bind. Commented Sep 11, 2019 at 21:20
  • We need a minimal and accurate example of the HTML. Strip everything from the HTML that is not essential to the question. We also need your expected output based on that input. As is, it's difficult to help because there's a lot of visual noise. See "MCVE". Commented Apr 14, 2020 at 7:11

1 Answer

Using CSS:

require 'nokogiri'

document = Nokogiri::HTML(<<EOT)
<div class="acc-list">
  <!-- HTML for name -->
  <td>
    <div class="first-cell cellText ng-scope">
      <div class="info-wrapper">
        <!-- # value -->
        <p class="s3 bold">Name_value</p> 
      </div>
    </div>
  </td>


  <!-- HTML for currency -->
  <td>
    <div class="text-center cellText ng-scope">
      <!-- # value -->
      <span>EUR</span> 
    </div>
  </td>

  <!-- HTML for balance -->
  <td>
    <div class="text-right cellText ng-scope">
      <!-- # value -->
      <span>1 523.08</span> 
    </div>
  </td>
</div>
EOT

Now that the DOM is loaded:

content = document.css('div.acc-list').map do |div|
  name = div.at("p.s3.bold").text.strip # => "Name_value"
  currency = div.at("div.text-center > span").text.strip # => "EUR"
  balance = div.at("div.text-right > span").text.strip # => "1 523.08"
  [ name, currency, balance ]
end
# => [["Name_value", "EUR", "1 523.08"]]

Your HTML sample has a lot of extraneous information that obscures the trees in this particular forest. I stripped it out because it wasn't useful. (When submitting a question, you should do that yourself: remove the non-essential information so we can all focus on the actual problem.)

These CSS selectors only need the node name and class; an id would work the same way. Class names can be chained in a selector, as in p.s3.bold, if you need that granular access, but often you can get away with a more general class selector; it just depends on the HTML.

Most XML and HTML parsing is basically the same tactic: Find an outer placeholder, look inside it and iterate grabbing the information needed. I can't demonstrate that completely because your example only has the outer div, but you can probably imagineer the necessary code to handle an inner loop.

at_css is almost equivalent to at, and Nokogiri is smart enough 99.9% of the time to determine whether a selector is CSS or XPath, so I tend toward using at because my fingers are lazy.
