How to scrape just four numeric values from an HTML web page's table in Java for Android?

Question

Here's my current code:

private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
    String userAgent1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
    try {
        Document doc1 = Jsoup.connect(url).userAgent(userAgent1).get();
        Elements divTags = doc1.getElementsByTag("div");
        String re = "^<div class=\\\"Ta\\(c\\) Py\\(6px\\) Bxz\\(bb\\) BdB Bdc\\(\\$seperatorColor\\) Miw\\(120px\\) Miw\\(100px\\)\\-\\-pnclg D\\(tbc\\)\\\" data-test=\\\"fin-col\\\"><span>.*</span></div>$";

        for (Element div : divTags) {
            Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
            Matcher matcher = pattern.matcher(div.html());

            if (matcher.find()) {
                String data = matcher.group(1);
                Log.d("Matched: ", data);
            }
            else {
                Log.d("Nothing Matched: ", "");
            }
        }
    } catch (Exception e) {
        Log.e("err-new", "err", e);
    }
    return "";
}

This function takes a URL as input (in our case https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2) and extracts all the div tags using Jsoup.

I then need to extract the values using pattern matching, but with the code above every div only logs "Nothing Matched: ".

Here's the web page I'm working with. I am interested in the four numeric values in the first four yearly columns of the row named EBIT (Earnings Before Interest and Taxes).

Link: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2

Input: Looking to get values 122,034,000, 111,852,000, 69,964,000, 69,313,000 on the EBIT row for columns 9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019.

On Inspect, these values are under the following <div> tags.

EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>122,034,000</span></div>

EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>111,852,000</span></div>

EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>69,964,000</span></div>

EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>69,313,000</span></div>

And the same thing for the 4 columns under the Quarterly tab on the same web page. Looking to get values 25,484,000, 23,785,000, 30,830,000, 41,935,000 on the EBIT row for columns 9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021.

EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>25,484,000</span></div>

EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>23,785,000</span></div>

EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>30,830,000</span></div>

EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>41,935,000</span></div>

Output: dates = {9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019}

datesQ = {9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021}

EBIT = {122,034,000, 111,852,000, 69,964,000, 69,313,000}

EBITQ = {25,484,000, 23,785,000, 30,830,000, 41,935,000}

Where Q stands for Quarterly.

Alternatively, it could be two hash maps: yearlyHash = {date1: value1, date2: value2, date3: value3, date4: value4} and quarterlyHash = {date1: value1, date2: value2, date3: value3, date4: value4}
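A minimal sketch of the two-map form, using only the values listed above. The class name `EbitMaps` and the helper `zip` are hypothetical names chosen for illustration. A `LinkedHashMap` is used on the assumption that the column order should be preserved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EbitMaps {
    // Pairs each date with the value at the same index, keeping column order.
    static Map<String, String> zip(List<String> dates, List<String> values) {
        if (dates.size() != values.size()) {
            throw new IllegalArgumentException("dates and values must have the same length");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < dates.size(); i++) {
            result.put(dates.get(i), values.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> yearlyHash = zip(
                Arrays.asList("9/30/2022", "9/30/2021", "9/30/2020", "9/30/2019"),
                Arrays.asList("122,034,000", "111,852,000", "69,964,000", "69,313,000"));
        Map<String, String> quarterlyHash = zip(
                Arrays.asList("9/30/2022", "6/30/2022", "3/31/2022", "12/31/2021"),
                Arrays.asList("25,484,000", "23,785,000", "30,830,000", "41,935,000"));
        System.out.println(yearlyHash);
        System.out.println(quarterlyHash);
    }
}
```

The values stay as strings here because the page shows them with thousands separators; parsing them into numbers would be a separate step.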

My existing code is broken. I had used Jsoup to get all the JavaScript-related tags and a pattern matcher to pull out the String values I wanted. However, some values in those tags on the current page now appear to be encrypted strings that can no longer be parsed.

As you can imagine, my use case is not that complex. I just need the dates and the 4 values corresponding to that one row. Even a non-standard, non-optimized solution is fine with me.

Thank you.

I have never used this API myself, but the data you see on a page is often also exposed through a REST API. You can then consume the data in a standard way by parsing the JSON instead of extracting HTML from the page. cryptocointracker.com/yahoo-finance/yahoo-finance-api
– Radu M, Commented Jan 26, 2023 at 10:13
For example, this one would give you the financial data for AAPL, which contains the EBIT. query1.finance.yahoo.com/v11/finance/quoteSummary/…
– Radu M, Commented Jan 26, 2023 at 10:15
I only see EBITDA there, and only for one year. I was looking for four years of EBIT data. Are you sure you sent me the right link?
– Zac1, Commented Jan 28, 2023 at 16:40
I just sent you an example as a hint. As I mentioned, I have never worked with the API myself, so I am not sure whether it has everything you want. It's your job to check whether it has an endpoint that gives you exactly this data, or whether you need to calculate it yourself from other data the API provides. Scraping HTML is usually unnecessary if an API exposes the data as JSON. But if it doesn't, HTML scraping is certainly a way.
– Radu M, Commented Jan 28, 2023 at 19:21

user11847513 · Accepted Answer · 2023-01-27 16:39:16Z

0

+50

I guess you can use a regular expression to match the div tags.

Change your regular expression to match the span element and extract the text inside it.

For example:

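// "div.Ta(c) span" does not parse in Jsoup because of the parentheses,
// so select on the data-test attribute instead.
Elements spans = doc1.select("div[data-test=fin-col] > span");
for (Element span : spans) {
    String data = span.text();
    Log.d("Matched: ", data);
}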
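The regular-expression route can also be sketched with `java.util.regex` alone. Three details in the question's pattern matter here: it has no capture group, so `matcher.group(1)` would throw even on a match; it is anchored on `<div`, while Jsoup's `div.html()` returns only the inner HTML; and the full class string differs between alternating columns. The sketch below assumes the markup shown in the question and keys on `data-test="fin-col"` instead. The class name `FinColRegex` is hypothetical, and the sample HTML in `main` abbreviates the class attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinColRegex {
    // Matches any div carrying data-test="fin-col" and captures the span text.
    // The class attribute is deliberately not matched because it varies by column.
    private static final Pattern FIN_COL = Pattern.compile(
            "<div[^>]*data-test=\"fin-col\"[^>]*><span>([^<]*)</span></div>");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = FIN_COL.matcher(html);
        while (matcher.find()) {
            // group(1) is available because the pattern has one pair of parentheses
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Abbreviated sample markup, modelled on the divs in the question.
        String html = "<div class=\"Ta(c) Py(6px) D(tbc)\" data-test=\"fin-col\"><span>122,034,000</span></div>"
                + "<div class=\"Ta(c) Py(6px) Bgc($lv1BgColor) D(tbc)\" data-test=\"fin-col\"><span>111,852,000</span></div>";
        System.out.println(extract(html));
    }
}
```

With Jsoup, the string to feed into `extract` would be the outer HTML of the row (or the whole document), not `div.html()`. Note that this collects every fin-col cell in the input, so narrowing the input to the EBIT row is still up to the caller.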

You might also select the divs first and then keep only those that contain a span element.

Elements divs = doc1.select("div[data-test=fin-col]");
for (Element div : divs) {
    // selectFirst returns null when the div has no span descendant
    Element span = div.selectFirst("span");
    if (span != null) {
        Log.d("Matched: ", span.text());
    }
}

Using CSS selectors is also possible.

answered Jan 27, 2023 at 16:39

user11847513


4 Comments

Zac1 Over a year ago

Sorry, where in my function (in the question) should I put this? I have a for loop that get div tags...

Zac1 Over a year ago

Also, it just throws an error: "org.jsoup.select.Selector$SelectorParseException: Could not parse query 'div.Ta(c) span': unexpected token at '(c) span'"

user11847513 Over a year ago

You can place the beautifulsoup.select_one() method within the for loop, after the divs=soup.find_all("div", class_="class-name") line.

user11847513 Over a year ago

like this: for div in divs: div_class = beautifulsoup.select_one(div, "div.class-name")

John Williams · Accepted Answer · 2023-01-26 14:19:22Z

0

Annoyingly, the annual data is on the page as loaded, while the quarterly data is loaded with an AJAX call triggered by clicking on the "Quarterly" button. Anyway, the following code will do the job:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.text.NumberFormat;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.gson.Gson;

public class App {
    private static final String PAGE_URL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2";
    private static final String DATA_URL = "https://query1.finance.yahoo.com/ws/fundamentals-timeseries/v1/finance/timeseries/AAPL?lang=en-US&region=US&symbol=AAPL&padTimeSeries=true&type=quarterlyEBIT&merge=false&period1=493590046&period2=1674660504&corsDomain=finance.yahoo.com";

    private static final String REGEX_YAHOO_PAGE_EBIT = "^.*ttm</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?EBIT</span></div><div.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*$";
    private static final Pattern PATTERN_YAHOO_PAGE_REGEX = Pattern.compile(REGEX_YAHOO_PAGE_EBIT, Pattern.DOTALL);

    private static final Gson GSON = new Gson();

    private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance(new Locale("en", "US"));

    public static void main(String[] args) throws IOException {
        String pageContent = fetch(PAGE_URL);
        Matcher m = PATTERN_YAHOO_PAGE_REGEX.matcher(pageContent);
        if (m.matches()) {
            System.out.println("Annual values");

            System.out.println(m.group(1) + ": " + m.group(6));
            System.out.println(m.group(2) + ": " + m.group(7));
            System.out.println(m.group(3) + ": " + m.group(8));
            System.out.println(m.group(4) + ": " + m.group(9));
        }

        // the quarterly data is not on the page. it is rendered dynamically from this
        // AJAX call
        String quarterlyData = fetch(DATA_URL);
        System.out.println("Quarterly values");
        Map map = GSON.fromJson(quarterlyData, Map.class);
        List<Map> result = (List<Map>) ((Map) map.get("timeseries")).get("result");
        for (Map entry : result) {
            Map meta = (Map) entry.get("meta");
            if (((List<String>) meta.get("type")).get(0).equals("quarterlyEBIT")) {
                List<Map<String, Object>> quarterlyEBIT = (List) entry.get("quarterlyEBIT");
                for (Map<String, Object> cell : quarterlyEBIT) {
                    System.out.print(cell.get("asOfDate") + ": ");
                    String fullNumberString = NUMBER_FORMAT
                            .format(((Map<String, Double>) cell.get("reportedValue")).get("raw"));
                    System.out.println(fullNumberString.substring(0, fullNumberString.length() - 4));

                }

            }
        }

    }

    private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
        URL pageUrl = new URL(url);
        HttpURLConnection pageConnection = (HttpURLConnection) pageUrl.openConnection();
        try {
            InputStream inputStream = new BufferedInputStream(pageConnection.getInputStream());
            int bufferSize = 1024;
            char[] buffer = new char[bufferSize];
            StringBuilder out = new StringBuilder();
            Reader in = new InputStreamReader(inputStream, "UTF-8");
            for (int numRead; (numRead = in.read(buffer, 0, buffer.length)) > 0;) {
                out.append(buffer, 0, numRead);
            }
            return out.toString();
        } finally {
            pageConnection.disconnect();
        }
    }
}

Output:

Annual values
9/30/2022: 122,034,000
9/30/2021: 111,852,000
9/30/2020: 69,964,000
9/30/2019: 69,313,000
Quarterly values
2021-12-31: 41,935,000
2022-03-31: 30,830,000
2022-06-30: 23,785,000
2022-09-30: 25,484,000
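The quarterly lines above use ISO dates and rely on cutting the last four characters off the formatted number. As a hedged alternative, the sketch below divides the raw value by 1,000 before formatting and converts the date to the M/d/yyyy form used in the question. The class and method names are hypothetical. It assumes `java.time` is available, which on Android means API level 26 or later, or core library desugaring.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class QuarterlyFormat {
    private static final DateTimeFormatter PAGE_DATE =
            DateTimeFormatter.ofPattern("M/d/yyyy", Locale.US);

    // Converts a raw amount such as 4.1935E10 to the page's "in thousands" form.
    static String toThousands(double raw) {
        NumberFormat format = NumberFormat.getIntegerInstance(Locale.US);
        return format.format(Math.round(raw / 1000.0));
    }

    // Converts an ISO date such as 2021-12-31 to 12/31/2021.
    static String toPageDate(String isoDate) {
        return LocalDate.parse(isoDate).format(PAGE_DATE);
    }

    public static void main(String[] args) {
        // Sample pair taken from the quarterly JSON returned by DATA_URL.
        System.out.println(toPageDate("2021-12-31") + ": " + toThousands(4.1935E10));
    }
}
```

Dividing rather than trimming characters rounds to the nearest thousand instead of assuming the formatted number always ends in ",000".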

If you prefer Apache HttpClient (v4 here) then fetch() can be coded as follows:

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

    private static String fetch(String url) throws IOException {
        // try-with-resources closes both the client and the response,
        // including when execute() or toString() throws
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
                CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            HttpEntity entity = response.getEntity();
            // fall back to UTF-8 when the response does not declare a charset
            return EntityUtils.toString(entity, "UTF-8");
        }
    }
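DATA_URL in the main listing hard-codes the symbol, the series type, and the period1/period2 bounds, which appear to be Unix timestamps in seconds. A small sketch of assembling the same URL from parameters follows. The class `TimeseriesUrl` and method `build` are hypothetical, the query parameters are copied from DATA_URL as-is, and whether the endpoint accepts other symbols or types has not been verified here.

```java
class TimeseriesUrl {
    private static final String BASE =
            "https://query1.finance.yahoo.com/ws/fundamentals-timeseries/v1/finance/timeseries/";

    // Rebuilds the query string used by DATA_URL. Assumes symbol and type are
    // plain tokens that need no URL encoding (true for AAPL and quarterlyEBIT).
    static String build(String symbol, String type, long period1, long period2) {
        return BASE + symbol
                + "?lang=en-US&region=US"
                + "&symbol=" + symbol
                + "&padTimeSeries=true"
                + "&type=" + type
                + "&merge=false"
                + "&period1=" + period1
                + "&period2=" + period2
                + "&corsDomain=finance.yahoo.com";
    }

    public static void main(String[] args) {
        // Use the current time as the upper bound instead of a fixed timestamp.
        long nowSeconds = System.currentTimeMillis() / 1000L;
        System.out.println(build("AAPL", "quarterlyEBIT", 493590046L, nowSeconds));
    }
}
```

Passing the current time as period2 avoids the request window ending at the date the answer was written.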

edited Jan 26, 2023 at 14:19

answered Jan 25, 2023 at 19:06

John Williams


25 Comments

Zac1 Over a year ago

Thank you. I am a total beginner. So, is there a way you can help rewrite this to use Java for Android, esp. the fetch method? The fetch method in your code doesn't directly translate to Android. I'd like to know what the output of the out.toString() string looks like that's used for parsing.

John Williams Over a year ago

I wrote it specifically to work with Android. Which classes do you not have access to? java.net? java.util.regex? com.google.gson.Gson?

John Williams Over a year ago

fetch() returns the body of the HTTP GET just like a call from the browser, eg DATA_URL returns {"timeseries":{"result":[{"meta":{"symbol":["AAPL"],"type":["quarterlyEBIT"]},"timestamp":[1640908800,1648684800,1656547200,1664496000],"quarterlyEBIT":[{"dataId":20189,"asOfDate":"2021-12-31","periodType":"3M","currencyCode":"USD","reportedValue":{"raw":4.1935E10,"fmt":"41.94B"}},{"dataId":20189,"asOfDate":"2022-03-31","periodType":"3M","currencyCode":"USD","reportedValue":{"raw":3.083E10,"fmt":"30.83B"}},{"dataId":20189,"asOfDate":"2022-06-30","periodType":"3M","currencyCode":"USD",… TRUNCATED

John Williams Over a year ago

I can rewrite fetch() to use Apache Http Client but that is very old school. See hc.apache.org/httpcomponents-client-4.5.x/android.html

John Williams Over a year ago

Which version of Java are you using? 1.8^ ?
