Here's my current code:
private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
    String userAgent1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
    try {
        Document doc1 = Jsoup.connect(url).userAgent(userAgent1).get();
        Elements divTags = doc1.getElementsByTag("div");
        String re = "^<div class=\\\"Ta\\(c\\) Py\\(6px\\) Bxz\\(bb\\) BdB Bdc\\(\\$seperatorColor\\) Miw\\(120px\\) Miw\\(100px\\)\\-\\-pnclg D\\(tbc\\)\\\" data-test=\\\"fin-col\\\"><span>.*</span></div>$";
        for (Element div : divTags) {
            Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
            Matcher matcher = pattern.matcher(div.html());
            if (matcher.find()) {
                String data = matcher.group(1);
                Log.d("Matched: ", data);
            }
            else {
                Log.d("Nothing Matched: ", "");
            }
        }
    } catch (Exception e) {
        Log.e("err-new", "err", e);
    }
    return "";
}
This function takes a URL as input (in our case https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2) and extracts all the div tags using jsoup.
I then need to extract the values described below using pattern matching. However, with the code above, all I ever get in the log is "Nothing Matched: ".
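To narrow down why nothing matches, here is a minimal sketch that uses only java.util.regex and hard-coded strings instead of the live page. It rests on two assumptions: that jsoup's Element.html() returns only the inner HTML of an element (here just the <span>), so a pattern anchored on ^<div cannot match it, and that group(1) is only defined when the pattern has a capturing group. The class name InnerHtmlCheck, the simplified patterns and the sample strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InnerHtmlCheck {
    // Simplified stand-in for the pattern above: the text must start with "<div".
    static final Pattern ANCHORED =
            Pattern.compile("^<div[^>]*><span>.*</span></div>$", Pattern.DOTALL);

    // Pattern with a capturing group around the span text, so group(1) is defined.
    static final Pattern SPAN_VALUE = Pattern.compile("<span>([^<]*)</span>");

    static boolean matchesAnchored(String html) {
        return ANCHORED.matcher(html).find();
    }

    /** Returns the captured span text, or null when nothing matches. */
    static String spanValue(String html) {
        Matcher m = SPAN_VALUE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical inner HTML (what html() is assumed to return) and outer HTML of one cell.
        String inner = "<span>122,034,000</span>";
        String outer = "<div data-test=\"fin-col\">" + inner + "</div>";

        System.out.println("anchored vs inner: " + matchesAnchored(inner));
        System.out.println("anchored vs outer: " + matchesAnchored(outer));
        System.out.println("span value: " + spanValue(inner));
    }
}
```

If that assumption about html() holds, matching against the outer HTML, or using a pattern that only looks at the <span>, may be closer to what I need.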
Here's the web page I want to read. I need the four numeric values in the row named EBIT (Earnings Before Interest and Taxes), one for each of the first four yearly columns.
Link: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2
Input:
I'm looking to get the values 122,034,000, 111,852,000, 69,964,000 and 69,313,000 on the EBIT row for the columns 9/30/2022, 9/30/2021, 9/30/2020 and 9/30/2019.
In the browser inspector, these values sit inside the following <div> tags:
EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>122,034,000</span></div>
EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>111,852,000</span></div>
EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>69,964,000</span></div>
EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>69,313,000</span></div>
I need the same thing for the four columns under the Quarterly tab on the same web page: the values 25,484,000, 23,785,000, 30,830,000 and 41,935,000 on the EBIT row for the columns 9/30/2022, 6/30/2022, 3/31/2022 and 12/31/2021.
EBIT 1: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>25,484,000</span></div>
EBIT 2: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>23,785,000</span></div>
EBIT 3: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>30,830,000</span></div>
EBIT 4: <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>41,935,000</span></div>
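Note that the class list differs between the odd and even columns above, so a regex that spells out one exact class string could match at most half of them. A minimal sketch of a looser approach, keyed only on data-test="fin-col" and run against the sample markup as plain strings, might look like this. The class name FinColExtractor is hypothetical, and it assumes every cell holds a plain comma-grouped integer; a placeholder such as "-" would make Long.parseLong throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FinColExtractor {
    // Matches <div ... data-test="fin-col" ...><span>VALUE</span></div>.
    // The class attribute is deliberately ignored because it varies per column.
    private static final Pattern FIN_COL = Pattern.compile(
            "<div\\b[^>]*\\bdata-test=\"fin-col\"[^>]*>\\s*<span>([^<]*)</span>\\s*</div>");

    /** Returns every fin-col value found in the given HTML, in document order. */
    static List<Long> extract(String html) {
        List<Long> values = new ArrayList<>();
        Matcher m = FIN_COL.matcher(html);
        while (m.find()) {
            // Strip the thousands separators before parsing, e.g. "122,034,000" -> 122034000.
            values.add(Long.parseLong(m.group(1).replace(",", "").trim()));
        }
        return values;
    }

    public static void main(String[] args) {
        // Two of the sample cells from above, concatenated as they might appear in a row.
        String row =
                "<div class=\"Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)\" data-test=\"fin-col\"><span>122,034,000</span></div>"
              + "<div class=\"Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)\" data-test=\"fin-col\"><span>111,852,000</span></div>";
        System.out.println(extract(row));
    }
}
```

This only shows the matching step. It would pick up every fin-col cell in whatever HTML it is given, so restricting it to the EBIT row is still an open part of my question.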
Output:
dates = {9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019}
datesQ = {9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021}
EBIT = {122,034,000, 111,852,000, 69,964,000, 69,313,000}
EBITQ = {25,484,000, 23,785,000, 30,830,000, 41,935,000}
Where Q stands for Quarterly.
Alternatively, it could be two hashmaps:
yearlyHash = {date1: value1, date2: value2, date3: value3, date4: value4}
quarterlyHash = {date1: value1, date2: value2, date3: value3, date4: value4}
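To make the hashmap variant concrete, here is a minimal sketch of the shape I have in mind, with the values hard-coded rather than scraped. The class name EbitTable and the helper zip are hypothetical. It uses a LinkedHashMap so the dates keep their column order, which a plain HashMap does not guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EbitTable {
    /** Pairs each date with the value at the same index, preserving column order. */
    static Map<String, Long> zip(String[] dates, long[] values) {
        if (dates.length != values.length) {
            throw new IllegalArgumentException("dates and values must have the same length");
        }
        Map<String, Long> out = new LinkedHashMap<>();
        for (int i = 0; i < dates.length; i++) {
            out.put(dates[i], values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Values copied from the EBIT row described above.
        Map<String, Long> yearlyHash = zip(
                new String[] {"9/30/2022", "9/30/2021", "9/30/2020", "9/30/2019"},
                new long[] {122_034_000L, 111_852_000L, 69_964_000L, 69_313_000L});
        Map<String, Long> quarterlyHash = zip(
                new String[] {"9/30/2022", "6/30/2022", "3/31/2022", "12/31/2021"},
                new long[] {25_484_000L, 23_785_000L, 30_830_000L, 41_935_000L});

        System.out.println(yearlyHash);
        System.out.println(quarterlyHash);
    }
}
```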
My existing code is broken. Basically, I used jsoup to get all the JavaScript-related tags and a pattern matcher to pull out the String values I wanted. However, on the page I'm parsing now, some values in those tags appear to be encrypted strings that can no longer be parsed.
My use case is not that complex, as you can imagine. I just need the dates and the four values corresponding to that one row. Even a non-standard, non-optimized solution is fine with me.
Thank you.
EBITDA there and that too for one year. I was looking for four years' data along with EBIT data for those four years. Are you sure you sent me the right link?