I am trying to scrape the following HTML.
There are multiple divs where class="review-card".
Each of these divs always contains a script element where data-initial-state="data-always-exist" and sometimes contains a script element where data-initial-state="data-may-not-exist".
I would like to retrieve the data from both of these script elements. When the second one does not exist, I want to return a specific value, e.g. 0.
As you can see in my code below, I have managed to find the "review-card" div elements. However, I fail to retrieve the script elements that live inside each div element. My code always returns a list rather than a single element. What am I doing wrong?
<html>
<body>
<main>
<div class="review-list">
<div class="review-card">
<article class="review">
<script type="application.json" data-initial-state="data-always-exist">
{"reviewBody":"Brilliant value","stars":5}
</script>
<section class="review__content">
<div class="content">
<script type="application.json" data-initial-state="data-may-not-exist">
{"isVerified":true,"verificationSource":"invitation"}
</script>
</div>
</section>
</article>
</div>
<div class="review-card">
<article class="review">
<script type="application.json" data-initial-state="data-always-exist">
{"reviewBody":"Brilliant value","stars":5}
</script>
</article>
</div>
<div class="review-card">
<article class="review">
<script type="application.json" data-initial-state="data-always-exist">
{"reviewBody":"Great","stars":4}
</script>
<section class="review__content">
<div class="content">
<script type="application.json" data-initial-state="data-may-not-exist">
{"isVerified":false,"verificationSource":"invitation"}
</script>
</div>
</section>
</article>
</div>
</div>
</main>
</body>
</html>
I have tried the following:
from lxml import html
import requests

page = requests.get('http://somewebsite.com')
tree = html.fromstring(page.content)

# finds the review list
review_list = tree.xpath('//div[@class="review-list"]')

# finds all the review cards
review_cards = review_list[0].xpath('//div[contains(@class,"review-card")]')

for card in review_cards:
    # this part of the code does not work as intended: it returns a list rather than a single item
    data_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-always-exist')]")
    data_not_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-may-not-exist')]")
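A minimal sketch of one possible fix, assuming the cause is the leading `//` in the per-card queries. In XPath, an expression that starts with `//` is evaluated from the document root even when it is called on a child element, so each call matches the scripts of every card. Prefixing the path with `.` makes it relative to the current card. Note that `xpath()` still returns a list, so the sketch takes the first match or falls back to a default. The `SAMPLE` string (a trimmed copy of the markup above), the `first_json` helper, and the default of 0 are illustrative names, not part of the original code.

```python
import json
from lxml import html

# Trimmed copy of the sample markup from the question (two cards only).
SAMPLE = """
<div class="review-list">
  <div class="review-card">
    <article class="review">
      <script type="application.json" data-initial-state="data-always-exist">
        {"reviewBody":"Brilliant value","stars":5}
      </script>
      <section class="review__content">
        <div class="content">
          <script type="application.json" data-initial-state="data-may-not-exist">
            {"isVerified":true,"verificationSource":"invitation"}
          </script>
        </div>
      </section>
    </article>
  </div>
  <div class="review-card">
    <article class="review">
      <script type="application.json" data-initial-state="data-always-exist">
        {"reviewBody":"Brilliant value","stars":5}
      </script>
    </article>
  </div>
</div>
"""


def first_json(card, state, default=0):
    """Hypothetical helper: parse the first matching script inside `card`, or return `default`."""
    # The leading "." keeps the search inside `card`; a bare "//" would start at the document root.
    matches = card.xpath('.//script[@data-initial-state=$state]', state=state)
    if not matches:
        return default
    return json.loads(matches[0].text)


tree = html.fromstring(SAMPLE)
cards = tree.xpath('.//div[contains(@class, "review-card")]')

# One (always, maybe) pair per card; the second item is 0 when the script is absent.
rows = [
    (first_json(card, "data-always-exist"), first_json(card, "data-may-not-exist"))
    for card in cards
]
```

With the real page, `tree` would come from `html.fromstring(page.content)` as in the code above, and the same `.//` prefix would apply to the query that collects the review cards from `review_list[0]`.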
Alternatively, would this be easier to do with BeautifulSoup?