How to extract element from HTML code in Python

Question

I'm trying to scrape several web pages that share a similar HTML structure. I can already fetch the HTML of each page, and I can manually locate the part of the string that holds the information I need; I just don't know how to extract it properly. I suspect a regular expression could solve this, but I don't know how to write one.

I'm using Python 3.

This is how I extract the page's HTML code:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://statusinvest.com.br/fundos-imobiliarios/knri11", headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.content, features="html.parser")

Below is the HTML as a string (obtained with str(soup)). I want to extract the list between the two pink brackets. This block always comes right after the line enclosed in blue parentheses (the text in green differs from page to page).

[Image: part of the page's HTML code I want to extract]

Andrej Kesely · Accepted Answer · 2022-11-28 00:19:58Z

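You don't need a regex here. The list is stored as JSON in the value attribute of the element with id="results", so you can use BeautifulSoup to find that tag and the json module to parse its value: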

import json
import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://statusinvest.com.br/fundos-imobiliarios/knri11",
    headers={"User-Agent": "Mozilla/5.0"},
)
soup = BeautifulSoup(resp.content, "html.parser")

data = json.loads(soup.select_one("#results")["value"])

print(data)

Prints:

[
    {
        "y": 0,
        "m": 0,
        "d": 0,
        "ad": None,
        "ed": "31/10/2022",
        "pd": "16/11/2022",
        "et": "Rendimento",
        "etd": "Rendimento",
        "v": 0.91,
        "ov": None,
        "sv": "0,91000000",
        "sov": "-",
        "adj": False,
    },
    {
        "y": 0,
        "m": 0,
        "d": 0,
        "ad": None,
        "ed": "30/09/2022",
        "pd": "17/10/2022",
        "et": "Rendimento",
        "etd": "Rendimento",
        "v": 0.91,
        "ov": None,
        "sv": "0,91000000",
        "sov": "-",
        "adj": False,
    },


...and so on.
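
To try the same technique without hitting the site, here is a minimal offline sketch. The HTML snippet is a hypothetical, trimmed-down stand-in for the real page (only the id "results" and the field names come from the output above), and extract_results is a helper name made up for illustration. It also guards against the tag being absent, which is an assumption about how you might want to handle a changed layout or a blocked request.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: an <input> whose "value"
# attribute holds a JSON array. Field names are copied from the output above.
SAMPLE_HTML = """
<html><body>
  <input id="results" type="hidden"
         value='[{"ed": "31/10/2022", "pd": "16/11/2022", "v": 0.91, "ad": null, "adj": false}]'>
</body></html>
"""


def extract_results(html, selector="#results"):
    """Return the JSON stored in the selected tag's value attribute, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    if tag is None or not tag.has_attr("value"):
        return None  # tag not found: layout changed or the request was blocked
    return json.loads(tag["value"])


data = extract_results(SAMPLE_HTML)
print(data)
print(extract_results("<html></html>"))  # no matching tag, so this is None
```

Returning None instead of raising is only one option; raising a descriptive exception may suit a batch scraper better, since it makes a failed page harder to overlook.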

Leo · Accepted Answer · 2022-11-28 00:43:38Z

import json
import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://statusinvest.com.br/fundos-imobiliarios/knri11",
    headers={"User-Agent": "Mozilla/5.0"},
)

soup = BeautifulSoup(resp.content, features="html.parser")
# The list is stored as JSON in the "value" attribute of <input id="results">.
data = json.loads(soup.find("input", {"id": "results"}).get("value"))
print(data)

To get a field from the first record (here "y"):

print(data[0]["y"])
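
Once the list is parsed, the string fields can be converted to native types. The sketch below uses two records copied from the printed output in the other answer. It assumes that "ed" and "pd" are dates in DD/MM/YYYY form and that "sv" is a decimal-comma number string; the page does not document these fields, so treat both the interpretation and the helper name parse_record as assumptions.

```python
from datetime import datetime
from decimal import Decimal

# Two records copied from the printed output above (only some fields kept).
data = [
    {"ed": "31/10/2022", "pd": "16/11/2022", "v": 0.91, "sv": "0,91000000"},
    {"ed": "30/09/2022", "pd": "17/10/2022", "v": 0.91, "sv": "0,91000000"},
]


def parse_record(record):
    """Convert the string fields of one record to date and Decimal objects."""
    # Assumption: "." is a thousands separator and "," the decimal mark.
    number = record["sv"].replace(".", "").replace(",", ".")
    return {
        "ed": datetime.strptime(record["ed"], "%d/%m/%Y").date(),
        "pd": datetime.strptime(record["pd"], "%d/%m/%Y").date(),
        "value": Decimal(number),
    }


rows = [parse_record(r) for r in data]
print(rows[0]["pd"].isoformat())
print(rows[0]["value"])
```

Decimal is used rather than float so that the digits in "sv" are preserved exactly; if approximate values are enough, the numeric "v" field can be used directly.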

edited Nov 28, 2022 at 0:43

answered Nov 28, 2022 at 0:37
