1

Let's say I have the following website:

https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod

When you open this page, it displays a lot of information. In my case, I just want the temperature from the Culture Conditions section.

When you scroll down the page, you will see a section called "Culture Conditions":

Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C

Using the requests library, I'm able to get the HTML of the page. When I save the HTML and search through it for my data, I find it towards the bottom, in this form:

                                    Culture Conditions

                                </th>

    <td>



                                            <div><strong>Atmosphere: </strong>air, 95%; carbon dioxide (CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37&deg;C</div>

I'm not sure what to do after this. I looked into using BeautifulSoup to parse the HTML, but I was not successful.

This is all the code that I have so far:

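import requests

url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'

# Fetch the page; page.text is already a str, so no str() call is needed
page = requests.get(url)
textPage = page.text

# Save the raw HTML so it can be inspected by hand
with open('test2', 'w') as file:
    file.write(textPage)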

3 Answers

2
import requests
from bs4 import BeautifulSoup

url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

cc = soup.select('#layoutcontent_2_middlecontent_0_productdetailcontent_0_maincontent_2_rptTabContent_rptFields_2_fieldRow_3 td div')

for c in cc:
    print(c.text.strip())

Output:

Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C

To just get the temperature:

cc = soup.select('#layoutcontent_2_middlecontent_0_productdetailcontent_0_maincontent_2_rptTabContent_rptFields_2_fieldRow_3 td div')[-1]
cc = cc.text.split(':')[-1].strip()
print(cc)

Output:

37°C
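
The long id in the selector above looks auto-generated, so it may stop matching if the page layout changes. As a hedged alternative, here is a minimal sketch that finds the row by its "Culture Conditions" header text instead. It assumes the real page matches the snippet in the question (a th followed by a td holding one div per field); SAMPLE_HTML and get_culture_temperature are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question. This is an
# assumption about the real page, which may be structured differently.
SAMPLE_HTML = """
<table><tr>
<th>
    Culture Conditions
</th>
<td>
    <div><strong>Atmosphere: </strong>air, 95%; carbon dioxide (CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37&deg;C</div>
</td>
</tr></table>
"""


def get_culture_temperature(html):
    """Return the temperature text from the Culture Conditions row, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.find_all("th"):
        # Match the header cell by its visible text rather than by id
        if th.get_text(strip=True) != "Culture Conditions":
            continue
        td = th.find_next_sibling("td")
        if td is None:
            return None
        for div in td.find_all("div"):
            text = div.get_text(strip=True)
            if text.startswith("Temperature:"):
                # Keep only the part after the first colon
                return text.split(":", 1)[1].strip()
    return None


print(get_culture_temperature(SAMPLE_HTML))
```

To run it against the live page, pass r.text from the code above instead of SAMPLE_HTML.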

1

I used a regular expression that matches from <div><strong>Atmosphere: to the end of that line. Then I removed every unwanted tag from the result. Et voilà! (This reuses textPage from the code in the question.)

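import re

# textPage is the HTML string fetched by the code in the question
textPage = re.search(r"<div><strong>Atmosphere: .*", textPage).group(0)
# The unwanted tags are literal strings, so plain str.replace is enough
wrongString = ['<div>', '</div>', '<strong>', '</strong>', '<sub>', '</sub>']
for ws in wrongString:
    textPage = textPage.replace(ws, "")
with open('test2', 'w') as file:
    file.write(textPage)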
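
If only the temperature is needed, the same regex idea can be narrowed to capture just the text after the Temperature label. This is a rough sketch, assuming the markup looks like the line quoted in the question; SAMPLE and extract_temperature are hypothetical names. It also uses html.unescape from the standard library to turn &deg; into the degree sign.

```python
import html
import re

# One line of markup copied from the question (assumed to match the real page)
SAMPLE = (
    '<div><strong>Atmosphere: </strong>air, 95%; carbon dioxide '
    '(CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37&deg;C</div>'
)


def extract_temperature(page_text):
    """Return the text after the Temperature label, or None if not found."""
    # Capture everything after the label up to the next tag
    match = re.search(r"<strong>Temperature:\s*</strong>([^<]*)", page_text)
    if match is None:
        return None
    # Convert HTML entities such as &deg; into real characters
    return html.unescape(match.group(1)).strip()


print(extract_temperature(SAMPLE))
```

Regular expressions on HTML are fragile, so this will likely break if the site changes its markup even slightly.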

0

Another approach you may find useful:

import requests
from bs4 import BeautifulSoup

url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'

page = requests.get(url)
soup = BeautifulSoup(page.text,"lxml")
for items in soup.find_all("strong"):
    if "Atmosphere:" in items.text:
        atmos = items.find_parent().text
        temp = items.find_parent().find_next_sibling().text
        print(f'{atmos}\n{temp}')

Output:

Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C
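
A possible variation, sketched under the assumption that each field is a div whose strong tag holds a label ending in a colon: collect every label and value into a dict, so the temperature can be looked up by name. SAMPLE_HTML and parse_labelled_fields are hypothetical names, and on the real page the dict would probably contain other labelled fields as well.

```python
from bs4 import BeautifulSoup

# Markup modelled on the snippet in the question (an assumption)
SAMPLE_HTML = (
    '<td><div><strong>Atmosphere: </strong>air, 95%; carbon dioxide '
    '(CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37&deg;C</div></td>'
)


def parse_labelled_fields(html):
    """Map each 'Label:' found in a <strong> tag to the text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for strong in soup.find_all("strong"):
        label = strong.get_text(strip=True)
        if not label.endswith(":"):
            continue  # not a "Label:" style tag
        # The parent's text is "Label: value"; keep what follows the first colon
        parent_text = strong.parent.get_text()
        fields[label.rstrip(":")] = parent_text.split(":", 1)[1].strip()
    return fields


fields = parse_labelled_fields(SAMPLE_HTML)
print(fields.get("Temperature"))
```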
