How to read text off a website using Python (Simple explanation)

Question

I'm looking to make a program that can get the text off a website when given the website's URL. I would like to be able to get all text between the <p> tags. Everywhere I have looked online seems to overcomplicate this, and it involves some coding in C, which I am not well versed in. To summarize, this is what I would like the code to look like (best case scenario). If there's anything unclear in the question that I can clarify, please let me know in the comments.

import WebReader as WR

StringOfWebText = WR.getParagrahText("WebsiteURL")

You might want to look at a scraper/crawler like BeautifulSoup.
– Kristian, Commented Mar 24, 2022 at 6:18
This shouldn't be too complicated, actually: two options are bs4 (crummy.com/software/BeautifulSoup/bs4/doc) and selenium (selenium-python.readthedocs.io). If you have a specific programming problem, you could edit your post to reflect that.
– nathan liang, Commented Mar 24, 2022 at 6:19
Beautiful Soup looks like the way to go once I have a website's HTML, but how can I get the website's HTML into Python using the URL?
– MaxVK, Commented Mar 24, 2022 at 6:23
One thing of note: if you want to get data from a website that loads its content with JavaScript, you will need something that lets the JavaScript run, like selenium.
– SPYBUG96, Commented Mar 24, 2022 at 6:28

Mahdi · Accepted Answer · 2024-03-30 17:17:17Z

5

You probably want to look into something like BeautifulSoup paired with requests. You can then extract text from a page with a simple solution like this:

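import requests
from bs4 import BeautifulSoup

# Download the page; r.text holds the response body (the HTML) as a string.
r = requests.get("https://google.com")
r.raise_for_status()  # stop here if the server returned an error status

# Parse the HTML and print all of the page's text with the tags stripped.
soup = BeautifulSoup(r.text, "html.parser")
print(soup.text)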

There's also tag-searching and other useful features built into BS4, if you need to be able to handle that.
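As a rough sketch of what that tag-searching could look like for the question's use case, the snippet below collects only the text inside `<p>` tags using `find_all`. The function names `paragraph_text` and `get_paragraph_text` are made up for illustration (they are not part of requests or BS4), and the 10-second timeout is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def paragraph_text(html):
    """Return the text of every <p> element in an HTML string, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all("p") returns every <p> tag, however deeply it is nested;
    # get_text() drops any markup inside the paragraph.
    return [p.get_text().strip() for p in soup.find_all("p")]


def get_paragraph_text(url):
    """Hypothetical helper: download a page and return its paragraph text as one string."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return "\n".join(paragraph_text(response.text))


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>First <b>bold</b> paragraph.</p><div><p>Second.</p></div>"
    print(paragraph_text(sample))
```

Calling `get_paragraph_text("WebsiteURL")` with a real URL would then come close to the one-liner asked for in the question. Pages that build their content with JavaScript will likely return little or no paragraph text this way.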

edited Mar 30, 2024 at 17:17 by Mahdi

answered Mar 24, 2022 at 6:24 by Grace


3 Comments

MaxVK Over a year ago

Thanks so much. This is the answer I have been looking for for ages. Perfect and concise.

Grace Over a year ago

@MaxVK No problem ^-^ glad it was helpful. One thing to note about this approach: the data you're looking for could be far down in the website's tree, in which case you may have to switch to searching by tag. That isn't difficult with bs4, though; you just pass the name of the tag into .find().

KayO Over a year ago

Just wanted to point out the obvious, that the last line should read print(soup.text), not print(s.text)
