I'm looking to make a program that can get the text off a website when given the website's URL. I would like to be able to get all the text between the <p> tags. Everything I have found online seems to overcomplicate this and involves some coding in C, which I am not well versed in. If anything in the question is unclear, please let me know in the comments and I will clarify. In the best case, I would like the code to look something like this:

import WebReader as WR

StringOfWebText = WR.getParagrahText("WebsiteURL")

  • You might want to look at a scraper/crawler like BeautifulSoup. Commented Mar 24, 2022 at 6:18
  • This shouldn't be too complicated, actually: two options are bs4 (crummy.com/software/BeautifulSoup/bs4/doc) and selenium (selenium-python.readthedocs.io). If you have a specific programming problem, you could edit your post to reflect that. Commented Mar 24, 2022 at 6:19
  • BeautifulSoup looks like the way to go once I have a website's HTML, but how can I get the website's HTML into Python using the URL? Commented Mar 24, 2022 at 6:23
  • One thing to note: if you want to get data from a website that loads its content with JavaScript, you will need something that lets the JavaScript run, like selenium. Commented Mar 24, 2022 at 6:28
  • Thanks. For now I don't need that, but I will keep it in mind. Commented Mar 24, 2022 at 6:29

1 Answer

You probably want to look into something like BeautifulSoup paired with requests. You can then extract text from a page with a simple solution like this:

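import requests
from bs4 import BeautifulSoup

# Download the page's HTML
r = requests.get("https://google.com")
# Parse it with Python's built-in HTML parser
soup = BeautifulSoup(r.text, "html.parser")
# Print all of the text on the page, with the tags stripped
print(soup.text)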

There's also tag-searching and other useful features built into BS4, if you need to be able to handle that.
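As a rough sketch of what that tag-searching could look like for the <p> case in the question, the snippet below uses find_all("p") to collect only the paragraph text. The function names extract_paragraph_text and get_paragraph_text, the example URL, and the 10-second timeout are assumptions made for illustration; they are not part of bs4 or requests.

```python
import requests
from bs4 import BeautifulSoup


def extract_paragraph_text(html):
    """Return the text of every <p> element in an HTML string, one per line.

    Hypothetical helper name, chosen for this example only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all searches the whole tree, so nested <p> tags are found too
    paragraphs = soup.find_all("p")
    return "\n".join(p.get_text().strip() for p in paragraphs)


def get_paragraph_text(url):
    """Download a page and return the text of its <p> elements.

    Hypothetical helper name; the timeout value is an arbitrary choice.
    """
    response = requests.get(url, timeout=10)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return extract_paragraph_text(response.text)


if __name__ == "__main__":
    # Placeholder URL; replace it with the page you want to read
    print(get_paragraph_text("https://example.com"))
```

Keeping the parsing in its own function means it can be tried on a plain HTML string without any network access. As noted in the comments on the question, this only sees text present in the downloaded HTML, not text added later by JavaScript.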

3 Comments

Thanks so much. This is the answer I have been looking for ages. Perfect and concise
@MaxVK No problem ^-^ glad it was helpful. One thing to note about this approach: the data you are looking for could be far down in the website's tree, in which case you may have to switch to searching by tag. That isn't incredibly difficult with bs4; you just pass the name of the tag into .find().
Just wanted to point out the obvious, that the last line should read print(soup.text), not print(s.text)
