
Sorry if this is a repeat, but I've been looking through a lot of StackOverflow questions on this and can't find a similar situation. I might be barking up the wrong tree here, but I'm new to programming, so even if someone could set me on the right path it'd help out immensely.

I'm trying to scrape data from a website that can only be accessed from inside our network, using Python 3.7 and Beautiful Soup 4. My first question is: is this a best-practice way to do it for a novice programmer, or should I be looking into something like JavaScript instead of Python?

My second question is that the website's root HTML file has the following attribute on its html tag: xmlns="http://www.w3.org/1999/xhtml". Does BeautifulSoup 4 work with XHTML?

I'll admit that I know nothing about web development, so even if someone can give me a few keywords or tips to start researching to get me on a more productive path it'd be appreciated. Right now my biggest problem is that I don't know what I don't know, and all the Python web-scraping examples work on much simpler .html pages vs. this one, where the webpage's tree consists of multiple html/css/jpg and gif files.

Thanks, -Dane

  • I'm new to posting in the community so a reason for down-voting would be helpful to avoid me doing it again Commented Nov 11, 2018 at 23:12
  • Python is definitely the way to do it. A guy named Bucky Roberts used to have a website called thenewboston.com, but all I can find now is a youtube channel. Anyway, he had a series on building a web scraper using beautifulsoup. Another fellow to check out is sentdex, if I recall he also had a few tuts on using beautifulsoup. Commented Nov 11, 2018 at 23:36
  • Thanks for the useful info, I'm looking into those links now. I'm familiar with stackexchange so I expected the down-voting somewhat. I'm not as familiar with stackoverflow so I just wanted to make sure I wasn't unintentionally violating some community guidelines or anything and correct it if I was. Thanks again Commented Nov 11, 2018 at 23:44
  • Hey Dane, you also might want to look into curl. Just google it, lots of examples Commented Nov 12, 2018 at 0:52
  • I would say that Python / Requests is the right way to do it unless it's one of those new-fangled react/angular/vue websites in which case there are some good headless chrome projects to check out. Commented Nov 12, 2018 at 5:20

1 Answer


Python, requests and BeautifulSoup are definitely the way to go, especially for a beginner. BeautifulSoup works with HTML, XHTML, XML, and other variations of markup.
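As a minimal illustration of the XHTML point (the document content and title here are made up for the example), BeautifulSoup treats the xmlns attribute on the html tag as just another attribute, so the standard html.parser works fine:

```python
from bs4 import BeautifulSoup

# a tiny XHTML document, like the root file described in the question
xhtml = '''<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Local Intranet Page</title></head>
<body><p class="note">Hello</p></body>
</html>'''

soup = BeautifulSoup(xhtml, 'html.parser')
print(soup.title.string)                    # Local Intranet Page
print(soup.find('p', class_='note').text)   # Hello
```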

You will need to install Python and then install requests and bs4. Both are easy to do by reading the requests docs and the bs4 docs.

I would suggest you learn a little of the basics of python3 if you don't know already.

Here is a simple example that gets the title of the page you request and follows the links it finds:

import requests
from bs4 import BeautifulSoup as bs

url = 'http://some.local.domain/'

response = requests.get(url)
soup = bs(response.text, 'html.parser')

# let's get the title of the page
title = soup.title
print(title)

# let's get all the links in the page
hrefs = []
for link in soup.find_all('a'):
    href = link.get('href')
    print(href)
    if href:
        hrefs.append(href)

link1 = hrefs[0]
link2 = hrefs[1]

# let's follow a link we find in the page (we'll go for the first)
# if it points to an image we want to download
response = requests.get(link1, stream=True)
if response.status_code == 200:
    # save the file under the last part of its URL
    with open(link1.split('/')[-1], 'wb') as f:
        for chunk in response:
            f.write(chunk)

# if the second link is another web page
response = requests.get(link2)
soup = bs(response.text, 'html.parser')

# let's get the title of the new page
title = soup.title
print(title)

Go on the hunt for tutorials on requests and BeautifulSoup; there are tonnes of them... like this one


4 Comments

Thanks Jack, you answered the bulk of my questions by saying that Python was the best route. I've worked through the tutorials available online, but my issue is that most of them deal with scraping a simple page with all the data contained in one HTML file. The page I'm trying to work with is designed with an HTML file containing a table with links to sub-files (jpg/html/gif) which contain the data. If I go further up the site structure there are more pages containing site organization/search functions. I'm assuming site navigation issues would be solved by building a site crawler?
You just find the links with BeautifulSoup like in my example above and then use requests to follow each one. I will edit my answer to include a snippet for you to try out.
Ah okay. I'll work on getting it working with the site I'm trying to work with, thanks again for your help
I've updated my code above to help you out further.
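On the crawler question in the comments: a crawler is essentially a loop of "fetch page, extract links, follow them". A sketch of the link-extraction step it builds on is below, with the HTML inlined so it runs without a network connection; the domain and paths are made up for illustration. urljoin turns the relative hrefs found in a page into absolute URLs you can pass back to requests.get:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(page_html, base_url):
    """Return absolute URLs for every <a href> in the page."""
    soup = BeautifulSoup(page_html, 'html.parser')
    return [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]

# inline HTML standing in for a fetched table-of-links page
html = '<a href="sub/data.html">data</a> <a href="img/chart.gif">chart</a>'
print(extract_links(html, 'http://some.local.domain/'))
# ['http://some.local.domain/sub/data.html', 'http://some.local.domain/img/chart.gif']
```

A crawler would push these URLs onto a queue, keep a set of already-visited URLs to avoid loops, and fetch each one in turn.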
