
My brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library reference, but I have a few problems. The httplib.HTTPConnection and request concepts are new to me, and I don't understand whether they download an HTML script, like a cookie or an instance. If you do both of those, do you get the source of a website page? And what are some terms I would need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have

It would also be nice if you could tell me your opinions of Python 2.7 vs. 3.1.

2 Comments

  • Which Python module or library are you using? What is this get you speak of?
  • @David - I fixed my specifications

6 Answers


Use Python 2.7; it has more third-party libraries at the moment. (Edit: see below.)

I recommend using the stdlib module urllib2; it will let you fetch web resources comfortably. Example:

import urllib2

response = urllib2.urlopen("http://google.de")
page_source = response.read()

For parsing the code, have a look at BeautifulSoup.

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have
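If that is the goal, a minimal sketch of it with urllib2 and BeautifulSoup 4 might look like this (example.com and the replacement URL are made-up placeholders, and beautifulsoup4 has to be installed first):

import urllib2
from bs4 import BeautifulSoup  # pip install beautifulsoup4

response = urllib2.urlopen("http://example.com")
soup = BeautifulSoup(response.read(), "html.parser")

# point every <img> at an image of our own
# (REPLACEMENT_URL is a hypothetical placeholder)
REPLACEMENT_URL = "http://localhost/my_image.png"
for img in soup.find_all("img"):
    img["src"] = REPLACEMENT_URL

modified_page = str(soup)  # the modified HTML as a string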

Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.
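For comparison, the same fetch with requests is essentially a one-liner (a sketch assuming requests is installed):

import requests  # pip install requests

page_source = requests.get("http://google.de").text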


2 Comments

Just to nitpick: what you get back from urlopen isn't a request object, it's a response object.
On Python 3.4, to install requests: pip install requests

An example with Python 3 and the requests library, as mentioned by @leoluk:

pip install requests

Script req.py:

import requests

url = 'http://localhost'

# in case you need to send a session cookie
cd = {'sessionid': '123..'}

r = requests.get(url, cookies=cd)
# or, without the cookie: r = requests.get(url)
print(r.text)

Now, execute it and you will get the HTML source of localhost!

python3 req.py
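If you need cookies to persist across several requests instead of passing them by hand, requests also provides a Session object that stores them for you; a minimal sketch (the localhost paths are placeholders):

import requests

s = requests.Session()
s.get('http://localhost/login')   # any cookies the server sets here...
r = s.get('http://localhost/page')  # ...are sent back automatically here
print(r.text)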



If you are using Python 3.x, you don't need to install any third-party libraries; this is built directly into the standard library. The old urllib2 functionality now lives under urllib.request:

from urllib import request

response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
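If you don't want to hard-code the charset, the response headers usually declare it; a sketch that reads it from the Content-Type header and falls back to UTF-8 when it is missing:

from urllib import request

response = request.urlopen("https://www.google.com")
# read the declared charset, defaulting to UTF-8 if the header omits it
charset = response.headers.get_content_charset() or "utf-8"
page_source = response.read().decode(charset)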



The first thing you need to do is read the HTTP spec, which explains what you can expect to receive over the wire. The data returned in the content will be the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short just about anything, and you have no access to that. You only get the HTML that the server sent you. In the case of a static HTML page, then yes, you will be seeing the "source". But for anything else, you see the generated HTML, not the source.

When you say "modify the page and return the modified page", what do you mean?
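Since the question mentions httplib.HTTPConnection, here is a minimal Python 2.7 sketch of what that exchange looks like at the connection level (example.com is a placeholder):

import httplib

conn = httplib.HTTPConnection("example.com")
conn.request("GET", "/")                 # send the request line and headers
response = conn.getresponse()            # read the status line and headers
print response.status, response.reason   # e.g. 200 OK
html = response.read()                   # the HTML the server sent back
conn.close()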

3 Comments

For all img files on a certain page, replace them with a new one.
The link you sent me is very big. What is the minimum I should read?
Google search for information about HTTP. This is the underlying protocol that carries the HTML from the server to your browser. I assume you already understand HTML and have a strategy for parsing it. If not, all the pieces are available, but you will have some research and learning to do to put them together.

All of the above may fail on an HTTPS request behind Cloudflare, which blocks requests that use the default user agent. You can try this to fetch both HTTP and HTTPS pages:

import requests
url = 'https://your.link.here'
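# a browser-like User-Agent helps avoid Cloudflare's bot blocking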
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    print(response.text)
else:
    print(f'Request failed with status code: {response.status_code}')



Here is some code for this task:

import requests
from requests.exceptions import RequestException
from datetime import datetime
import urllib.parse

def fetch_url(url, retries=3):
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    }

    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10, allow_redirects=True)
            if response.status_code == 200:
                response.encoding = response.apparent_encoding
                return response.text
            else:
                print(f"Error: {response.status_code}")
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")

    return None

def get_filename_from_url(url):
    parsed_url = urllib.parse.urlparse(url)
    domain = parsed_url.netloc.replace("www.", "")
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    filename = f"{domain}_{timestamp}.html"
    return filename

url = input("Enter the URL: ")

source_code = fetch_url(url)

if source_code:
    filename = get_filename_from_url(url)
    with open(filename, "w", encoding="utf-8") as file:
        file.write(source_code)
    print(f"The source code has been saved to {filename}")
else:
    print("Failed to retrieve the webpage after multiple attempts.")

