Save HTML Source Code to File

Question

How can I copy the source code of a website into a text file in Python 3?

EDIT: To clarify my issue, here's what I have:

import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = page.read()
    f.write(pagetext)
    f.close()

extractHTML('http:www.google.com')

I get the following error for the f.write() function:

builtins.TypeError: must be str, not bytes

Have you tried looking here?: stackoverflow.com/questions/5512811/…
– Jack, Commented Apr 1, 2012 at 21:08
Surprisingly, none of the answers (except one) actually addressed the issue. pagetext is NOT a string; it's actually bytes. So to convert it to a string, you need to use f.write(pagetext.decode('utf-8')), which decodes the bytes as UTF-8 and writes the resulting string to the file.
– Brandon, Commented Oct 12, 2017 at 23:57
@Brandon I tried what you said and got an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8482: invalid start byte. I literally copied my answer without the str() and put f.write(pagetext.decode('utf-8')) in place of f.write(pagetext). Any idea why this is not working for me? If you are using Python 2, that might be why.
– Xantium, Commented Oct 13, 2017 at 0:40
Does this answer your question? Save HTML of some website in a txt file with python
– Gino Mempin, Commented Oct 24, 2020 at 9:06
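
The comments above come down to one point: urlopen(...).read() returns bytes, and turning them into text needs the right charset. Below is a minimal sketch, assuming the charset is taken from the response headers when available and that a fallback is acceptable when it is missing or wrong. The helper name decode_page and the latin-1 fallback are illustrative choices, not something from the original posts.

```python
from typing import Optional


def decode_page(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode downloaded bytes, trying the declared charset, then UTF-8.

    Falls back to latin-1, which maps every byte to a character and so
    never raises, although the resulting text may be wrong.
    """
    for encoding in (charset, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    return raw.decode("latin-1")


# A lone 0xa0 byte is not valid UTF-8, as in the error quoted above.
sample = b"price:\xa0100"
text = decode_page(sample)
```

With a real response this could be called as decode_page(page.read(), page.headers.get_content_charset()), and the result written to a file opened with open('temphtml.txt', 'w', encoding='utf-8').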

Jack · Accepted Answer · 2012-04-02 07:19:33Z

3

import urllib.request
site = urllib.request.urlopen('http://somesite.com')
data = site.read()
file = open("file.txt", "wb")  # open file in binary mode
file.write(data)  # write(), not writelines(): iterating over bytes yields ints
file.close()

Untested but should work.

EDIT: Updated for python3

edited Apr 2, 2012 at 7:19

answered Apr 1, 2012 at 20:43

Jack



8 Comments

Jack Over a year ago

Oops, sorry. What's the issue in python 3?

DSM Over a year ago

urllib2 doesn't exist, for starters. I think typically you'd use the urllib.request module (that's where urlopen now lives.)

Jack Over a year ago

Oops, seems this is redundant now that OP has updated their post.

Thomas K Over a year ago

I think you will have the same str/bytes problem. The HTTP response has bytes, but you've opened the file for writing str. The simplest way is just to open the file in binary mode (with "wb").

rassa45 Over a year ago

Using wb gives me this error for notalwaysright.com/page/1: TypeError: 'int' does not support the buffer interface
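
The TypeError in the last comment is consistent with writelines() being handed a bytes object: iterating over bytes yields integers, and a binary file cannot write an int. The sketch below uses a temporary file and made-up data in place of a real download to show write() succeeding where writelines() fails.

```python
import os
import tempfile

data = b"<html><body>hello</body></html>"  # stand-in for site.read()

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# writelines() iterates its argument; iterating bytes yields ints.
writelines_failed = False
with open(path, "wb") as f:
    try:
        f.writelines(data)
    except TypeError:
        writelines_failed = True

# write() takes the whole bytes object in one call.
with open(path, "wb") as f:
    f.write(data)

with open(path, "rb") as f:
    saved = f.read()

os.remove(path)
```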


Xantium · Accepted Answer · 2017-10-12 23:46:29Z

1

Try this.

import urllib.request
def extractHTML(url):
    urllib.request.urlretrieve(url, 'temphtml.txt')

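It is easier, but if you still want to do it your way, this is the solution:

import urllib.request

def extractHTML(url):
    page = urllib.request.urlopen(url)
    # read() returns bytes; decode them to get a str
    pagetext = page.read().decode('utf-8', errors='replace')
    # write the text back out as UTF-8
    f = open('temphtml.txt', 'w', encoding='utf-8')
    f.write(pagetext)
    f.close()

extractHTML('https://www.google.com')

Your script gave an error saying it must be a string, so decode the bytes to a string with .decode(). Wrapping them in str() also silences the error, but it writes the b'...' representation of the bytes rather than the page text.

Next I got an error saying no host was given. Google is a secured site, so use https: rather than http:, and most importantly you forgot to include // after the scheme (http://, not http:).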
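
The "no host" error follows from how the URL is parsed: without // after the scheme, the host name is read as a path. A short sketch using the standard urllib.parse.urlsplit function illustrates this; has_host is a hypothetical helper name introduced only for this example.

```python
from urllib.parse import urlsplit


def has_host(url: str) -> bool:
    """Return True if the URL has both a scheme and a network location."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)


# The URL from the question: the host ends up in .path, .netloc is empty.
broken = urlsplit("http:www.google.com")

# With // after the scheme, the host is parsed into .netloc.
fixed = urlsplit("https://www.google.com")
```

A check like has_host(url) before calling urlopen would catch this kind of typo early.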

answered Oct 12, 2017 at 23:46

Xantium


user3105498 · Accepted Answer · 2013-12-16 01:50:49Z

0

You probably wanted to create something like this:

import urllib.request

class ExtractHtml():

    def Page(self):
        # ask for the URL, download the page, and save the raw bytes
        print("enter the web page name starting with 'http://': ")
        url = input()
        site = urllib.request.urlopen(url)
        data = site.read()
        file = open("D://python_projects/output.txt", "wb")
        file.write(data)
        file.close()


w = ExtractHtml()
w.Page()
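
The same binary-mode approach can also be written as a plain function that takes the URL and the output path as arguments. This is a minimal sketch, not a definitive implementation: the name save_page and its parameters are assumptions made for illustration, and context managers are used so that both the response and the file are closed even if an error occurs.

```python
import shutil
import urllib.request


def save_page(url: str, dest: str) -> None:
    """Download url and write the raw response bytes to dest."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks, so large pages are not held in memory at once.
        shutil.copyfileobj(response, out)
```

For example, save_page('https://www.google.com', 'temphtml.txt') would save the page bytes unchanged, which sidesteps the str/bytes error because nothing is decoded.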

answered Dec 16, 2013 at 1:50

user3105498
