7

How can I copy the source code of a website into a text file in Python 3?

EDIT: To clarify my issue, here's what I have:

import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = page.read()
    f.write(pagetext)
    f.close()

extractHTML('http:www.google.com')

I get the following error for the f.write() function:

builtins.TypeError: must be str, not bytes
4
  • Have you tried looking here?: stackoverflow.com/questions/5512811/… Commented Apr 1, 2012 at 21:08
  • Surprisingly, none of the answers (except one) actually addressed the issue. pagetext is NOT a string; it is bytes. To convert it to a string, use f.write(pagetext.decode('utf-8')), which decodes the bytes as UTF-8 and writes the resulting text to the file. Commented Oct 12, 2017 at 23:57
  • @Brandon I tried what you said and got an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8482: invalid start byte. I literally copied my answer without the str() and put f.write(pagetext.decode('utf-8')) in place of f.write(pagetext). Any idea why this is not working for me? If you are using Python 2, that might be why. Commented Oct 13, 2017 at 0:40
  • Does this answer your question? Save HTML of some website in a txt file with python Commented Oct 24, 2020 at 9:06
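
The UnicodeDecodeError mentioned in the comments above is what you would expect when a page is not actually UTF-8 encoded. Here is a minimal sketch of one way to cope with that, assuming you are happy to try a declared charset first and then fall back. The helper name decode_page and the fallback order are my own assumptions, not something from the thread:

```python
def decode_page(raw, declared_charset=None):
    """Decode raw page bytes into text.

    Tries the declared charset (if any), then UTF-8, then Latin-1.
    Latin-1 maps every byte to a character, so the last attempt
    never raises, although the result may not be the intended text.
    """
    for charset in (declared_charset, "utf-8", "latin-1"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Only reached if the candidate list above is changed.
    return raw.decode("utf-8", errors="replace")
```

With urllib.request, the declared charset can usually be read from the response with page.headers.get_content_charset(), which returns None when the server does not send one.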

3 Answers

3
import urllib.request
site = urllib.request.urlopen('http://somesite.com')
data = site.read()
file = open("file.txt", "wb")  # open the file in binary mode
file.write(data)  # write(), not writelines(): iterating over bytes yields ints
file.close()

Untested but should work.

EDIT: Updated for python3

8 Comments

Oops, sorry. What's the issue in python 3?
urllib2 doesn't exist, for starters. I think typically you'd use the urllib.request module (that's where urlopen now lives.)
Oops, seems this is redundant now that OP has updated their post.
I think you will have the same str/bytes problem. The HTTP response has bytes, but you've opened the file for writing str. The simplest way is just to open the file in binary mode (with "wb").
Using wb gives me this error for notalwaysright.com/page/1: TypeError: 'int' does not support the buffer interface
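
The TypeError about 'int' in the last comment is consistent with calling writelines() on a bytes object: writelines() iterates over its argument, and iterating over bytes yields integers rather than byte strings. A small sketch that shows this and writes the bytes with write() instead. The sample bytes and the helper name save_bytes are made up for illustration:

```python
import os
import tempfile


def save_bytes(data, path):
    """Write a bytes object to path without any text encoding step."""
    with open(path, "wb") as f:
        f.write(data)


# Stand-in for what urlopen(...).read() returns.
page_bytes = b"<html><body>hi</body></html>"

# Iterating over bytes gives ints, which is why writelines(data) fails.
first_items = list(page_bytes[:3])

tmp_path = os.path.join(tempfile.mkdtemp(), "page.html")
save_bytes(page_bytes, tmp_path)

with open(tmp_path, "rb") as f:
    round_trip = f.read()
```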
1

Try this.

import urllib.request
def extractHTML(url):
    urllib.request.urlretrieve(url, 'temphtml.txt')

That is easier, but if you still want to do it your way, this is the solution:

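import urllib.request

def extractHTML(url):
    with urllib.request.urlopen(url) as page:
        # Use the charset the server declares; fall back to UTF-8 if none is given.
        charset = page.headers.get_content_charset() or 'utf-8'
        # decode() turns the bytes into text; str() on bytes only gives the b'...' repr.
        pagetext = page.read().decode(charset, errors='replace')
    with open('temphtml.txt', 'w', encoding='utf-8') as f:
        f.write(pagetext)

extractHTML('https://www.google.com')

Your script gave an error saying the argument must be a str, not bytes. Convert the bytes to a string by decoding them. Calling str() on a bytes object does not decode it; it only produces the b'...' representation, so the file would not contain the page text.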

Next I got an error saying no host was given. Google is served over HTTPS, so I used https: instead of http:, but most importantly you forgot to include // after the scheme (https://, not https:).
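
To see why the original URL produced a "no host" error, you can look at how the standard library splits it. This is just a quick check, and the helper name has_host is my own:

```python
from urllib.parse import urlsplit


def has_host(url):
    """Return True if url has both a scheme and a host part."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)


# Without the // after the scheme, everything after "http:" is parsed
# as a path, so there is no host for urlopen() to connect to.
broken = urlsplit('http:www.google.com')
fixed = urlsplit('https://www.google.com')
```

Here broken.netloc is an empty string and broken.path is 'www.google.com', while fixed.netloc is 'www.google.com'.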

0

You probably wanted to create something like this:

import urllib.request

class ExtractHtml:

    def Page(self):
        print("enter the web page name starting with 'http://': ")
        url = input()
        site = urllib.request.urlopen(url)
        data = site.read()
        # Binary mode, because read() returns bytes.
        file = open("D://python_projects/output.txt", "wb")
        file.write(data)
        file.close()


w = ExtractHtml()
w.Page()
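
If you do not need the class or the interactive prompt, the same idea fits in a small function. This is only a sketch under my own assumptions: the name save_page and its parameters are hypothetical, and it copies the raw bytes without decoding them, so the saved file keeps whatever encoding the server used:

```python
import shutil
import urllib.request


def save_page(url, out_path):
    """Stream the response body for url into out_path as raw bytes."""
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as out:
        # copyfileobj reads in chunks, so large pages are not held in memory.
        shutil.copyfileobj(response, out)
    return out_path
```

For example, save_page('https://www.google.com', 'temphtml.txt') would write the page source to temphtml.txt in the current directory.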
