1

I'm trying to download a txt file using Python and sockets, but an error occurs when I decode the content I receive.

I'm using Python 3 and running test.py on Windows, trying to fetch the content of http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt

 python .\test.py linux.vbird.org 80 /linux_basic/0330regularex/regular_express.txt
# this file is named test.py
import socket
import sys

host = sys.argv[1]
port = sys.argv[2]
filename = sys.argv[3]
# creating a socket, using ipv4
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connecting
s.connect((host, int(port)))
print("Connecting successful!\n")
str = "GET %s HTTP/1.0\r\n\r\n" % filename
s.sendall(str.encode('utf-8'))
while 1:
    try:
        buf = s.recv(2048)
    except socket.error as e:
        print("Error receiving data: %s" % e)
        sys.exit(1)
    if not len(buf):
        break
    sys.stdout.write(buf.decode('utf-8'))

I expected to get the content of the given URL, namely the content of the txt file. However, I get the following error message:


Connecting successful!

Traceback (most recent call last):
  File ".\test.py", line 22, in <module>
    sys.stdout.write(buf.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 275: invalid start byte


4
  • The text can be in a different encoding than utf-8, e.g. latin1, cp1250, etc. Commented Oct 25, 2019 at 6:35
  • How do I figure out that file's encoding? Furthermore, what if I don't know the source URL's encoding? Commented Oct 25, 2019 at 7:18
  • Using chardet I can receive the correct data, but there is another problem. Commented Oct 25, 2019 at 8:19
  • The website tells me to use linux.vbird.org, "DO NOT USE vbird.org". Why does this message occur? The host I passed really is linux.vbird.org, so it's confusing... Thanks. Commented Oct 25, 2019 at 8:21

2 Answers

1

HTTP headers are ASCII, or at most ISO-8859-1 (a single-byte encoding of "ü" etc.). They are not UTF-8 (a multi-byte encoding of "ü" etc.). The HTTP body can be in any encoding, so it should be treated as bytes as long as the encoding is unknown.

The encoding can be given in the "charset" attribute of the Content-Type response header for text or HTML content, but it is not required. For HTML it can also be given inside a meta tag. If it is not given, the recipient may fall back to defaults (which might not match the actual encoding) or use heuristics to guess the encoding.
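As a rough sketch of the above: split the raw response at the blank line, decode only the header part as ISO-8859-1, and look for a charset in Content-Type before touching the body. The helper names `split_response` and `charset_from_headers` and the sample response are made up for illustration; they are not the real server's output.

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (header_text, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    # ISO-8859-1 maps every byte to a character, so this decode cannot fail.
    return head.decode("iso-8859-1"), body


def charset_from_headers(header_text: str, default=None):
    """Return the charset named in Content-Type, or default if none is given."""
    for line in header_text.split("\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(":")
        if name.strip().lower() != "content-type":
            continue
        # Parameters follow the media type, separated by ';'
        for param in value.split(";")[1:]:
            key, _, val = param.partition("=")
            if key.strip().lower() == "charset":
                return val.strip().strip('"') or default
    return default


# Hypothetical response, built here only to exercise the helpers.
raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/plain; charset=big5\r\n"
       b"\r\n" + "中文".encode("big5"))

header_text, body = split_response(raw)
charset = charset_from_headers(header_text)
# Keep the body as bytes unless the server actually named an encoding.
text = body.decode(charset) if charset else None
print(charset, text)
```

If no charset is present, `text` stays `None` here and the body remains bytes; guessing is a separate, heuristic step.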

1

This started as an answer to your question in the comments about the message "DO NOT USE vbird.org", but it ended up resolving the other problem too.


linux.vbird.org and vbird.org have the same IP address. They are on one server.

The socket resolves linux.vbird.org to an IP address and uses that address to connect, so the server doesn't know that you want a file from linux.vbird.org. It assumes you want vbird.org, which is the main domain; linux.vbird.org is only a subdomain of vbird.org.

You have to send the header Host: linux.vbird.org in the request to tell the server which subdomain you want the file from.

GET /linux_basic/0330regularex/regular_express.txt HTTP/1.0
Host: linux.vbird.org

With this header, the server sends your file.

I tested this header with your code and, incidentally, it also resolves the encoding problem: your file is in UTF-8 and the server sends it as UTF-8, so buf.decode('utf-8') works.


import socket
import sys

host = 'linux.vbird.org' 
port = '80'
filename = '/linux_basic/0330regularex/regular_express.txt'

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host, int(port)))
print("Connecting successful!\n")

str = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (filename,host)
print(str)

s.sendall(str.encode('utf-8'))
while True:
    try:
        buf = s.recv(2048)
    except socket.error as e:
        print("Error receiving data: %s" % e)
        sys.exit(1)
    if not len(buf):
        break

    #print(buf)
    sys.stdout.write(buf.decode('utf-8'))
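One caveat with decoding each recv() chunk on its own: a chunk can end in the middle of a multi-byte UTF-8 character, and then buf.decode('utf-8') fails even though the file is valid UTF-8. A minimal sketch of one way around it, using the standard library's incremental decoder (the helper name `decode_chunks` and the sample chunks are made up for illustration):

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, tolerating characters split across chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # The decoder buffers an incomplete trailing sequence until the next call.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the data ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Simulate two recv() results that split "é" (bytes 0xC3 0xA9) in the middle.
e_acute = "é".encode("utf-8")
pieces = [b"caf" + e_acute[:1], e_acute[1:] + b"!"]

result = "".join(decode_chunks(pieces))
print(result)
```

In the socket loop, the same idea would mean creating the decoder once before the loop and writing decoder.decode(buf) instead of buf.decode('utf-8').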

4 Comments

After specifying the host as linux.vbird.org in str, I get the right result. One question: I found that the argument to sendall() must be bytes. What is the common way to deal with this? In my code I use encode('utf-8') to convert the string to bytes. Another question: buf is what I get from the server, and I don't know its encoding. Should I first guess its encoding with a tool like chardet and then print it?
The standard method is to use encode('utf-8') to convert it to bytes. Currently most servers probably use UTF-8 as the default encoding for HTML, but sometimes you can meet older standards like cp1250 or iso-8859-1 on Windows servers. So you can use try/except with different encodings or eventually use chardet. But if you download a file from a server, you normally don't display it; you save it to disk without decoding. You open the file for writing in bytes mode, open(..., 'wb'), so you don't have to care about encoding problems.
Sometimes the server also states in the response headers which encoding it used to send the text/HTML data.
If you use the requests module, you mostly don't have to care about encoding because it tries to recognize it: it takes it from the response headers or, if none is given there, guesses it from the content.
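A small sketch of the two approaches from these comments: trying several encodings with try/except, and saving the raw bytes with open(..., 'wb') so no decoding is needed at all. The helper names, the candidate encoding list, and the sample data are assumptions for illustration. Trial decoding is only a heuristic: a wrong encoding can decode without error and still produce the wrong text.

```python
import os
import tempfile


def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1250")):
    """Try each encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 accepts every byte, so this never fails, but it may be wrong.
    return data.decode("iso-8859-1"), "iso-8859-1"


def save_bytes(path: str, data: bytes) -> None:
    """Write the body to disk unchanged; no encoding is involved."""
    with open(path, "wb") as f:
        f.write(data)


# Sample body in cp1250; it is not valid UTF-8, so the first attempt fails.
body = "zażółć".encode("cp1250")
text, used = decode_with_fallback(body)
print(used, text)

# Saving needs no knowledge of the encoding at all.
path = os.path.join(tempfile.mkdtemp(), "download.txt")
save_bytes(path, body)
```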
