
I made a pig latin translator that takes input from the user, translates it, and returns it. I want to add the ability to read the text to translate from a file, but I'm running into an issue: the file isn't being read the way I expect. Here is my code:

from sys import argv
script, filename = argv

file = open(filename, "r")

sentence = file.read()

print sentence

file.close()

The problem is that when I print out the information inside the file it looks like this:

■T h i s   i s   s o m e   t e x t   i n   a   f i l e

Instead of this:

This is some text in a file

I know I could work around the spaces and the odd square character with slicing, but I feel like that is treating a symptom. I want to understand why the text comes out formatted this way so I can fix the cause.

5 Comments
  • Hey, so I can be a bit more accurate in my answer, could you edit your post and put the result of hexdump <filename> and file <filename> from the command line? Assuming you aren't on Windows. Commented Jan 4, 2016 at 3:17
  • Or at least tell us the program you used to make that text file. Commented Jan 4, 2016 at 3:58
  • I used notepad++. As to doing the hexdump I will do that when I get home. Commented Jan 4, 2016 at 15:53
  • I tried doing the hexdump <filename> and it said the command hexdump is not recognized. Same happened with file. Commented Jan 5, 2016 at 16:18
  • @Supetorus: hexdump and file are standard commands on Unix-like systems, which is why Will said "Assuming you aren't on Windows". Commented Jan 6, 2016 at 6:09

4 Answers

4

I believe this is a UTF-16 encoded file, and that ■ character is the Unicode byte order mark (BOM). It could also be another encoding with a byte order mark, but it definitely appears to be a multi-byte encoding.

This is also why you're seeing the whitespace between characters. UTF-16 effectively represents each character as two bytes, but for standard ASCII characters like you're using, the other half of the character is empty (second byte is 0).
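If you want to confirm that, one quick check (just a sketch; text.txt stands in for whatever file you're passing in) is to read the first few raw bytes in binary mode:

with open('text.txt', 'rb') as f:   # 'rb': raw bytes, no decoding
    print repr(f.read(16))
# A UTF-16 LE file typically starts with '\xff\xfe' (the BOM) and has a
# '\x00' byte after each ASCII character, e.g. '\xff\xfeT\x00h\x00i\x00s\x00'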

Try this instead:

from sys import argv
import codecs
script, filename = argv

file = codecs.open(filename, encoding='utf-16')
sentence = file.read()
print sentence
file.close()

Replace encoding='utf-16' with whatever encoding this actually is. You might just need to try a few and experiment.
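If you want to experiment systematically, a rough sketch like this (reusing filename from above; the candidate list is only a guess) prints what each encoding produces so you can spot the one that decodes cleanly:

import codecs

for enc in ('utf-16', 'utf-16-le', 'utf-16-be', 'utf-8-sig', 'utf-8'):
    f = codecs.open(filename, encoding=enc)
    try:
        print enc, '->', repr(f.read())   # repr avoids console encoding surprises
    except UnicodeError:
        print enc, '-> failed to decode'
    finally:
        f.close()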


3 Comments

Typo: the other half of the character is zero. Your answer looks good, pity the OP is not responding.
Thanks! Corrected :) Yeah, I'm hoping OP can show the output of hexdump <file> and file <file> so we can figure out more clearly what the encoding is. I think SO converted it to UTF-8, and hexdumping it on my end doesn't show any known byte-order mark.
Aha. I checked my Notepad++ settings and it creates files in UTF-8. I used the codecs stuff to fix it and it worked. Also, I tried using hexdump and file, as I told another guy in a comment above, and both said the command is not recognized. Perhaps I didn't use them as you meant; I typed hexdump text.txt directly into PowerShell (text.txt being the name of my file), and the same with file.
2

The original file is UTF-16. Here's an example that writes a UTF-16 file and reads it with open vs. io.open, which takes an encoding parameter:

#!python2
import io

sentence = u'This is some text in a file'

with io.open('file.txt','w',encoding='utf16') as f:
    f.write(sentence)

with open('file.txt') as f:
    print f.read()

with io.open('file.txt','r',encoding='utf16') as f:
    print f.read()

Output on US Windows 7 console:

 ■T h i s   i s   s o m e   t e x t   i n   a   f i l e
This is some text in a file

As a guess, I'd say the OP created the text file in Windows Notepad and saved it as "Unicode", which is Microsoft's misnomer for UTF-16 encoding.
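If you'd like to check for a BOM without hexdump, a rough sketch (reusing the file.txt written above) is to compare the first raw bytes against the BOM constants that ship with the codecs module:

import codecs

with open('file.txt', 'rb') as f:
    head = f.read(4)                      # a BOM is at most 4 bytes

for name, bom in [('utf-8-sig', codecs.BOM_UTF8),
                  ('utf-16-le', codecs.BOM_UTF16_LE),
                  ('utf-16-be', codecs.BOM_UTF16_BE)]:
    if head.startswith(bom):
        print 'looks like', name, '-> BOM', repr(bom)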

2 Comments

Great answer! What confused me is that when I tried to hexdump the BOM from the post text, it didn't seem to be a UTF-16 BOM. But I'm guessing that's just because SO uses UTF-8 :)
@Will, the post text was likely decoded in cp437, since that's what my terminal is, and the ■ character is FEh in that encoding. That's part of the UTF-16 BOM :)
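A quick way to see that concretely in Python 2 (just a sketch using the standard codecs module and the cp437 codec):

import codecs

print repr(codecs.BOM_UTF16_LE)       # '\xff\xfe', the UTF-16 little-endian BOM
print repr('\xfe'.decode('cp437'))    # u'\u25a0', the black square character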
1

At first, when I saw everyone responding with stuff about Unicode and UTF, I shied away from reading up on it and trying to fix it, but I'm persistent about learning to program in Python, so I did some research, primarily this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

That was really helpful. So what I can gather is that Notepad++, which I used to write the text file, wrote it in UTF-8, and Python read it in UTF-16. The solution was to import codecs and use codecs.open like this (as Will said above):

from sys import argv
import codecs

script, filename = argv

file = codecs.open(filename, encoding="utf-8")
sentence = file.read()
print sentence
file.close()
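As an aside, if Notepad++ had saved the file as UTF-8 with a BOM, the output would have started with a stray character; Python's "utf-8-sig" codec behaves like utf-8 but silently drops a leading BOM:

file = codecs.open(filename, encoding="utf-8-sig")  # like utf-8, but strips a leading BOM if present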

1 Comment

Weird. UTF-8 shouldn't produce those spaces for chars in the ASCII range. But anyway... As well as that article by SO co-founder Joel, you may like to take a look at unipain by SO veteran Ned Batchelder, which is more Python-specific.
0

Well, the most likely explanation is that your program is reading the file correctly; the bytes in the file just aren't what you expect.

As to why the output looks weird, there could be several causes. However, it looks like you are using Python 2 (the print statement), and since the text appears with a gap after every character, I would assume that the file you are reading is Unicode (UTF-16) encoded text, so that ABC is written as \u0041\u0042\u0043, i.e. two bytes per character.

Either decode the byte string into a Unicode string, or use Python 3 and look into how it handles Unicode.
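For the first option, a minimal sketch (assuming the file really is UTF-16, and reusing filename from the question's code) would be:

raw = open(filename, 'rb').read()   # plain bytes, exactly as stored on disk
text = raw.decode('utf-16')         # now a unicode string; the BOM is consumed here
print text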

