
I made a pig latin translator that takes input from the user, translates it, and returns it. I want to add the ability to read the text to translate from a file, but I'm running into an issue: the file isn't being read the way I expect. Here is my code:

from sys import argv
script, filename = argv

file = open(filename, "r")

sentence = file.read()

print sentence

file.close()

The problem is that when I print out the information inside the file it looks like this:

■T h i s   i s   s o m e   t e x t   i n   a   f i l e

Instead of this:

This is some text in a file

I know I could work around the spaces and the odd square character with slicing, but I feel like that is treating a symptom. I want to understand why the text comes out formatted this way so I can fix the cause.

5 Comments
  • Hey, so I can be a bit more accurate in my answer, could you edit your post and put the result of hexdump <filename> and file <filename> from the command line? Assuming you aren't on Windows. Commented Jan 4, 2016 at 3:17
  • Or at least tell us the program you used to make that text file. Commented Jan 4, 2016 at 3:58
  • I used notepad++. As to doing the hexdump I will do that when I get home. Commented Jan 4, 2016 at 15:53
  • I tried doing the hexdump <filename> and it said the command hexdump is not recognized. Same happened with file. Commented Jan 5, 2016 at 16:18
  • @Supetorus: hexdump and file are standard commands on Unix-like systems, which is why Will said "Assuming you aren't on Windows". Commented Jan 6, 2016 at 6:09

4 Answers

4

I believe this is a UTF-16 encoded file, and that ■ character is the Unicode byte order mark (BOM). It could also be another encoding with a byte order mark, but it definitely appears to be a multi-byte encoding.

This is also why you're seeing the whitespace between characters. UTF-16 effectively represents each character as two bytes, but for standard ASCII characters like you're using, the other half of the character is empty (second byte is 0).
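If you want to confirm that, one quick check (just a sketch; text.txt stands in for whatever file you're passing in) is to read the first few raw bytes in binary mode:

with open('text.txt', 'rb') as f:   # 'rb': raw bytes, no decoding
    print repr(f.read(16))
# A UTF-16 LE file typically starts with '\xff\xfe' (the BOM) and has a
# '\x00' byte after each ASCII character, e.g. '\xff\xfeT\x00h\x00i\x00s\x00'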

Try this instead:

from sys import argv
import codecs
script, filename = argv

file = codecs.open(filename, encoding='utf-16')
sentence = file.read()
print sentence
file.close()

Replace encoding='utf-16' with whatever encoding this actually is. You might just need to try a few and experiment.
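If you want to experiment systematically, a rough sketch like this (reusing filename from above; the candidate list is only a guess) prints what each encoding produces so you can spot the one that decodes cleanly:

import codecs

for enc in ('utf-16', 'utf-16-le', 'utf-16-be', 'utf-8-sig', 'utf-8'):
    f = codecs.open(filename, encoding=enc)
    try:
        print enc, '->', repr(f.read())   # repr avoids console encoding surprises
    except UnicodeError:
        print enc, '-> failed to decode'
    finally:
        f.close()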


3 Comments

Typo: the other half of the character is zero. Your answer looks good, pity the OP is not responding.
Thanks! Corrected :) Yeah, I'm hoping OP can show the output of hexdump <file> and file <file> so we can figure out more clearly what the encoding is. I think SO converted it to UTF-8, and hexdumping it on my end doesn't show any known byte-order mark.
Aha. I checked my Notepad++ settings and it creates files in UTF-8. I used the codecs stuff to fix it and it worked. Also, I tried using hexdump and file, as I told another guy in a comment above, and both said the command is not recognized. Perhaps I didn't use them as you meant; I typed hexdump text.txt directly into PowerShell (text.txt being the name of my file), and the same with file.
2

The original file is UTF-16. Here's an example that writes a UTF-16 file and reads it with open vs. io.open, which takes an encoding parameter:

#!python2
import io

sentence = u'This is some text in a file'

with io.open('file.txt','w',encoding='utf16') as f:
    f.write(sentence)

with open('file.txt') as f:
    print f.read()

with io.open('file.txt','r',encoding='utf16') as f:
    print f.read()

Output on US Windows 7 console:

 ■T h i s   i s   s o m e   t e x t   i n   a   f i l e
This is some text in a file

As a guess, I'd say the OP created the text file in Windows Notepad and saved it as "Unicode", which is Microsoft's misnomer for UTF-16 encoding.
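If you'd like to check for a BOM without hexdump, a rough sketch (reusing the file.txt written above) is to compare the first raw bytes against the BOM constants that ship with the codecs module:

import codecs

with open('file.txt', 'rb') as f:
    head = f.read(4)                      # a BOM is at most 4 bytes

for name, bom in [('utf-8-sig', codecs.BOM_UTF8),
                  ('utf-16-le', codecs.BOM_UTF16_LE),
                  ('utf-16-be', codecs.BOM_UTF16_BE)]:
    if head.startswith(bom):
        print 'looks like', name, '-> BOM', repr(bom)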

2 Comments

Great answer! What confused me is that when I tried to hexdump the BOM from the post text, it didn't seem to be a UTF-16 BOM. But I'm guessing that's just because SO uses UTF-8 :)
@Will, the post text was likely decoded in cp437, since that's what my terminal is, and the ■ character is FEh in that encoding. That's part of the UTF-16 BOM :)
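A quick way to see that concretely in Python 2 (just a sketch using the standard codecs module and the cp437 codec):

import codecs

print repr(codecs.BOM_UTF16_LE)       # '\xff\xfe', the UTF-16 little-endian BOM
print repr('\xfe'.decode('cp437'))    # u'\u25a0', the black square character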
1

At first, when I saw everyone responding with stuff about Unicode and UTF, I shied away from reading up on it and trying to fix it, but I'm persistent about learning to program in Python, so I did some research, primarily this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

That was really helpful. So what I can gather is that Notepad++, which I used to write the text file, wrote it in UTF-8, and Python read it in UTF-16. The solution was to import codecs and use codecs.open like this (as Will said above):

from sys import argv
import codecs

script, filename = argv

file = codecs.open(filename, encoding="utf-8")
sentence = file.read()
print sentence
file.close()
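As an aside, if Notepad++ had saved the file as UTF-8 with a BOM, the output would have started with a stray character; Python's "utf-8-sig" codec behaves like utf-8 but silently drops a leading BOM:

file = codecs.open(filename, encoding="utf-8-sig")  # like utf-8, but strips a leading BOM if present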

1 Comment

Weird. UTF-8 shouldn't produce those spaces for chars in the ASCII range. But anyway... As well as that article by SO co-founder Joel, you may like to take a look at unipain by SO veteran Ned Batchelder, which is more Python-specific.
0

Well, the most likely explanation is that your program is reading the file correctly; the bytes in the file just aren't what you expect.

As to why the output looks weird, there could be several causes. However, it looks like you are using Python 2 (the print statement), and since the text appears with a gap after every character, I would assume that the file you are reading is Unicode (UTF-16) encoded text, so that ABC is written as \u0041\u0042\u0043, i.e. two bytes per character.

Either decode the byte string into a Unicode string, or use Python 3 and look into how it handles Unicode.
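For the first option, a minimal sketch (assuming the file really is UTF-16, and reusing filename from the question's code) would be:

raw = open(filename, 'rb').read()   # plain bytes, exactly as stored on disk
text = raw.decode('utf-16')         # now a unicode string; the BOM is consumed here
print text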

