Encoding in Python

Question

I have a problem comparing a string read from a file with a string I entered in the program. They should be equal, but whether or not I use decode('utf-8'), the comparison says they are not. Here's the code:

final = open("info", 'r')
exported = open("final",'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca": #error
        print "ok"

And this is how I save the file that I am trying to read:

comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))

for string in comm_p.strings:
    # print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save
    info.write(save)
    info.write("\n")

info.close()

And at the top of the file I have # -*- coding: utf-8 -*-

Any ideas?

Add print "%r %r" % (line, "Wykształcenie i praca") right before the comparison line and tell us what it says.
– georg, Commented Sep 24, 2012 at 7:49

Burhan Khalid · Accepted Answer · 2012-09-24 07:57:15Z

3

This should do what you need:

# -*- coding: utf-8 -*-
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()

for line in lines:
    if line.strip() == u"Wykształcenie i praca": #error
        print "ok"

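You need to open the file with the right encoding, and since your string literal contains non-ASCII characters, you should mark it as Unicode with the u prefix. The strip() call removes the trailing newline that readlines() keeps on each line.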
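The two points in this answer can be checked in isolation. The following is only a minimal sketch: the temporary file stands in for the asker's "info" file, and the names TARGET, path, raw_line and text_line are made up for the illustration. It is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

TARGET = u"Wykształcenie i praca"

# Hypothetical stand-in for the "info" file: one UTF-8 encoded line.
path = os.path.join(tempfile.mkdtemp(), "info")
with io.open(path, "w", encoding="utf-8") as out:
    out.write(TARGET + u"\n")

# Binary mode returns undecoded bytes, like plain open() in Python 2.
with io.open(path, "rb") as raw:
    raw_line = raw.readline()

# Text mode with an explicit encoding returns decoded Unicode text.
with io.open(path, encoding="utf-8") as decoded:
    text_line = decoded.readline()

# The decoded line still carries its newline, so a direct comparison fails.
equal_with_newline = (text_line == TARGET)

# Stripping the line first makes the comparison succeed.
equal_after_strip = (text_line.strip() == TARGET)

# Decoding the raw bytes by hand gives the same text as io.open did.
raw_matches = (raw_line.decode("utf-8").strip() == TARGET)
```

Here equal_with_newline comes out False while the other two flags come out True, which suggests that both the decoding and the strip() are needed for a reliable match.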

answered Sep 24, 2012 at 7:57

Burhan Khalid


Tim Pietzcker · Accepted Answer · 2012-09-24 07:54:50Z

0

First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.

About your current problem:

You're reading a UTF-8 encoded file (probably), but you're reading it as raw bytes. In Python 2, open() doesn't do any decoding for you.

So what you need to do (at least):

use codecs.open("info", "r", encoding="utf-8") to read the file
use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
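Put together, the two steps above might look like the sketch below. The helper count_matches, the sample file and its contents are assumptions made for the illustration, not part of the asker's program. codecs.open is the call named in this answer; io.open accepts the same encoding argument.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def count_matches(path, wanted, encoding="utf-8"):
    """Count lines in `path` equal to `wanted` once trailing whitespace is removed."""
    matches = 0
    with codecs.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            # rstrip() drops the trailing newline before comparing.
            if line.rstrip() == wanted:
                matches += 1
    return matches


# Hypothetical sample file with two UTF-8 encoded lines.
sample_path = os.path.join(tempfile.mkdtemp(), "info")
with codecs.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"Imię i nazwisko\nWykształcenie i praca\n")

found = count_matches(sample_path, u"Wykształcenie i praca")
```

Keeping the encoding as a parameter is a design choice of this sketch, so the same helper can be reused if the file turns out not to be UTF-8.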

answered Sep 24, 2012 at 7:54

Tim Pietzcker


Community · Accepted Answer · 2017-05-23 12:27:53Z

0

The difference is likely a trailing '\n' character.

readlines() doesn't strip '\n'; see "Best method for reading newline delimited files in Python and discarding the newlines?"

In general, it is not a good idea to put a Unicode string literal in your code; it is better to read it from a resource file.
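The newline effect is easy to see without touching a file. This is a small sketch that uses an in-memory io.StringIO buffer as a stand-in for the real file; the buffer contents and variable names are invented for the example.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io

# In-memory stand-in for a text file with two lines.
buffer = io.StringIO(u"Wykształcenie i praca\nInne\n")
lines = buffer.readlines()

# repr() makes the trailing newline visible, as suggested in the comment
# under the question.
print(repr(lines[0]))

# readlines() keeps the line ending, so the raw line does not match.
raw_equal = (lines[0] == u"Wykształcenie i praca")

# Removing the newline from each line restores the expected match.
stripped = [line.rstrip(u"\n") for line in lines]
stripped_equal = (stripped[0] == u"Wykształcenie i praca")
```

Printing the repr of a line before comparing it is a cheap way to tell a whitespace problem from an encoding problem.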

edited May 23, 2017 at 12:27

Community Bot

answered Sep 24, 2012 at 7:50

Ofir


1 Comment

adaniluk Over a year ago

You're right, it's difficult to notice such a small mistake when you think the encoding is causing the error :P

Anuj · Accepted Answer · 2012-09-24 07:59:05Z

0

Use Unicode for string comparison:

>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>

When it comes to strings, Unicode is the smartest move :)

answered Sep 24, 2012 at 7:59

Anuj
