1

I have a problem comparing a string read from a file with a string I entered in the program. They should be equal, but even when I use decode('utf-8') the comparison says they are not. Here's the code:

final = open("info", 'r')
exported = open("final",'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca": #error
        print "ok"

And this is how I write the file that I am trying to read:

comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))

for string in comm_p.strings:
    #print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save each string
    info.write(save)
    info.write("\n")

info.close()

At the top of the file I have # -- coding: utf-8 --

Any ideas?

Add print "%r %r" % (line, "Wykształcenie i praca") right before the comparison line and tell us what it prints. Commented Sep 24, 2012 at 7:49

4 Answers

3

This should do what you need:

# -- coding: utf-8 --
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()

for line in lines:
    if line.strip() == u"Wykształcenie i praca": #error
        print "ok"

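You need to open the file with the right encoding so that each line is decoded to unicode, and since your string is not ASCII, you should mark the literal as unicode with the u prefix. The strip() call removes the trailing newline that readlines() leaves on each line.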

0

First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.

About your current problem:

You're reading a UTF-8 encoded file (probably), but you're reading it as plain bytes. In Python 2, open() doesn't do any decoding for you, so each line is a byte string rather than a unicode string.

So what you need to do (at least):

  • use codecs.open("info", "r", encoding="utf-8") to read the file
  • use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
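
A minimal sketch of those two steps, in case it helps. The temporary file and its contents are made up for illustration (they stand in for the "info" file from the question), and the variable names are mine, not the asker's:

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

target = u"Wykształcenie i praca"

# Hypothetical stand-in for the "info" file from the question.
path = os.path.join(tempfile.mkdtemp(), "info")
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(u"Imię i nazwisko\n")
    out.write(target + u"\n")

# Decode while reading, so each line is Unicode text rather than raw bytes.
with codecs.open(path, "r", encoding="utf-8") as final:
    matches = [line.rstrip() for line in final if line.rstrip() == target]

print(len(matches))
```

Both sides of the comparison are Unicode text here, and rstrip() drops the trailing newline, so the second line of the file should match.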

0

The difference is likely a trailing '\n' character.

readlines() doesn't strip '\n'; see "Best method for reading newline delimited files in Python and discarding the newlines?"
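
To see the effect in isolation, here is a small sketch that uses an in-memory io.StringIO object instead of a real file (that substitution, the sample text, and the variable names are assumptions for illustration only):

```python
# -*- coding: utf-8 -*-
import io

# In-memory stand-in for the file, so the sketch needs nothing on disk.
fake_file = io.StringIO(u"Dane\nWykształcenie i praca\n")
lines = fake_file.readlines()

target = u"Wykształcenie i praca"

# Comparing the raw line fails because it still ends with "\n".
raw_hits = [line for line in lines if line == target]

# Removing the newline first lets the comparison succeed.
stripped_hits = [line for line in lines if line.rstrip(u"\n") == target]

print(repr(lines[1]))  # the repr makes the trailing newline visible
print(len(raw_hits), len(stripped_hits))
```

Printing the repr of the line, as suggested in the comment under the question, is usually the quickest way to spot this kind of invisible difference.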

In general, it is not a good idea to put a Unicode string directly in your code; it would be better to read it from a resource file.

1 Comment

You're right, it's difficult to notice such a small mistake when you think the encoding is causing the error :P

0

Use unicode for string comparison:

>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>

When it comes to strings, unicode is the smartest move :)
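
As a rough illustration of why this matters, the sketch below compares the UTF-8 bytes (what the question's code writes to the file) with the unicode text. The variable names are hypothetical:

```python
# -*- coding: utf-8 -*-
text = u"Wykształcenie i praca"
raw = text.encode("utf-8")  # the bytes that end up in the file

# Bytes and unicode text do not compare equal when non-ASCII characters
# are involved (Python 2 also emits a UnicodeWarning here).
same_before = (raw == text)

# Decoding the bytes first puts both sides in the same type.
same_after = (raw.decode("utf-8") == text)

# The letter "ł" takes two bytes in UTF-8, so the byte string is longer.
extra_bytes = len(raw) - len(text)

print(same_before, same_after, extra_bytes)
```

So either decode what you read from the file, or open the file with an encoding so the decoding happens for you, and keep the literal as unicode.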
