I have a file with the following contents:

b'prefix:input_text'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'

This is my attempt to read the lines and convert them to readable UTF-8 characters, but the output file still shows the same strings:

f = open(input_file, "rb")
for x in f:
  inpcol.append(x.decode('utf-8'))

f = open(pred_file, "r")
for x in f:
  predcol.append(x)

f = open(target_file, "r")
for x in f:
  targcol.append(x)
data =[]
for i in tqdm(range(len(targcol))):
  data.append([inpcol[i],targcol[i],predcol[i]])

pd.DataFrame(data,columns=["input_text","target_text","pred_text"]).to_csv(f"{path}/merge_{predfile}.csv", encoding="utf-8")
print("Done!")

The output file is:

,input_text,target_text,pred_text
0,"b'prefix:input_text'
","target_text
","ﺏﺭﺎﯾ ﺩﺮﮐ ﻮﻀﻌﯿﺗ
"
1,"b'xNeed:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
","ﺞﻨﮕﯾﺪﻧ
","ﺏﺭﺎﯾ ﭗﯾﺩﺍ ﮎﺭﺪﻧ ﯽﮐ ﺖﯿﻣ
"

As you can see, the problem exists for the input lines but not for the target and prediction lines (they look scrambled here, but that's okay).

  • The contents of the file are unclear. Please edit your question: copy the file's contents (from within a text editor) and paste them into it. Commented May 19, 2021 at 17:41
  • I opened the file with vim; they are just Unicode chars. That is what vim shows. However, they are actually in the Persian alphabet, something like علی به مدرسه رفت Commented May 19, 2021 at 17:49
  • Then paste the Unicode characters from it into your question, because that is what must be read from the file. Commented May 19, 2021 at 17:51
  • @martineau They are as shown; there is no difference. However, the target and prediction files are shown in Persian, but the input file is as is. Commented May 19, 2021 at 18:20
  • Well, it seems very odd that the contents of the file appear to be in Python bytes string literal syntax, given the b' prefix and the ending ' quote character. Perhaps you could put a copy of the file somewhere (like pastebin.com) and put a link to it into your question. Commented May 19, 2021 at 18:38

1 Answer

It seems someone wrote the bytes the wrong way: they used str(bytes) instead of bytes.decode('utf-8'). Or maybe the code was written for Python 2, which treats bytes and strings differently than Python 3 does.
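As a minimal sketch of that suspected cause (an assumed reproduction, not the code that actually produced the file), the following shows how str() on a bytes object yields the b'...' text seen in the question, while decode() yields the readable string. The names original, raw, wrong, and right are only for illustration.

```python
# Assumed reproduction of how the b'...' text could have ended up in the file.
original = b"oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf".decode("utf-8")
raw = original.encode("utf-8")   # UTF-8 bytes, as produced by many pipelines

wrong = str(raw)                 # the repr of the bytes object, quotes and escapes included
right = raw.decode("utf-8")      # the actual text

print(wrong)   # starts with b'oEffect:PersonX followed by backslash escapes
print(right)   # readable Persian text
```

Writing wrong to a file stores the backslash escapes as ordinary characters, which is why a later decode('utf-8') on the file's bytes changes nothing.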


If you can't correct the code which writes it, then you have to fix the text:

text = "b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'"

Crop the leading b' and the trailing ':

text = text[2:-1]

Convert it back to bytes using the special encoding 'raw_unicode_escape':

text = text.encode('raw_unicode_escape')

and decode the bytes to a string correctly (decode() defaults to UTF-8):

text = text.decode()

And now

print(text)

gives me

oEffect:PersonX در جنگ ___ بازی می کند

EDIT:

It seems the file has the codes stored as text with escaped backslashes, like b'\\xd8'. print() displays them with a single backslash, while print(repr()) shows them with double backslashes.

It may need one more decode/encode round to convert it correctly.

text = "b'xNeed:PersonX \\xd8\\xaf\\xd8\\xb1 \\xd8\\xac\\xd9\\x86\\xda\\xaf'"
print(repr(text))
print(text)

text = text[2:-1]
text = text.encode('raw_unicode_escape')
text = text.decode('unicode_escape')
text = text.encode('raw_unicode_escape')
text = text.decode()
print(text)
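When this is applied inside a file-reading loop, one likely pitfall is that each line still ends with a newline, so text[2:-1] removes the newline instead of the closing quote. The sketch below is an alternative to the encode/decode chain above, not part of the original answer: it strips the line and lets ast.literal_eval parse the bytes literal. The helper name decode_bytes_repr_line and the sample file contents are assumptions for illustration.

```python
import ast
import os
import tempfile

def decode_bytes_repr_line(line):
    # Hypothetical helper: turn one line of the form b'...' back into text.
    line = line.strip()  # lines read from a file still end with a newline
    if line.startswith(("b'", 'b"')):
        # literal_eval safely parses the bytes literal, including its escapes
        return ast.literal_eval(line).decode("utf-8")
    return line  # leave ordinary lines untouched

# Build a small stand-in for the input file (assumed contents).
sample_lines = [
    "b'prefix:input_text'",
    str(b"xNeed:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf"),
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    input_file = tmp.name

inpcol = []
with open(input_file, encoding="utf-8") as f:  # text mode, no manual decode
    for line in f:
        inpcol.append(decode_bytes_repr_line(line))
os.remove(input_file)

print(inpcol)
```

ast.literal_eval only evaluates Python literals, so it does not execute arbitrary code from the file, but it raises an exception on a malformed line; wrapping the call in try/except may be worthwhile for real data.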

11 Comments

It prints the expected output. However, I am trying to do this in the loop that reads the file and have not succeeded yet. I will keep trying.
First, you can use print(..) and print(type(..)) to check what you have in the variables when you read in the loop. I can't test your file and code, so I can't help more.
Where should I do this? I tried it in the loop that reads the file, with no luck, and I also tried it when writing, with no luck:
for i in tqdm(range(len(targcol))):
  text = inpcol[i]
  text = text[2:-1]
  text = text.encode('raw_unicode_escape')
  text = text.decode()
  data.append([text,targcol[i],predcol[i]])
In the read loop the type is str. Also note that it prints each line without quotation marks; does that matter?
Thanks, but I didn't succeed! I don't know whether it would work for you if you put the lines in a file. Anyway, thanks, I voted it up.