How can I read a byte array file of strings?

Question

There is a file with the following contents:

b'prefix:input_text'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'

This is my attempt to read the lines and convert them to readable UTF-8 characters, but the output file still shows the same strings:

f = open(input_file, "rb")
for x in f:
  inpcol.append(x.decode('utf-8'))

f = open(pred_file, "r")
for x in f:
  predcol.append(x)

f = open(target_file, "r")
for x in f:
  targcol.append(x)
data =[]
for i in tqdm(range(len(targcol))):
  data.append([inpcol[i],targcol[i],predcol[i]])

pd.DataFrame(data,columns=["input_text","target_text","pred_text"]).to_csv(f"{path}/merge_{predfile}.csv", encoding="utf-8")
print("Done!")

The output file is:

,input_text,target_text,pred_text
0,"b'prefix:input_text'
","target_text
","ﺏﺭﺎﯾ ﺩﺮﮐ ﻮﻀﻌﯿﺗ
"
1,"b'xNeed:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
","ﺞﻨﮕﯾﺪﻧ
","ﺏﺭﺎﯾ ﭗﯾﺩﺍ ﮎﺭﺪﻧ ﯽﮐ ﺖﯿﻣ
"

As you can see, the problem exists for the input lines but not for the target and prediction lines (those are scrambled, but that's okay).

The contents of the file are unclear. Please edit your question, copy the file's contents (from within a text editor), and paste them into it.
– martineau, Commented May 19, 2021 at 17:41
I opened the file with vim, and this is what vim shows; they are just Unicode characters. However, they are actually in the Persian alphabet, something like علی به مدرسه رفت
– Ahmad, Commented May 19, 2021 at 17:49
Then paste the Unicode characters from it into your question, because that is what must be read from the file.
– martineau, Commented May 19, 2021 at 17:51
@martineau They are as shown; there is no difference. However, the target and prediction files are shown in Persian, but the input file is shown as is.
– Ahmad, Commented May 19, 2021 at 18:20
Well, it seems very odd that the contents of the file appear to be in Python bytes string literal syntax, given the b' prefix and the closing ' quote character. Perhaps you could put a copy of the file somewhere (like pastebin.com) and put a link to it in your question.
– martineau, Commented May 19, 2021 at 18:38

furas · Accepted Answer · 2021-05-19 20:19:10Z

1

It seems someone wrote the bytes the wrong way: they used str(bytes) instead of bytes.decode('utf-8'). Or maybe the code was written for Python 2, which treats bytes and strings differently than Python 3.
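To illustrate the suspected cause, here is a minimal sketch of how such a line could have been produced. The variable names are hypothetical and the sample text is shortened from the question; the actual writing code is not known.

```python
# Hypothetical reproduction of the suspected bug; the real writer code is unknown.
original = "oEffect:PersonX در جنگ"
raw = original.encode("utf-8")   # bytes, as they might exist in the writing program

wrong = str(raw)                 # the repr of the bytes object: b'...\xd8\xaf...'
right = raw.decode("utf-8")      # the actual text

print(wrong)   # starts with b' and contains backslash escapes
print(right)   # readable Persian text
```

Writing `wrong` to a file gives lines that look like the ones in the question, while writing `right` gives readable text.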

If you can't correct the code that writes the file, then you have to fix the text after reading it:

text = "b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'"

Strip the b' prefix and the closing ' quote:

text = text[2:-1]

Convert back to bytes using the special encoding 'raw_unicode_escape':

text = text.encode('raw_unicode_escape')

and decode to a string correctly (decode() defaults to UTF-8):

text = text.decode()

And now

print(text)

gives me

oEffect:PersonX در جنگ ___ بازی می کند
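The reason the steps above work can be sketched as follows: every character in the string has a code point below 256, and 'raw_unicode_escape' maps each such character to the single byte with the same value (here the result matches 'latin-1'). The sample is shortened from the one above, and the variable names are only illustrative.

```python
# Sample shortened from the answer; names (inner, raw, fixed) are illustrative.
text = "b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf'"

inner = text[2:-1]                        # drop the b' prefix and the closing quote
assert all(ord(ch) < 256 for ch in inner)  # every character fits in one byte

raw = inner.encode("raw_unicode_escape")   # one byte per character
same = inner.encode("latin-1")             # gives the same bytes for this input
fixed = raw.decode("utf-8")                # reinterpret those bytes as UTF-8

print(type(raw).__name__, len(inner), len(raw))
print(fixed)
```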

EDIT:

It seems the text contains the byte codes as literal characters, with double backslashes as in b'\\xd8'. print() displays them with a single backslash, but print(repr()) shows the double backslashes.

It may need an extra decode/encode round to convert it correctly.

text = "b'xNeed:PersonX \\xd8\\xaf\\xd8\\xb1 \\xd8\\xac\\xd9\\x86\\xda\\xaf'"
print(repr(text))
print(text)

text = text[2:-1]
text = text.encode('raw_unicode_escape')
text = text.decode('unicode_escape')
text = text.encode('raw_unicode_escape')
text = text.decode()
print(text)
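If every line really is a valid Python bytes literal, another option (not part of the steps above) is to let ast.literal_eval parse the line and then decode the result. The sketch below applies this while reading a file; the helper name decode_bytes_literal and the temporary sample file are assumptions for illustration, since the real input file is not available.

```python
import ast
import os
import tempfile


def decode_bytes_literal(line: str) -> str:
    """Hypothetical helper: turn a line like b'...\\xd8\\xaf...' into text."""
    value = ast.literal_eval(line.strip())   # raises if the line is not a literal
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


sample = "oEffect:PersonX در جنگ ___ بازی می کند"

with tempfile.TemporaryDirectory() as tmp:
    input_file = os.path.join(tmp, "input.txt")

    # Build a stand-in for the broken file: each line is str(bytes).
    with open(input_file, "w", encoding="utf-8") as f:
        f.write(str(b"prefix:input_text") + "\n")
        f.write(str(sample.encode("utf-8")) + "\n")

    # Read in text mode and decode each non-empty line.
    with open(input_file, encoding="utf-8") as f:
        inpcol = [decode_bytes_literal(line) for line in f if line.strip()]

print(inpcol)
```

Note that the file is opened in text mode here, and the trailing newline is stripped before parsing. ast.literal_eval raises ValueError or SyntaxError on lines that are not valid literals, so mixed files would need extra handling.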

edited May 19, 2021 at 20:19

answered May 19, 2021 at 19:01

furas


11 Comments

Ahmad Over a year ago

It prints the expected output. However, I am trying to do this inside the loop that reads the file and have not succeeded yet. I will keep trying.

furas Over a year ago

First, you can use print(..) and print(type(..)) to check what you have in the variables when you read in the loop. I can't test your file and code, so I can't help more.

Ahmad Over a year ago

Where should I do this? I tried it in the loop that reads the file, with no luck. I also tried it when writing, with no luck:

for i in tqdm(range(len(targcol))):
    text = inpcol[i]
    text = text[2:-1]
    text = text.encode('raw_unicode_escape')
    text = text.decode()
    data.append([text,targcol[i],predcol[i]])

Ahmad Over a year ago

In the read loop the type is str. Also note that it prints each line without quotation marks; does that make a difference?

Ahmad Over a year ago

Thanks, but I didn't succeed! I don't know whether you would succeed if you put these lines in a file. Anyway, thanks; I'll vote it up.
