I have a file with the following contents:
b'prefix:input_text'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
This is my attempt to read the lines and convert them to readable UTF-8 characters, but the output file still shows the same strings:
f = open(input_file, "rb")
for x in f:
    inpcol.append(x.decode('utf-8'))
f = open(pred_file, "r")
for x in f:
    predcol.append(x)
f = open(target_file, "r")
for x in f:
    targcol.append(x)
data = []
for i in tqdm(range(len(targcol))):
    data.append([inpcol[i], targcol[i], predcol[i]])
pd.DataFrame(data, columns=["input_text", "target_text", "pred_text"]).to_csv(f"{path}/merge_{predfile}.csv", encoding="utf-8")
print("Done!")
The output file is:
,input_text,target_text,pred_text
0,"b'prefix:input_text'
","target_text
","ﺏﺭﺎﯾ ﺩﺮﮐ ﻮﻀﻌﯿﺗ
"
1,"b'xNeed:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
","ﺞﻨﮕﯾﺪﻧ
","ﺏﺭﺎﯾ ﭗﯾﺩﺍ ﮎﺭﺪﻧ ﯽﮐ ﺖﯿﻣ
"
As you can see, the problem exists for the input lines but not for the target and prediction lines (those look scrambled, but that's okay).
In vim, they are just Unicode characters; that is what vim shows. However, they are actually in the Persian alphabet, something like علی به مدرسه رفت
The input file seems to contain text in Python bytes string literal syntax, because of the b' prefix and the ending ' quote character. Perhaps you could put a copy of the file somewhere (like pastebin.com) and put a link to it in your question.
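If the input file really does hold the text of Python bytes literals (the characters b, ', backslash, x, d, 8 and so on, rather than raw UTF-8 bytes), then decoding the line as UTF-8 just returns that same text. A minimal sketch of one way to recover the Persian text, assuming each line is a well-formed bytes literal; the helper name decode_bytes_literal and the sample line are made up for illustration:

```python
import ast


def decode_bytes_literal(line: str) -> str:
    """Return the text that a line like "b'...\\xd8\\xaf...'" represents.

    Hypothetical helper: assumes the line is a well-formed Python bytes
    literal whose payload is UTF-8. Other lines are returned stripped.
    """
    text = line.strip()
    if text.startswith(("b'", 'b"')):
        # ast.literal_eval parses the literal into a real bytes object
        # without executing arbitrary code.
        return ast.literal_eval(text).decode("utf-8")
    return text


# Raw string, so the backslashes stay literal, as they are in the file.
sample = r"b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

In the loop from the question, this would presumably mean opening input_file in text mode and appending decode_bytes_literal(x) instead of x.decode('utf-8').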