I'm new to serious programming. While writing a Python program, I encountered strings like these when reading from a file:
Îêåàí Åëüçè - Ìàéæå âåñíà
Ëÿïèñ Òðóáåöêîé - Ñâÿùåííûé Îãîíü
These are actually supposed to be Cyrillic text in cp1251 (windows-1251), so the strings are victims of a wrong encoding. (I found this out after a long search, with the help of this site: Universal Cyrillic Decoder.)
The detect function in the chardet module can also identify the encoding:
chardet.detect('Îêåàí Åëüçè - Ìàéæå âåñíà'.decode('utf-8').encode('windows-1252'))
which gives:
{'confidence': 0.7679697235616183, 'encoding': 'windows-1251'}
After doing the following, I'm able to get the intended string:
string.decode('utf-8').encode('windows-1252').decode('windows-1251').encode('utf-8')
which gives:
Океан Ельзи - Майже весна and
Ляпис Трубецкой - Священный Огонь
respectively for the aforementioned strings.
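For anyone on Python 3, the same round trip can be sketched as below. This is only a minimal sketch, assuming the text has already been read into a str, so the leading .decode('utf-8') and trailing .encode('utf-8') steps from my Python 2 line are not needed. The helper name fix_cp1251_mojibake is my own invention for illustration, not something from a library.

```python
def fix_cp1251_mojibake(text):
    """Undo cp1251 bytes that were wrongly decoded as windows-1252.

    Hypothetical helper: re-encode the garbled text to recover the
    original bytes, then decode those bytes as windows-1251.
    """
    try:
        return text.encode("windows-1252").decode("windows-1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip is impossible, so this is not that kind of
        # mojibake (or not mojibake at all); return the input unchanged.
        return text


print(fix_cp1251_mojibake("Îêåàí Åëüçè - Ìàéæå âåñíà"))
# Океан Ельзи - Майже весна
```

Text that is already correct Cyrillic cannot be encoded as windows-1252, so it falls into the except branch and comes back untouched. Plain ASCII passes through unchanged as well, since both code pages agree on it. This handles only the one windows-1252/windows-1251 mix-up and does not detect other kinds of garbling.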
My question is: is there any way to detect such strings? Here are some other strings that I haven't even found a way to correct:
Isao Sasaki - ¨¬¡Æ¨¬¡ÆAI¨¬¡Æ (A Different Farewell) (¡¾¢¬Cy¨ù¡¾ AU¡Æi)
Yoon K. Lee & Salzburg Kammerp - ³»¸¶À½
⁂晉䤠圠牥潂⁹䬨牡慭牴湯捩删浥硩䴠楡⥮
Ã�Ã�óôåõá üôé ï ãÃ�ìïò Ã�ôáÃ
ìéá áðë� õðüèåóç.
I'd be very grateful for any replies.