I am trying to do transliteration: I need to replace every source character read from an English-language file with its equivalent in another language, using a Unicode dictionary I have defined in the source code. I can already read the file character by character. How do I look up each character's mapping in that dictionary and write the result to a new, transliterated output file? Thank you :).

  • Please give your questions more descriptive titles. A mod already had to fix your previous question: stackoverflow.com/questions/2257665/… Commented Feb 13, 2010 at 13:48
  • Sorry, will keep it in mind from next time. Thank you Sir. Commented Feb 13, 2010 at 13:52

2 Answers

The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).

All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:

  • each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).

  • the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).

Characters that aren't found as keys in the dict are passed on untouched from the input to the output.

Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).
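A minimal sketch of the dictionary layout described above, assuming Python 3 (where every str is a Unicode string, so the u'' prefix is optional). The three table entries and the helper name transliterate are illustrative assumptions, not a real transliteration scheme:

```python
# Hypothetical table: keys are integer code points, values are replacement
# strings (or None to delete the character). These entries are examples only.
table = {
    0x0904: "a",    # DEVANAGARI LETTER SHORT A -> 'a'
    0x0915: "ka",   # DEVANAGARI LETTER KA -> 'ka' (one character may map to several)
    0x094D: None,   # DEVANAGARI SIGN VIRAMA -> removed from the output
}


def transliterate(text, table):
    """Replace each character via the code-point table.

    Characters that are not keys in the table pass through unchanged.
    """
    return text.translate(table)


sample = "\u0904\u0915\u094d!"
print(transliterate(sample, table))  # prints: aka!
```

If you would rather write the keys as one-character strings, str.maketrans accepts such a dict and converts the keys to code points for you. Either way the lookup is strictly one input character at a time.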


7 Comments

How would this transliterate from pa to प? Can str.translate take multiple characters and map them to one character? I always thought it was strictly for one-to-one mappings?
@Mark, your thoughts are correct: translate (both on str and unicode objects) works one character at a time -- "character by character", as the question says. It's hard to understand exactly what the OP needs -- in a comment to your A he says "existing Corpora in hindi which has to be romanized" (confirming the character by character transliteration suits) but elsewhere in the Q he says exactly the reverse. Depending on the OP's exact needs, your answer or mine may be preferable.
Yes, the question is very unclear. From the comments: "which has to be romanized to its original form from its transliterated English version" (emphasis mine) I understood it as from pa to प. I find his use of "to" first and then "from" very confusing - I'd always write "from X to Y" not "to Y from X".
@Mark, yes, but no doubt being a non-native speaker of English has to do with that (being a non-native speaker of English myself, I sympathize with the OP on this subject;-).
Respected Alex and Mark Sir, I am sorry for sounding ambiguous with my question, the dictionary that I am using is specific to input file, and thus has to transliterate accordingly, In certain cases it might have to do character by character like for e.g. E-k- in hindi after reversing the transliteration and bringing it back to its original form would be ऐक otherwise it should take two or more characters at one time for e.g if my mapping in dic is :- 'A' : u'अ' , 'AA' : u'आ ' how do I make sure that I get आ in the op and not अअ twice for a word like AAs-aan-ii.

Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.

Something like this:

for syllable in input_text.split_into_syllables():
    output_file.write(d[syllable])

Here output_file is a file object open for writing, and d is a dictionary whose keys are your source characters (or syllables) and whose values are the output characters. split_into_syllables is a placeholder for whatever splitting logic you write; it is not a built-in string method. You can also read your file line by line instead of reading it all in at once.
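One possible way to fill in the splitting step is greedy longest-match against the dictionary keys, which also covers the 'A' versus 'AA' case raised in the comments on the other answer. This is only a sketch, assuming Python 3; the function names and the two-entry mapping are hypothetical, and the mapping must be non-empty:

```python
def transliterate(text, mapping):
    """Greedy longest-match replacement.

    At each position the longest key that matches is used, so 'AA' wins
    over 'A'. Text that matches no key is copied through unchanged.
    """
    max_len = max(len(key) for key in mapping)  # assumes a non-empty mapping
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            # No key starts here: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


def transliterate_lines(lines, mapping):
    """Yield the transliteration of each line, one line at a time."""
    for line in lines:
        yield transliterate(line, mapping)


# Hypothetical mapping, based on the two entries quoted in the comments.
mapping = {"A": "अ", "AA": "आ"}
print(transliterate("AAs-aan-ii", mapping))  # prints: आs-aan-ii
```

To process a file line by line, open both files with an explicit encoding, for example open("input.txt", encoding="utf-8") for reading and open("output.txt", "w", encoding="utf-8") for writing (both file names are placeholders), and pass the generator to output_file.writelines(...). Greedy matching is a design choice here: it cannot resolve mappings that depend on context, which the comments below suggest may matter for Hindi.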

4 Comments

Thank you for your support Sir :). I am trying to transliterate and not translate Sir, for e.g. 'पत्र' is a word in Hindi which means letter in English.. that's translation, but its transliteration is 'patra'
You're right Sir, that was one doubt I had in mind which was soon to be posted as a question. Do you suggest I use some other data type in Python to deal with this, or could you please guide me on how to deal with this issue.. Thanks a lot for your time :) and btw Sir you almost got it right.. प is pa.. the other letter is tra
@user272398: If Hindi is anywhere near as irregular as English, I think your task will be very difficult to solve. There might be multiple characters that have the same transliteration, so you might not have a unique mapping, and you may need to use context to determine which character to map to. Perhaps you could always search for the next vowel to break the words up rather than using characters, but that also might not always work. You've chosen an extremely difficult task for learning how to program.
I have kind of developed a mapping manually which is unique to deal with those multiple characters which have the same transliteration, I am doing this work on NLP as a part of my final year project Sir, I have existing Corpora in hindi which has to be romanized to its original form from its transliterated English version and I need to use this as an input file for further processing, The thing is I am not adept with Python, I kinda may be an intermediate programmer in C, C++ but since for scripts this is one of the best languages to use thus i am learning it side by side.:)
